|When the checkpoint request arrives, the process is notified of the request before the checkpoint is taken.|
|After a checkpoint has successfully completed, the same process that was checkpointed is notified that it is continuing execution.|
|After a checkpoint has successfully completed, a new / restarted process is notified of its successful restart.|
To use the Open PAL CRS components, a process must adhere to a few programmatic requirements.
First, the program must call OPAL_INIT early in its execution. This should only be called once, and it is not possible to checkpoint the process without it first having called this function.
The program must call OPAL_FINALIZE before termination. This does a significant amount of cleanup. If it is not called, then it is very likely that remnants are left in the filesystem.
To checkpoint and restart a process, you must use the Open PAL tools. Using the backend checkpointer's own checkpoint and restart tools will lead to undefined behavior. To checkpoint a process, use opal_checkpoint (opal_checkpoint(1)). To restart a process, use opal_restart (opal_restart(1)).
Open PAL ships with two CRS components: self and blcr.
The following MCA parameters apply to all components:
crs_base_verbose Set the verbosity level for all components. Default is 0, or silent except on error.
The self component invokes user-defined functions to save and restore checkpoints. It is simply a mechanism for user-defined functions to be invoked at Open PAL's Checkpoint, Continue, and Restart phases. Hence, the only data saved during the checkpoint is what the user's checkpoint function writes. No library state is saved at all.
As such, the model for the self component is slightly different from that of other components. Specifically, the Restart function is not invoked in the same process image as the process that was checkpointed. The Restart phase is invoked during OPAL_INIT of the new instance of the application (i.e., it starts over from main()).
The self component has the following MCA parameters:
crs_self_prefix Specify a string prefix for the names of the checkpoint, continue, and restart functions that Open PAL will invoke during the respective stages. That is, specifying "-mca crs_self_prefix foo" means that Open PAL expects to find three functions at run-time:
By default, the prefix is set to "opal_crs_self_user".
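As a minimal sketch of what user callbacks for the prefix "foo" might look like: the names below assume a <prefix>_checkpoint, <prefix>_continue, <prefix>_restart naming pattern, and the int (void) signature, the 0-on-success return convention, the file name foo_state.txt, and the iteration variable are all illustrative assumptions rather than details taken from this page. The point it shows is that only what the checkpoint function writes is available again at restart.

```c
#include <assert.h>
#include <stdio.h>

/* The only state this hypothetical application chooses to save. */
static int iteration = 0;
/* Hypothetical location of the user-managed checkpoint data. */
static const char *state_file = "foo_state.txt";

/* Assumed name and signature: invoked during the Checkpoint phase. */
int foo_checkpoint(void)
{
    FILE *fp;

    assert(iteration >= 0); /* invariant of this example: counter is never negative */
    fp = fopen(state_file, "w");
    if (fp == NULL) {
        return -1;
    }
    if (fprintf(fp, "%d\n", iteration) < 0) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}

/* Assumed name and signature: invoked during the Continue phase.
 * The process image is unchanged, so there is nothing to reload. */
int foo_continue(void)
{
    return 0;
}

/* Assumed name and signature: invoked during the Restart phase.
 * The process started over from main(), so state is read back from disk. */
int foo_restart(void)
{
    FILE *fp = fopen(state_file, "r");
    int ok;

    if (fp == NULL) {
        return -1;
    }
    ok = (fscanf(fp, "%d", &iteration) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

Anything not written by foo_checkpoint in this sketch (open files, heap contents, library state) would have to be rebuilt by the application itself after restart.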
crs_self_priority Set the self component's default priority.
crs_self_verbose Set the verbosity level. Default is 0, or silent except on error.
crs_self_do_restart This is mostly used internally; a general user should never need to set this value. It is set to a non-zero value when the new process should invoke the restart callback in OPAL_INIT. Default is 0, or normal execution.
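The role of crs_self_do_restart can be pictured with a small stand-alone model. This is only a sketch: model_init is a hypothetical stand-in and not the real OPAL_INIT, and foo_restart and restart_calls are illustrative names. It shows the restart callback being invoked during initialization only when the flag is non-zero.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often the hypothetical restart callback has run. */
int restart_calls = 0;

/* Hypothetical user restart callback. */
int foo_restart(void)
{
    restart_calls++;
    return 0;
}

/* Stand-in for the initialization step (NOT the real OPAL_INIT).
 * do_restart plays the role of the crs_self_do_restart parameter. */
int model_init(int do_restart, int (*restart_cb)(void))
{
    /* A restart request without a callback is a caller error in this model. */
    assert(do_restart == 0 || restart_cb != NULL);

    if (do_restart != 0) {
        /* New process instance: run the Restart phase during init. */
        return restart_cb();
    }
    /* Normal execution: no callback is invoked. */
    return 0;
}
```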
The Berkeley Lab Checkpoint/Restart (BLCR) single-process checkpoint is a software system developed at Lawrence Berkeley National Laboratory. See the project website for more details:
The blcr component has the following MCA parameters:
crs_blcr_priority Set the blcr component's default priority.
crs_blcr_verbose Set the verbosity level. Default is 0, or silent except on error.
The none component simply selects no CRS component. All of the CRS function calls return immediately with OPAL_SUCCESS.
This component is the last component to be selected by default. This means that if another component is available, and the none component was not explicitly requested, then OPAL will attempt to activate all of the available components before falling back to this component.
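The fallback behavior described above can be sketched as a priority-based selection. This is an illustrative model only: the struct, the select_crs function, and the idea of representing "last to be selected" as the lowest priority value are assumptions made for the example, not Open PAL's actual selection code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a CRS component for this sketch. */
struct crs_component {
    const char *name;
    int priority;   /* illustrative value; higher wins */
    int available;  /* non-zero if the component can be activated */
};

/* Pick a component: an explicit request wins if that component is
 * available; otherwise the highest-priority available component is
 * chosen, so a lowest-priority "none" entry is only a fallback.
 * Returns NULL if nothing suitable is found. */
const struct crs_component *select_crs(const struct crs_component *comps,
                                       size_t n, const char *requested)
{
    const struct crs_component *best = NULL;
    size_t i;

    assert(comps != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!comps[i].available) {
            continue;
        }
        if (requested != NULL) {
            if (strcmp(comps[i].name, requested) == 0) {
                return &comps[i];
            }
            continue;
        }
        if (best == NULL || comps[i].priority > best->priority) {
            best = &comps[i];
        }
    }
    return best;
}
```

With no explicit request, the none entry in this model is returned only when every other component is unavailable.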
|1.10.2||OPAL_CRS (7)||Jan 21, 2016|