o A configurable recovery policy.
o A configurable time interval for health check operations.
o A notification via signal before recovery action is taken.
o A mechanism to indicate to the application the number of times an active process has been created by the SAM server.
o Both application driven health checking and event driven health checking.
The SAM library is initialized by sam_initialize(3). sam_initialize(3) may only be called once per process. Calling it more than once has undefined results and is neither recommended nor tested.
A user-configurable signal (default SIGTERM) is sent to the application when a recovery action is planned. The application can use signal(3) to install a handler for this signal.
There are no special constraints on which SAM APIs may be called in a warning callback. After time_interval expires, a SIGKILL signal is sent to the active process to force its termination.
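The warning signal can be caught with ordinary POSIX signal handling. The following is a minimal sketch rather than part of SAM: the names install_warn_handler, on_warn_signal and recovery_is_pending are hypothetical, and sigaction(2) is used in place of signal(3) because its semantics are more portable. The handler only sets a flag, which is a conservative choice and not a SAM requirement.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose sigaction(2) under strict -std= modes */
#endif

#include <signal.h>
#include <string.h>

/* Set by the handler when the warning signal arrives. */
static volatile sig_atomic_t recovery_pending = 0;

/* Hypothetical handler: records that a recovery action is planned. */
static void on_warn_signal(int signo)
{
    (void)signo;
    recovery_pending = 1;
}

/* Install the handler for the warning signal (SIGTERM unless the
 * application configured another one). Returns 0 on success and
 * -1 on failure, exactly as sigaction(2) does. */
int install_warn_handler(int warn_signal)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_warn_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(warn_signal, &sa, NULL);
}

/* Polled by the application's main loop so that it can save state
 * before SIGKILL is delivered. */
int recovery_is_pending(void)
{
    return recovery_pending != 0;
}
```

An application would call install_warn_handler(SIGTERM) during startup and poll recovery_is_pending() from its main loop, finishing any cleanup before time_interval expires.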
The active process is registered with SAM by calling sam_register(3). This function should only be called once per process. After a recovery action is taken, the new active process begins execution at the line of code following the call to sam_register(3) in the user process.
Two types of healthchecking are available to the user. In the first model, the user application sends healthchecks during its normal operation by calling sam_hc_send(3). It is never requested to healthcheck, and if the active process does not respond within the time interval, the process is restarted.
A more useful mechanism for healthchecking is event driven healthchecking. Because this model is directed by the SAM server, it is not necessary to guess or add timers to the active process to signal that a healthcheck operation is successful. To use event driven healthchecking, the sam_hc_callback_register(3) function should be called.
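As an illustration of the event driven model, the sketch below shows what an application-side healthcheck callback might look like. It assumes the callback type is int (*)(void) and that a return value of 0 means healthy; both should be checked against sam_hc_callback_register(3). The names app_hc_callback and app_set_health are hypothetical, and an atomic flag is used on the assumption that the callback may run outside the application's main loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical application state: the worker loop clears this flag
 * when it detects a condition it cannot recover from by itself. */
static atomic_bool app_healthy = true;

/* Called by the application's own code to publish its health. */
void app_set_health(bool ok)
{
    atomic_store(&app_healthy, ok);
}

/* Healthcheck callback. Assumed contract: return 0 when the process
 * is healthy and non-zero otherwise.
 *
 * In a real program this function would be passed to
 * sam_hc_callback_register(3) after sam_register(3); that call is
 * left out so the sketch builds without the SAM library. */
int app_hc_callback(void)
{
    return atomic_load(&app_healthy) ? 0 : 1;
}
```

With this shape the application never needs its own timer: it only updates the flag, and the SAM server decides when the callback is invoked.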
SAM has special policies (SAM_RECOVERY_POLICY_QUIT and SAM_RECOVERY_POLICY_RESTART) for integration with the quorum service. These policies change SAM behaviour in two respects.
o A call to sam_start(3) blocks until corosync becomes quorate.
o The user-selected recovery action is taken immediately after loss of quorum.
Sometimes data needs to be stored so that it survives between instances of the process. Files or databases can be used for this, but a much simpler in-memory solution is provided by the sam_data_store(3), sam_data_restore(3) and sam_data_getsize(3) functions.
SAM has a policy flag used for confdb system integration (SAM_RECOVERY_POLICY_CONFDB). If a process is registered with this flag, a new confdb object PROCESS_NAME:PID is created with the following keys:
o recovery - quit or restart, depending on the policy
o poll_period - period of health checking in milliseconds
o last_updated - timestamp (in nanoseconds) of the last health check
o state - state of the process (one of registered, started, failed, waiting for quorum)
The object is automatically deleted if the process exits while health checking is stopped.
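Note that poll_period and last_updated use different units (milliseconds and nanoseconds). The sketch below shows how a monitoring tool might compare them; it is not how the corosync watchdog is implemented. The helper hc_is_stale is hypothetical, and it assumes the caller has already read both keys and obtained the current time from the same clock as last_updated.

```c
#include <stdbool.h>
#include <stdint.h>

#define NS_PER_MS UINT64_C(1000000)

/* Returns true when more than one poll period has elapsed since the
 * last recorded health check.
 *
 * now_ns          current time in nanoseconds (same clock as below)
 * last_updated_ns value of the last_updated key, in nanoseconds
 * poll_period_ms  value of the poll_period key, in milliseconds */
bool hc_is_stale(uint64_t now_ns, uint64_t last_updated_ns,
                 uint64_t poll_period_ms)
{
    /* A timestamp in the future cannot be judged; treat as fresh. */
    if (now_ns < last_updated_ns) {
        return false;
    }

    /* Guard the unit conversion against 64-bit overflow. */
    if (poll_period_ms > UINT64_MAX / NS_PER_MS) {
        return false;
    }

    return (now_ns - last_updated_ns) > poll_period_ms * NS_PER_MS;
}
```

For example, with a poll_period of 500 and a last_updated value one second older than the current time, hc_is_stale returns true.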
Confdb integration with the corosync watchdog can be used in an implicit or an explicit way.
The implicit way is achieved by setting the recovery policy to QUIT and letting the process exit while health checking is started. If this happens, the object is not deleted and the corosync watchdog takes the required action.
The explicit way is useful when the developer can deal with some non-fatal failure of the application. This mode is achieved by setting the policy to RESTART and using SAM in the same way as without confdb integration. If a real failure is needed (for example, too many restarts in total or per second), it is possible to call sam_mark_failed(3) and let the corosync watchdog take the required action.
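The "too many restarts" decision in the explicit mode can be sketched as a small helper. This is an illustration only: restart_budget_exhausted and max_restarts are hypothetical names, and the code assumes that the instance id reported by sam_register(3) starts at 1 for the first instance and grows by one on each restart, which should be confirmed in sam_register(3).

```c
#include <stdbool.h>

/* Decide whether the process has been restarted more often than the
 * application is willing to tolerate.
 *
 * instance_id  value reported by sam_register(3); assumed to be 1 for
 *              the first instance and incremented on every restart
 * max_restarts hypothetical application limit on restarts
 *
 * When this returns true, a real program would call
 * sam_mark_failed(3) and let the corosync watchdog act. */
bool restart_budget_exhausted(unsigned int instance_id,
                              unsigned int max_restarts)
{
    /* Convert the instance id into a restart count without
     * underflowing if an id of 0 is ever seen. */
    unsigned int restarts = (instance_id > 0) ? instance_id - 1 : 0;

    return restarts > max_restarts;
}
```

A per-second limit would additionally need the time of previous restarts, which could be kept across instances with sam_data_store(3) and sam_data_restore(3).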
sam_initialize(3), sam_data_getsize(3), sam_data_restore(3), sam_data_store(3), sam_finalize(3), sam_mark_failed(3), sam_start(3), sam_stop(3), sam_register(3), sam_warn_signal_set(3), sam_hc_send(3), sam_hc_callback_register(3)
corosync Man Page    SAM_OVERVIEW (8)    21/05/2010