10.5. Failover validation


From ltcluster 5 onwards, a script can be provided to ltclusterd which, in a failover situation, is executed by the promotion candidate (the node selected to become the new primary) to confirm whether that node should actually be promoted.

To use this, set failover_validation_command in ltcluster.conf to a script executable by the lightdb system user, e.g.:

      failover_validation_command=/path/to/script.sh %n

The %n parameter will be replaced with the node ID when the script is executed. A number of other parameters are also available; see section "Optional configuration for automatic failover" for details.

This script must return an exit code of 0 to indicate the node should promote itself. Any other value will result in the promotion being aborted and the election rerun. There is a pause of election_rerun_interval seconds before the election is rerun.
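
As an illustration of the exit-code contract described above, the following is a minimal sketch of a validation script, not a script shipped with ltcluster. The function name validate_promotion, the NO_PROMOTE_FILE variable and the marker file path are all assumptions made for this example: the script rejects promotion whenever an operator has created the marker file, and approves it otherwise.

```shell
#!/bin/sh
# Hypothetical failover validation script (a sketch, not part of ltcluster).
# Expects the node ID as its first argument, i.e. "%n" in
# failover_validation_command.

# Assumed marker file: if it exists, promotion is rejected.
# The path is an arbitrary choice for this example.
NO_PROMOTE_FILE="${NO_PROMOTE_FILE:-/tmp/ltcluster-no-promote}"

validate_promotion() {
    node_id="$1"

    # Anything printed here appears in the ltclusterd log as the
    # output returned by the validation command.
    echo "Node ID: ${node_id}"

    if [ -e "$NO_PROMOTE_FILE" ]; then
        echo "marker file ${NO_PROMOTE_FILE} exists, rejecting promotion"
        return 1    # non-zero: promotion is aborted, election is rerun
    fi

    return 0        # zero: the node should promote itself
}

# Run the check only when a node ID was passed on the command line.
# With no arguments nothing is checked and the exit status is 0,
# so keep "%n" in failover_validation_command.
if [ "$#" -gt 0 ]; then
    validate_promotion "$1"
    exit $?
fi
```

Saved under a path of your choosing and made executable by the lightdb system user, such a script would be referenced from failover_validation_command in the same way as /path/to/script.sh in the example above. A real check would more likely test something site-specific, such as network reachability or the state of an external witness.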

Sample ltclusterd log output from a failover in which the validation script rejects the proposed promotion candidate:

[2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
[2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
[2019-03-13 21:01:30] [INFO] output returned by failover validation command:
Node ID: 2

[2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
[2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
[2019-03-13 21:01:30] [INFO] 1 followers to notify
[2019-03-13 21:01:30] [NOTICE] notifying node "node3" (ID: 3) to rerun promotion candidate selection
INFO:  node 3 received notification to rerun promotion candidate election
[2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")
