In normal operation, ltclusterd monitors the state of the LightDB node it is running on, and will take appropriate action if problems are detected, e.g. (if so configured) promote the node to primary, if the existing primary has been determined as failed.
However, ltclusterd is unable to distinguish between planned outages (such as performing a switchover or installing LightDB maintenance releases) and an actual server outage. In versions prior to ltcluster 4.2 it was necessary to stop ltclusterd on all nodes (or at least on all nodes where ltclusterd is configured for automatic failover) to prevent ltclusterd from making unintentional changes to the replication cluster.
For major LightDB upgrades, ltclusterd should be shut down completely and only started up once the ltcluster packages for the new LightDB major version have been installed.
In order to be able to pause/unpause ltclusterd, the following prerequisites must be met:
All nodes must be accessible from the node where the pause/unpause operation is executed, using the conninfo string shown by ltcluster cluster show.
These conditions are required for normal ltcluster operation in any case.
To pause ltclusterd, execute ltcluster service pause (ltcluster 4.2 - 4.4: ltcluster daemon pause), e.g.:
    $ ltcluster -f /etc/ltcluster.conf service pause
    NOTICE: node 1 (node1) paused
    NOTICE: node 2 (node2) paused
    NOTICE: node 3 (node3) paused
The state of ltclusterd on each node can be checked with ltcluster service status (ltcluster 4.2 - 4.4: ltcluster daemon status), e.g.:
    $ ltcluster -f /etc/ltcluster.conf service status
     ID | Name  | Role    | Status  | ltclusterd | PID  | Paused?
    ----+-------+---------+---------+------------+------+---------
     1  | node1 | primary | running | running    | 7851 | yes
     2  | node2 | standby | running | running    | 7889 | yes
     3  | node3 | standby | running | running    | 7918 | yes
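Before starting maintenance it can be useful to confirm from a script that every node really is paused. The following is a minimal sketch, not part of ltcluster: the function name all_nodes_paused is hypothetical, and it assumes the status output has the column layout shown above, with the node ID in the first column and the Paused? value in the last.

```shell
# Hypothetical helper: reads "ltcluster service status" output on stdin and
# succeeds only if at least one node row exists and every row ends in "yes".
all_nodes_paused() {
    awk -F'|' '
        # Data rows are those whose first column is a numeric node ID;
        # the header and separator rows do not match.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            rows++
            paused = $NF
            gsub(/[ \t\r]/, "", paused)
            if (paused != "yes") bad++
        }
        END { exit !(rows > 0 && bad == 0) }
    '
}
```

It could be used as: ltcluster -f /etc/ltcluster.conf service status | all_nodes_paused && echo "all nodes paused"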
If executing a switchover with ltcluster standby switchover, ltcluster will automatically pause/unpause the ltclusterd service as part of the switchover process.
If the primary (in this example, node1) is stopped, ltclusterd running on one of the standbys (here: node2) will react like this:
    [2019-08-28 12:22:21] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
    [2019-08-28 12:22:21] [INFO] checking state of node 1, 1 of 5 attempts
    [2019-08-28 12:22:21] [INFO] sleeping 1 seconds until next reconnection attempt
    ...
    [2019-08-28 12:22:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2019-08-28 12:22:25] [INFO] checking state of node 1, 5 of 5 attempts
    [2019-08-28 12:22:25] [WARNING] unable to reconnect to node 1 after 5 attempts
    [2019-08-28 12:22:25] [NOTICE] node is paused
    [2019-08-28 12:22:33] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state
    [2019-08-28 12:22:33] [DETAIL] ltclusterd paused by administrator
    [2019-08-28 12:22:33] [HINT] execute "ltcluster service unpause" to resume normal failover mode
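A monitoring script may want to flag a standby that is paused while its upstream is unreachable. The sketch below is one possible way to do that; the function name paused_and_degraded is hypothetical, and the matched message texts are taken only from the sample log above, so other ltcluster versions may word them differently.

```shell
# Hypothetical check: reads ltclusterd log lines on stdin and succeeds if they
# contain both the "degraded state" INFO message and the "paused by
# administrator" DETAIL message seen in the sample log.
paused_and_degraded() {
    awk '
        /\[INFO\].*in degraded state/                   { degraded = 1 }
        /\[DETAIL\] ltclusterd paused by administrator/ { paused = 1 }
        END { exit !(degraded && paused) }
    '
}
```

For example, the last lines of the ltclusterd log file (its location depends on the local configuration) could be piped into paused_and_degraded from a periodic job.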
If the primary becomes available again (e.g. following a software upgrade), ltclusterd will automatically reconnect, e.g.:
[2019-08-28 12:25:41] [NOTICE] reconnected to upstream node "node1" (ID: 1) after 8 seconds, resuming monitoring
To unpause the ltclusterd service, execute ltcluster service unpause (ltcluster 4.2 - 4.4: ltcluster daemon unpause), e.g.:
    $ ltcluster -f /etc/ltcluster.conf service unpause
    NOTICE: node 1 (node1) unpaused
    NOTICE: node 2 (node2) unpaused
    NOTICE: node 3 (node3) unpaused
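The pause and unpause commands can be wrapped around a maintenance step so that ltclusterd is never left unpaused by accident during the work. This is only a sketch: the function with_ltclusterd_paused and the variables LTCLUSTER_BIN and LTCLUSTER_CONF are hypothetical names, and leaving the cluster paused when the maintenance command fails is a design assumption, chosen so that an administrator can inspect the cluster before failover is re-enabled.

```shell
# Assumed defaults; adjust to the local installation.
LTCLUSTER_BIN="${LTCLUSTER_BIN:-ltcluster}"
LTCLUSTER_CONF="${LTCLUSTER_CONF:-/etc/ltcluster.conf}"

# Hypothetical wrapper: pause ltclusterd on all nodes, run the given
# maintenance command, and unpause only if that command succeeded.
with_ltclusterd_paused() {
    "$LTCLUSTER_BIN" -f "$LTCLUSTER_CONF" service pause || return 1
    if "$@"; then
        "$LTCLUSTER_BIN" -f "$LTCLUSTER_CONF" service unpause
    else
        rc=$?
        echo "maintenance command failed (exit $rc); ltclusterd left paused" >&2
        return "$rc"
    fi
}
```

It would be called with the maintenance command as its arguments, for example with_ltclusterd_paused ./apply_update.sh, where apply_update.sh stands for any site-specific script.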
If the previous primary is no longer accessible when ltclusterd is unpaused, no failover action will be taken. Instead, a new primary must be manually promoted using ltcluster standby promote, and any remaining standbys attached to the new primary with ltcluster standby follow.
This prevents ltcluster service unpause from triggering the automatic promotion of a new primary, which may be a problem particularly in larger clusters, where ltclusterd could select a different promotion candidate from the one intended by the administrator.
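The manual recovery described above could be scripted roughly as follows. This is a sketch under assumptions: the function names and the LTCLUSTER_BIN and LTCLUSTER_CONF variables are hypothetical, the configuration file path is the one used in the earlier examples, and any additional options the two commands may need in a particular setup are not shown.

```shell
# Assumed defaults; adjust to the local installation.
LTCLUSTER_BIN="${LTCLUSTER_BIN:-ltcluster}"
LTCLUSTER_CONF="${LTCLUSTER_CONF:-/etc/ltcluster.conf}"

# Run on the standby chosen by the administrator as the new primary.
promote_this_standby() {
    "$LTCLUSTER_BIN" -f "$LTCLUSTER_CONF" standby promote
}

# Run on each remaining standby once the new primary is available.
follow_new_primary() {
    "$LTCLUSTER_BIN" -f "$LTCLUSTER_CONF" standby follow
}
```

Keeping the two steps as separate functions reflects that they run on different nodes: the promotion once, on the chosen candidate, and the follow on every other standby.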
The pause state of each node is preserved across a LightDB restart.
ltcluster service pause and ltcluster service unpause can be executed even if ltclusterd is not running; in this case, ltclusterd will start up in whichever pause state has been set.
ltcluster service pause and ltcluster service unpause do not start/stop ltclusterd.
The commands ltcluster daemon start and ltcluster daemon stop (if correctly configured) can be used to start/stop ltclusterd on individual nodes.