To demonstrate automatic failover, set up a 3-node replication cluster (one primary and two standbys streaming directly from the primary) so that the cluster looks something like this:
$ ltcluster -f /etc/ltcluster.conf cluster show --compact
 ID | Name  | Role    | Status    | Upstream | Location | Prio.
----+-------+---------+-----------+----------+----------+-------
 1  | node1 | primary | * running |          | default  | 100
 2  | node2 | standby |   running | node1    | default  | 100
 3  | node3 | standby |   running | node1    | default  | 100
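The starting topology can also be checked mechanically. The following is a minimal Python sketch, not part of ltcluster: it parses text in the layout printed by `cluster show --compact` (pasted here as a string) and checks that there is exactly one running primary with every standby attached to it. The function names `parse_compact` and `check_topology` are hypothetical, and the parser assumes the pipe-separated column layout shown above.

```python
SAMPLE = """\
 ID | Name  | Role    | Status    | Upstream | Location | Prio.
----+-------+---------+-----------+----------+----------+-------
 1  | node1 | primary | * running |          | default  | 100
 2  | node2 | standby |   running | node1    | default  | 100
 3  | node3 | standby |   running | node1    | default  | 100
"""


def parse_compact(text):
    """Turn the pipe-separated table into a list of dicts keyed by column header."""
    # The separator row uses "+" rather than "|", so this filter drops it.
    lines = [ln for ln in text.splitlines() if "|" in ln]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [cell.strip() for cell in ln.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def check_topology(rows):
    """Return True if one primary is running and every standby streams from it."""
    primaries = [r for r in rows
                 if r["Role"] == "primary" and r["Status"] == "* running"]
    if len(primaries) != 1:
        return False
    primary_name = primaries[0]["Name"]
    standbys = [r for r in rows if r["Role"] == "standby"]
    return all(r["Upstream"] == primary_name and r["Status"] == "running"
               for r in standbys)


rows = parse_compact(SAMPLE)
print(check_topology(rows))
```

The check is deliberately strict: any status other than the two shown in the sample output is treated as a reason to stop before starting the failover test.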
See the section "Required configuration for automatic failover" for an example of minimal ltcluster.conf settings suitable for use with ltclusterd.
Start ltclusterd on each standby and verify that it's running by examining the log output, which at log level INFO will look like this:
[2019-08-15 07:14:42] [NOTICE] ltclusterd (ltclusterd 5.0) starting up
[2019-08-15 07:14:42] [INFO] connecting to database "host=node2 dbname=ltcluster user=ltcluster connect_timeout=2"
INFO: set_repmgrd_pid(): provided pidfile is /var/run/ltcluster/ltclusterd-12.pid
[2019-08-15 07:14:42] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2019-08-15 07:14:42] [INFO] monitoring connection to upstream node "node1" (ID: 1)
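Checking the log by eye is enough for a demonstration, but the same check can be scripted. The following Python sketch assumes the log line format seen in the sample above (a bracketed timestamp, a bracketed level, then the message) and looks for the "starting monitoring of node" notice. The names `parse_log` and `is_monitoring` are hypothetical helpers, not part of ltcluster.

```python
import re

# Matches lines such as: [2019-08-15 07:14:42] [NOTICE] some message
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] (?P<msg>.*)$")

SAMPLE_LOG = """\
[2019-08-15 07:14:42] [NOTICE] ltclusterd (ltclusterd 5.0) starting up
[2019-08-15 07:14:42] [INFO] connecting to database "host=node2 dbname=ltcluster user=ltcluster connect_timeout=2"
INFO: set_repmgrd_pid(): provided pidfile is /var/run/ltcluster/ltclusterd-12.pid
[2019-08-15 07:14:42] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2019-08-15 07:14:42] [INFO] monitoring connection to upstream node "node1" (ID: 1)
"""


def parse_log(text):
    """Return (timestamp, level, message) tuples for lines in the bracketed format."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:  # lines without a bracketed timestamp are skipped
            entries.append((match.group("ts"), match.group("level"), match.group("msg")))
    return entries


def is_monitoring(entries):
    """True if the log contains the NOTICE that monitoring of a node has started."""
    return any(level == "NOTICE" and msg.startswith("starting monitoring of node")
               for _, level, msg in entries)


entries = parse_log(SAMPLE_LOG)
print(is_monitoring(entries))
```

In practice the text would be read from the ltclusterd log file on each standby rather than from an embedded string.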
Each ltclusterd should also have recorded its successful startup as an event:
$ ltcluster -f /etc/ltcluster.conf cluster event --event=ltclusterd_start
 Node ID | Name  | Event            | OK | Timestamp           | Details
---------+-------+------------------+----+---------------------+--------------------------------------------------------
 3       | node3 | ltclusterd_start | t  | 2019-08-15 07:14:42 | monitoring connection to upstream node "node1" (ID: 1)
 2       | node2 | ltclusterd_start | t  | 2019-08-15 07:14:41 | monitoring connection to upstream node "node1" (ID: 1)
 1       | node1 | ltclusterd_start | t  | 2019-08-15 07:14:39 | monitoring cluster primary "node1" (ID: 1)
Now stop the current primary server with e.g.:
lt_ctl -D /var/lib/lightdb/data -m immediate stop
This forces the primary to shut down immediately, aborting all processes and transactions. The shutdown causes a flurry of activity in the ltclusterd log files as each ltclusterd detects the failure of the primary and a failover decision is made. The following is an extract from the log of a standby server (node2) which was promoted to the new primary after the failure of the original primary (node1):
[2019-08-15 07:27:50] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2019-08-15 07:27:50] [INFO] checking state of node 1, 1 of 3 attempts
[2019-08-15 07:27:50] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-15 07:27:55] [INFO] checking state of node 1, 2 of 3 attempts
[2019-08-15 07:27:55] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-15 07:28:00] [INFO] checking state of node 1, 3 of 3 attempts
[2019-08-15 07:28:00] [WARNING] unable to reconnect to node 1 after 3 attempts
[2019-08-15 07:28:00] [INFO] primary and this node have the same location ("default")
[2019-08-15 07:28:00] [INFO] local node's last receive lsn: 0/900CBF8
[2019-08-15 07:28:00] [INFO] node 3 last saw primary node 12 second(s) ago
[2019-08-15 07:28:00] [INFO] last receive LSN for sibling node "node3" (ID: 3) is: 0/900CBF8
[2019-08-15 07:28:00] [INFO] node "node3" (ID: 3) has same LSN as current candidate "node2" (ID: 2)
[2019-08-15 07:28:00] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-08-15 07:28:00] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-08-15 07:28:00] [NOTICE] this node is the winner, will now promote itself and inform other nodes
[2019-08-15 07:28:00] [INFO] promote_command is: "/usr/pgsql-12/bin/ltcluster -f /etc/ltcluster/12/ltcluster.conf standby promote"
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using "/usr/pgsql-12/bin/lt_ctl -w -D '/var/lib/pgsql/12/data' promote"
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
[2019-08-15 07:28:01] [INFO] 3 followers to notify
[2019-08-15 07:28:01] [NOTICE] notifying node "node3" (ID: 3) to follow node 2
INFO: node 3 received notification to follow node 2
[2019-08-15 07:28:01] [INFO] switching to primary monitoring mode
[2019-08-15 07:28:01] [NOTICE] monitoring cluster primary "node2" (ID: 2)
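The timestamps in this extract show how the retry settings translate into failover delay: the first check happens as soon as the connection is lost (07:27:50), each further check follows a 5-second sleep, and the third and final check is at 07:28:00. The arithmetic can be sketched in Python as follows. The parameter names `attempts` and `interval_seconds` are illustrative only and are not claimed to match the names of the corresponding ltcluster.conf settings.

```python
from datetime import datetime, timedelta


def failover_decision_time(first_failure, attempts, interval_seconds):
    """Estimate when the last reconnection attempt is made.

    The first attempt is made immediately, so only (attempts - 1) sleep
    intervals elapse before the node gives up on the primary.
    """
    return first_failure + timedelta(seconds=(attempts - 1) * interval_seconds)


# Values taken from the log extract: 3 attempts, 5 seconds between attempts.
first_failure = datetime(2019, 8, 15, 7, 27, 50)
decision = failover_decision_time(first_failure, attempts=3, interval_seconds=5)
print(decision)  # 2019-08-15 07:28:00
```

This ignores the time each connection attempt itself takes, which is negligible in the extract but could add to the delay if attempts run into a connection timeout.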
The cluster status will now look like this, with the original primary (node1) marked as inactive, and standby node3 now following the new primary (node2):
$ ltcluster -f /etc/ltcluster.conf cluster show --compact
 ID | Name  | Role    | Status    | Upstream | Location | Prio.
----+-------+---------+-----------+----------+----------+-------
 1  | node1 | primary | - failed  |          | default  | 100
 2  | node2 | primary | * running |          | default  | 100
 3  | node3 | standby |   running | node2    | default  | 100
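When scripting this demonstration, it can be convenient to wait until a new primary appears instead of re-running the command by hand. The following Python sketch polls a caller-supplied function that returns `cluster show --compact` text and reports the first running primary that differs from the old one. `find_running_primary` and `wait_for_new_primary` are hypothetical helpers, and the parsing assumes the column order shown above.

```python
import time

AFTER_FAILOVER = """\
 ID | Name  | Role    | Status    | Upstream | Location | Prio.
----+-------+---------+-----------+----------+----------+-------
 1  | node1 | primary | - failed  |          | default  | 100
 2  | node2 | primary | * running |          | default  | 100
 3  | node3 | standby |   running | node2    | default  | 100
"""


def find_running_primary(show_output):
    """Return the name of the node with role 'primary' and status '* running'."""
    for line in show_output.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Columns assumed: ID, Name, Role, Status, ...
        if len(cells) >= 4 and cells[2] == "primary" and cells[3] == "* running":
            return cells[1]
    return None


def wait_for_new_primary(get_show_output, old_primary, timeout=60.0,
                         poll_interval=1.0, sleep=time.sleep):
    """Poll until a running primary other than old_primary appears, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        name = find_running_primary(get_show_output())
        if name is not None and name != old_primary:
            return name
        if time.monotonic() >= deadline:
            return None
        sleep(poll_interval)


# Canned output stands in for a real call to the ltcluster command.
print(wait_for_new_primary(lambda: AFTER_FAILOVER, old_primary="node1"))  # node2
```

On a real node, `get_show_output` could wrap `subprocess.run(["ltcluster", "-f", "/etc/ltcluster.conf", "cluster", "show", "--compact"], capture_output=True, text=True).stdout`, using the same command line as in the example above.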
ltcluster cluster event will display a summary of what happened to each server during the failover:
$ ltcluster -f /etc/ltcluster.conf cluster event
 Node ID | Name  | Event                       | OK | Timestamp           | Details
---------+-------+-----------------------------+----+---------------------+-------------------------------------------------------------
 3       | node3 | ltclusterd_failover_follow  | t  | 2019-08-15 07:38:03 | node 3 now following new upstream node 2
 3       | node3 | standby_follow              | t  | 2019-08-15 07:38:02 | standby attached to upstream node "node2" (ID: 2)
 2       | node2 | ltclusterd_reload           | t  | 2019-08-15 07:38:01 | monitoring cluster primary "node2" (ID: 2)
 2       | node2 | ltclusterd_failover_promote | t  | 2019-08-15 07:38:01 | node 2 promoted to primary; old primary 1 marked as failed
 2       | node2 | standby_promote             | t  | 2019-08-15 07:38:01 | server "node2" (ID: 2) was successfully promoted to primary
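The event list is easier to follow as a timeline when read from the bottom up. The following Python sketch parses event text in the layout shown above and returns the events oldest first. It assumes, based on the timestamps in the sample, that the command prints the newest event first; `parse_events` and `timeline` are hypothetical helper names.

```python
SAMPLE_EVENTS = """\
 Node ID | Name  | Event                       | OK | Timestamp           | Details
---------+-------+-----------------------------+----+---------------------+------------
 3       | node3 | ltclusterd_failover_follow  | t  | 2019-08-15 07:38:03 | node 3 now following new upstream node 2
 3       | node3 | standby_follow              | t  | 2019-08-15 07:38:02 | standby attached to upstream node "node2" (ID: 2)
 2       | node2 | ltclusterd_reload           | t  | 2019-08-15 07:38:01 | monitoring cluster primary "node2" (ID: 2)
 2       | node2 | ltclusterd_failover_promote | t  | 2019-08-15 07:38:01 | node 2 promoted to primary; old primary 1 marked as failed
 2       | node2 | standby_promote             | t  | 2019-08-15 07:38:01 | server "node2" (ID: 2) was successfully promoted to primary
"""


def parse_events(text):
    """Parse event rows into dicts; header and separator rows are skipped."""
    rows = []
    for line in text.splitlines():
        # Split at most 5 times so a "|" inside Details would be preserved.
        cells = [c.strip() for c in line.split("|", 5)]
        if len(cells) == 6 and cells[0].isdigit():
            rows.append({
                "node": cells[1],
                "event": cells[2],
                "ok": cells[3] == "t",
                "timestamp": cells[4],
                "details": cells[5],
            })
    return rows


def timeline(rows):
    """Return (timestamp, node, event) tuples oldest first (input assumed newest first)."""
    return [(r["timestamp"], r["node"], r["event"]) for r in reversed(rows)]


for ts, node, event in timeline(parse_events(SAMPLE_EVENTS)):
    print(ts, node, event)
```

Read in this order, the sample shows node2 being promoted and switching to primary monitoring before node3 attaches to it as its new upstream.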