As mentioned in the previous section, the success of the switchover operation depends on ltcluster being able to shut down the current primary server quickly and cleanly.
Ensure that the promotion candidate has sufficient free walsenders available
(LightDB configuration item max_wal_senders), and, if replication
slots are in use, that at least one free slot is available for the demotion candidate
(LightDB configuration item max_replication_slots).
Ensure that a passwordless SSH connection is possible from the promotion candidate
(standby) to the demotion candidate (current primary). If --siblings-follow
will be used, ensure that passwordless SSH connections are possible from the
promotion candidate to all nodes attached to the demotion candidate
(including the witness server, if in use).
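A quick way to verify this is to run a non-interactive SSH command from the promotion candidate, for example (assuming the LightDB system user is lightdb and the demotion candidate's host name is node1):
ssh -o BatchMode=yes lightdb@node1 true
If this completes without prompting for a password, the passwordless connection is working.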
ltcluster expects to find the ltcluster binary in the same path on the remote server as on the local server.
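To confirm the paths match, you can compare the binary location on both servers, for example (node1 is an assumed host name, and bear in mind that the PATH seen by a non-interactive SSH session may differ from that of an interactive login):
command -v ltcluster
ssh node1 "command -v ltcluster"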
Double-check which commands will be used to stop/start/restart the current
primary; this can be done e.g. by executing ltcluster node service
on the current primary:
ltcluster -f /etc/ltcluster.conf node service --list-actions --action=stop
ltcluster -f /etc/ltcluster.conf node service --list-actions --action=start
ltcluster -f /etc/ltcluster.conf node service --list-actions --action=restart
These commands can be defined in ltcluster.conf with
service_start_command, service_stop_command
and service_restart_command.
If ltcluster is installed from a package, you should set these commands to use the appropriate service commands defined by the package/operating system, as these will ensure LightDB is stopped/started properly, taking into account configuration and log file locations etc.
If the service_*_command
options aren't defined, ltcluster will
fall back to using lt_ctl to stop/start/restart
LightDB, which may not work properly, particularly when executed on a remote
server.
For more details, see service command settings.
On systemd systems we strongly recommend using the appropriate
systemctl commands (typically run via sudo) to ensure
systemd is informed about the status of the LightDB service.
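For example, on a systemd system the service commands might be defined along the following lines in ltcluster.conf (the unit name lightdb is an assumption here; use the service name provided by your package):
service_start_command   = 'sudo systemctl start lightdb'
service_stop_command    = 'sudo systemctl stop lightdb'
service_restart_command = 'sudo systemctl restart lightdb'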
If using sudo
for the systemctl
calls, make sure the
sudo
specification doesn't require a real tty for the user. If not set
this way, ltcluster
will fail to stop the primary.
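One way to achieve this is to disable the requiretty setting for the relevant user in the sudoers configuration (edited via visudo); assuming the LightDB system user is lightdb:
Defaults:lightdb !requiretty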
Check that access from applications is minimized or preferably blocked completely, so applications are not unexpectedly interrupted.
If an exclusive backup is running on the current primary, or if WAL replay is paused on the standby, ltcluster will not perform the switchover.
Check there is no significant replication lag on standbys attached to the current primary.
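Replication lag can also be checked on a standby with ltcluster node check, for example (the --replication-lag option is an assumption here; check which options are available in your ltcluster version):
ltcluster -f /etc/ltcluster.conf node check --replication-lag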
If WAL file archiving is set up, check that there is no backlog of files waiting
to be archived, as LightDB will not finally shut down until all of these have been
archived. If there is a backlog exceeding archive_ready_warning WAL files,
ltcluster will emit a warning before attempting to perform a switchover; you can also check
manually with ltcluster node check --archive-ready.
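For example (using the configuration file path shown elsewhere in this section):
ltcluster -f /etc/ltcluster.conf node check --archive-ready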
Finally, consider executing ltcluster standby switchover
with the
--dry-run
option; this will perform any necessary checks and inform you about
success/failure, and stop before the first actual command is run (which would be the shutdown of the
current primary). Example output:
$ ltcluster standby switchover -f /etc/ltcluster.conf --siblings-follow --dry-run
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
INFO: SSH connection to host "node1" succeeded
INFO: archive mode is "off"
INFO: replication lag on this standby is 0 seconds
INFO: all sibling nodes are reachable via SSH
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
INFO: following shutdown command would be run on node "node1": "lt_ctl -l /var/log/lightdb/startup.log -D '/var/lib/lightdb/data' -m fast -W stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
Be aware that --dry-run checks the prerequisites
for performing the switchover and performs some basic sanity checks on the
state of the database which might affect the switchover operation
(e.g. replication lag); it cannot, however, guarantee the switchover
operation will succeed. In particular, if the current primary
does not shut down cleanly, ltcluster will not be able to reliably
execute the switchover (as there would be a danger of divergence
between the former and new primary nodes).
See ltcluster standby switchover for a full list of available
command line options and ltcluster.conf
settings relevant
to performing a switchover.
If the demotion candidate does not shut down smoothly or cleanly, there's a risk it will have a slightly divergent timeline and will not be able to attach to the new primary. To fix this situation without needing to reclone the old primary, it's possible to use the lt_rewind utility, which will usually be able to resync the two servers.
To have ltcluster execute lt_rewind if it detects this
situation after promoting the new primary, add the --force-rewind
option.
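For example, a switchover invocation with automatic rewind enabled might look like this (a sketch; combine with whichever other options, such as --siblings-follow, you would normally use):
ltcluster standby switchover -f /etc/ltcluster.conf --force-rewind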
If ltcluster detects a situation where it needs to execute lt_rewind,
it will execute a CHECKPOINT
on the new primary before executing
lt_rewind.
For more details on lt_rewind, see: https://www.hs.net/lightdb/docs/html/app-pgrewind.html.