As emphasised previously, performing a switchover is a non-trivial operation, and a number of issues can occur. While ltcluster attempts to perform sanity checks, there is no guaranteed way of determining whether a switchover will succeed without actually carrying it out.
ltcluster may abort a switchover with a message like:
ERROR: shutdown of the primary server could not be confirmed
HINT: check the primary server status before performing any further actions
This means the shutdown of the old primary has taken longer than ltcluster expected, and it has given up waiting.
In this case, check the LightDB log on the primary server to see what is going on. It's entirely possible the shutdown process is just taking longer than the timeout set by the configuration parameter shutdown_check_timeout (default: 60 seconds), in which case you may need to adjust this parameter.
Note that shutdown_check_timeout is set on the node where ltcluster standby switchover is executed (promotion candidate); setting it on the demotion candidate (former primary) will have no effect.
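As an illustration of the adjustment described above, the fragment below raises the timeout on the promotion candidate. It is only a sketch: the file name ltcluster.conf and the value 120 are assumptions chosen for the example, not recommendations from this documentation.

```ini
# ltcluster.conf on the promotion candidate, i.e. the standby where
# "ltcluster standby switchover" is executed (file name assumed).

# Wait up to 120 seconds for the old primary to shut down.
# 120 is an illustrative value; the default is 60.
shutdown_check_timeout=120
```

The same setting placed in the configuration file of the current primary would not be consulted during the switchover.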
If the primary server has shut down cleanly, and no other node has been promoted, it is safe to restart it, in which case the replication cluster will be restored to its original configuration.
ltcluster may abort a switchover with a message like:
ERROR: unable to perform a switchover while primary server is in exclusive backup mode
HINT: stop backup before attempting the switchover
This means an exclusive backup is running on the current primary; interrupting this will not only abort the backup, but potentially leave the primary with an ambiguous backup state.
To proceed, either wait until the backup has finished, or cancel it with the command SELECT pg_stop_backup(). For more details see the LightDB documentation section "Making an exclusive low level backup".
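Before cancelling, it may help to confirm that an exclusive backup really is in progress. The following is a minimal sketch to run on the current primary. It assumes the LightDB version in use provides pg_is_in_backup(), as PostgreSQL 14 and earlier do; that function is not mentioned in this section.

```sql
-- Returns true while an exclusive backup is in progress on this server
-- (assumption: pg_is_in_backup() is available in this LightDB version).
SELECT pg_is_in_backup();

-- Only if the backup is to be abandoned: end exclusive backup mode
-- so that the switchover can be attempted again.
SELECT pg_stop_backup();
```

If the first query returns false, no exclusive backup is in progress and the second statement should be skipped.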