10.7. Monitoring standby disconnections on the primary node

10.7.1. Standby disconnections monitoring process and criteria
10.7.2. Standby disconnections monitoring process example
10.7.3. Standby disconnections monitoring caveats
10.7.4. Standby disconnections monitoring process configuration
10.7.5. Standby disconnections monitoring process event notifications

Note

This functionality is available in ltcluster 5 and later.

When running on the primary node, ltclusterd can monitor connections, and in particular disconnections, of its attached child nodes (standbys and, if in use, the witness server). It can optionally execute a custom command if certain criteria are met (such as the number of attached nodes falling to zero following a failover to a new primary). This command can be used, for example, to "fence" the node and ensure it is isolated from any applications attempting to access the replication cluster.

Note

Currently ltclusterd can only detect disconnections of streaming replication standbys and cannot determine whether a standby has disconnected and fallen back to archive recovery.

See the section Standby disconnections monitoring caveats below.

10.7.1. Standby disconnections monitoring process and criteria

ltclusterd monitors attached child nodes and decides whether to invoke the user-defined command based on the following process and criteria:

  • Every few seconds (defined by the configuration parameter child_nodes_check_interval; default: 5 seconds; a value of 0 disables this check altogether), ltclusterd queries the pg_stat_replication system view and compares the nodes present there against the list of nodes registered with ltcluster which should be attached to the primary.

    If a witness server is in use, ltclusterd connects to it and checks which upstream node it is following.

  • If a child node (standby) is no longer present in pg_stat_replication, ltclusterd notes the time it detected the node's absence, and additionally generates a child_node_disconnect event.

    If a witness server is in use, and it is no longer following the primary, or not reachable at all, ltclusterd notes the time it detected the node's absence, and additionally generates a child_node_disconnect event.

  • If a child node (standby) which was absent from pg_stat_replication reappears, ltclusterd clears the time it detected the node's absence, and additionally generates a child_node_reconnect event.

    If a witness server is in use and was previously not reachable or not following the primary node, but has now become reachable and is following the primary node, ltclusterd clears the time it detected the node's absence, and additionally generates a child_node_reconnect event.

  • If an entirely new child node (standby or witness) is detected, ltclusterd adds it to its internal list and additionally generates a child_node_new_connect event.

  • If the child_nodes_disconnect_command parameter is set in ltcluster.conf, ltclusterd will then loop through all child nodes. If it determines that insufficient child nodes are connected, and a minimum of child_nodes_disconnect_timeout seconds (default: 30) has elapsed since the last node became disconnected, ltclusterd will then execute the child_nodes_disconnect_command script.

    By default, the child_nodes_disconnect_command will only be executed if all child nodes are disconnected. If child_nodes_connected_min_count is set, the child_nodes_disconnect_command script will be triggered if the number of connected child nodes falls below the specified value (e.g. if set to 2, the script will be triggered if only one child node is connected). Alternatively, if child_nodes_disconnect_min_count is set and more than that number of child nodes disconnect, the script will be triggered.

    Note

    By default, a witness node, if in use, will not be counted as a child node for the purposes of determining whether to execute child_nodes_disconnect_command.

    To enable the witness node to be counted as a child node, set child_nodes_connected_include_witness in ltcluster.conf to true (and reload the configuration if ltclusterd is running).

  • Note that child nodes which are not attached when ltclusterd starts will not be considered as missing, as ltclusterd cannot know why they are not attached.
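
The trigger criteria described above can be illustrated with a minimal Python sketch. This is not ltclusterd's actual implementation; the function name should_trigger and its arguments are hypothetical and exist only to show how the three thresholds and the timeout interact, assuming the behaviour is exactly as described in this section.

```python
from typing import Optional


def should_trigger(total_children: int,
                   connected: int,
                   seconds_since_last_disconnect: Optional[float],
                   timeout: int = 30,
                   connected_min_count: Optional[int] = None,
                   disconnect_min_count: Optional[int] = None) -> bool:
    """Hypothetical model of the decision to run child_nodes_disconnect_command.

    seconds_since_last_disconnect is None when no child node has detached
    since startup (nodes never seen attached are not counted as missing).
    """
    disconnected = total_children - connected

    if connected_min_count is not None:
        # child_nodes_connected_min_count takes precedence when set.
        insufficient = connected < connected_min_count
    elif disconnect_min_count is not None:
        # Triggered when MORE than this number of child nodes are disconnected.
        insufficient = disconnected > disconnect_min_count
    else:
        # Default: only when no child nodes are connected.
        insufficient = connected == 0

    if not insufficient or seconds_since_last_disconnect is None:
        return False

    # Wait at least child_nodes_disconnect_timeout seconds after the
    # most recent disconnection.
    return seconds_since_last_disconnect >= timeout


# Two known child nodes, one connected, child_nodes_connected_min_count = 2:
print(should_trigger(2, 1, 0, connected_min_count=2))    # just disconnected
print(should_trigger(2, 1, 30, connected_min_count=2))   # timeout elapsed
```

With the default timeout of 30 seconds, the first call returns False and the second returns True.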

10.7.2. Standby disconnections monitoring process example

This example shows typical ltclusterd log output from a three-node cluster (primary and two child nodes), with child_nodes_connected_min_count set to 2.

ltclusterd on the primary has started up, while two child nodes are being provisioned:

[2019-04-24 15:25:33] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:25:35] [NOTICE] new node "node2" (ID: 2) has connected
[2019-04-24 15:25:35] [NOTICE] 1 (of 1) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:25:35] [INFO] no child nodes have detached since ltclusterd startup
(...)
[2019-04-24 15:25:44] [NOTICE] new node "node3" (ID: 3) has connected
[2019-04-24 15:25:46] [INFO] monitoring primary node "node1" (ID: 1) in normal state
(...)

One of the child nodes has disconnected; ltclusterd is now waiting child_nodes_disconnect_timeout seconds before executing child_nodes_disconnect_command:

[2019-04-24 15:28:11] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:28:17] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:28:19] [NOTICE] node "node3" (ID: 3) has disconnected
[2019-04-24 15:28:19] [NOTICE] 1 (of 2) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:28:19] [INFO] most recently detached child node was 3 (ca. 0 seconds ago), not triggering "child_nodes_disconnect_command"
[2019-04-24 15:28:19] [DETAIL] "child_nodes_disconnect_timeout" set To 30 seconds
(...)

child_nodes_disconnect_command is executed once:

[2019-04-24 15:28:49] [INFO] most recently detached child node was 3 (ca. 30 seconds ago), triggering "child_nodes_disconnect_command"
[2019-04-24 15:28:49] [INFO] "child_nodes_disconnect_command" is:
	"/usr/bin/fence-all-the-things.sh"
[2019-04-24 15:28:51] [NOTICE] 1 (of 2) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:28:51] [INFO] "child_nodes_disconnect_command" was previously executed, taking no action

10.7.3. Standby disconnections monitoring caveats

The following caveats should be considered if you intend to use this functionality.

  • If a child node is configured to use archive recovery, it's possible that the child node will disconnect from the primary node and fall back to archive recovery. In this case ltclusterd will nevertheless register a node disconnection.

  • ltcluster relies on the application_name in the child node's primary_conninfo string being the same as the node name defined in the node's ltcluster.conf file. Furthermore, this application_name must be unique across the replication cluster.

    If a custom application_name is used, or the application_name is not unique across the replication cluster, ltcluster will not be able to reliably monitor child node connections.
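
The application_name requirement can be checked ahead of time with a small script. The following Python sketch is an illustration only and is not an ltcluster tool: the function names, the sample host and user values, and the idea of collecting each node's primary_conninfo into a dictionary are all assumptions. It handles only simple keyword=value connection strings (no URI form, no spaces around "=", no backslash escapes).

```python
import shlex
from collections import Counter
from typing import Dict, List, Optional


def application_name(primary_conninfo: str) -> Optional[str]:
    """Extract application_name from a simple keyword=value conninfo string."""
    for token in shlex.split(primary_conninfo):
        key, sep, value = token.partition("=")
        if sep and key == "application_name":
            return value
    return None


def check_application_names(nodes: Dict[str, str]) -> List[str]:
    """nodes maps each node name (from ltcluster.conf) to its primary_conninfo.

    Returns a list of human-readable problems; an empty list means every
    application_name matches its node name and all names are unique.
    """
    names = {node: application_name(conninfo) for node, conninfo in nodes.items()}
    problems = []
    for node, app in names.items():
        if app != node:
            problems.append(f"{node}: application_name is {app!r}, expected {node!r}")
    counts = Counter(app for app in names.values() if app is not None)
    for app, count in counts.items():
        if count > 1:
            problems.append(f"application_name {app!r} is used by {count} nodes")
    return problems


# Hypothetical cluster in which node3 reuses node2's application_name.
cluster = {
    "node2": "host=node1 user=ltcluster application_name=node2",
    "node3": "host=node1 user=ltcluster application_name=node2",
}
for problem in check_application_names(cluster):
    print(problem)
```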

10.7.4. Standby disconnections monitoring process configuration

The following parameters, set in ltcluster.conf, control how child node disconnection monitoring operates.

child_nodes_check_interval

Interval (in seconds) after which ltclusterd queries the pg_stat_replication system view and compares the nodes present there against the list of nodes registered with ltcluster which should be attached to the primary.

Default is 5 seconds; a value of 0 disables this check altogether.

child_nodes_disconnect_command

User-definable script to be executed when ltclusterd determines that an insufficient number of child nodes are connected. By default the script is executed when no child nodes are connected, but the execution threshold can be modified by setting one of child_nodes_connected_min_count or child_nodes_disconnect_min_count (see below).

The child_nodes_disconnect_command script can be any user-defined script or program. It must be able to be executed by the system user under which the LightDB server itself runs (usually lightdb).

Note

If child_nodes_disconnect_command is not set, no action will be taken.

If specified, the following format placeholder will be substituted when executing child_nodes_disconnect_command:

%p

ID of the node executing the child_nodes_disconnect_command script.

The child_nodes_disconnect_command script will only be executed once while the criteria for its execution are met. If the criteria cease to be met (i.e. some child nodes have reconnected) and are later met again, the script will be executed again.

The child_nodes_disconnect_command script will not be executed if ltclusterd is paused.
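
The placeholder substitution and the execute-once behaviour described above can be modelled as follows. This is a hedged sketch rather than ltclusterd's code: the class DisconnectCommandRunner is hypothetical, passing %p as a script argument is an assumed usage, and the sketch only builds the argument list instead of running anything, since this section does not state how ltclusterd invokes the command.

```python
import shlex
from typing import List, Optional


class DisconnectCommandRunner:
    """Hypothetical model of when child_nodes_disconnect_command would run."""

    def __init__(self, command_template: str, node_id: int) -> None:
        self.command_template = command_template
        self.node_id = node_id
        self.already_executed = False

    def next_command(self, criteria_met: bool,
                     paused: bool = False) -> Optional[List[str]]:
        """Return the argument list to execute now, or None to take no action."""
        if not criteria_met:
            # Child nodes have reconnected: allow a later re-execution.
            self.already_executed = False
            return None
        if paused or self.already_executed:
            # Never execute while paused, and only once per incident.
            return None
        self.already_executed = True
        # %p is replaced with the ID of the node executing the command.
        expanded = self.command_template.replace("%p", str(self.node_id))
        return shlex.split(expanded)


runner = DisconnectCommandRunner("/usr/bin/fence-all-the-things.sh %p", node_id=1)
print(runner.next_command(criteria_met=True))   # argument list, executed once
print(runner.next_command(criteria_met=True))   # None: previously executed
print(runner.next_command(criteria_met=False))  # None: criteria cleared
print(runner.next_command(criteria_met=True))   # argument list again
```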

child_nodes_disconnect_timeout

If ltclusterd determines that an insufficient number of child nodes are connected, it will wait for the specified number of seconds before executing the child_nodes_disconnect_command.

Default: 30 seconds.

child_nodes_connected_min_count

If the number of child nodes connected falls below the number specified in this parameter, the child_nodes_disconnect_command script will be executed.

For example, if child_nodes_connected_min_count is set to 2, the child_nodes_disconnect_command script will be executed if one or no child nodes are connected.

Note that child_nodes_connected_min_count overrides any value set in child_nodes_disconnect_min_count.

If neither child_nodes_connected_min_count nor child_nodes_disconnect_min_count is set, the child_nodes_disconnect_command script will be executed when no child nodes are connected.

A witness node, if in use, will not be counted as a child node unless child_nodes_connected_include_witness is set to true.

child_nodes_disconnect_min_count

If the number of disconnected child nodes exceeds the number specified in this parameter, the child_nodes_disconnect_command script will be executed.

For example, if child_nodes_disconnect_min_count is set to 2, the child_nodes_disconnect_command script will be executed if more than two child nodes are disconnected.

Note that any value set in child_nodes_disconnect_min_count will be overridden by child_nodes_connected_min_count.

If neither child_nodes_connected_min_count nor child_nodes_disconnect_min_count is set, the child_nodes_disconnect_command script will be executed when no child nodes are connected.

A witness node, if in use, will not be counted as a child node unless child_nodes_connected_include_witness is set to true.

child_nodes_connected_include_witness

Whether to count the witness node (if in use) as a child node when determining whether to execute child_nodes_disconnect_command.

Defaults to false.

10.7.5. Standby disconnections monitoring process event notifications

The following event notifications may be generated:

child_node_disconnect

This event is generated after ltclusterd detects that a child node is no longer streaming from the primary node.

Example:

$ ltcluster cluster event --event=child_node_disconnect
 Node ID | Name  | Event                 | OK | Timestamp           | Details
---------+-------+-----------------------+----+---------------------+--------------------------------------------
 1       | node1 | child_node_disconnect | t  | 2019-04-24 12:41:36 | node "node3" (ID: 3) has disconnected

child_node_reconnect

This event is generated after ltclusterd detects that a child node has resumed streaming from the primary node.

Example:

$ ltcluster cluster event --event=child_node_reconnect
 Node ID | Name  | Event                | OK | Timestamp           | Details
---------+-------+----------------------+----+---------------------+------------------------------------------------------------
 1       | node1 | child_node_reconnect | t  | 2019-04-24 12:42:19 | node "node3" (ID: 3) has reconnected after 42 seconds

child_node_new_connect

This event is generated after ltclusterd detects that a new child node has been registered with ltcluster and has connected to the primary.

Example:

$ ltcluster cluster event --event=child_node_new_connect
 Node ID | Name  | Event                  | OK | Timestamp           | Details
---------+-------+------------------------+----+---------------------+---------------------------------------------
 1       | node1 | child_node_new_connect | t  | 2019-04-24 12:41:30 | new node "node3" (ID: 3) has connected

child_nodes_disconnect_command

This event is generated after ltclusterd detects that sufficient child nodes have been disconnected for a sufficient amount of time to trigger execution of the child_nodes_disconnect_command.

Example:

$ ltcluster cluster event --event=child_nodes_disconnect_command
 Node ID | Name  | Event                          | OK | Timestamp           | Details
---------+-------+--------------------------------+----+---------------------+--------------------------------------------------------
 1       | node1 | child_nodes_disconnect_command | t  | 2019-04-24 13:08:17 | "child_nodes_disconnect_command" successfully executed