What to check when troubleshooting connectivity
Is the database server reachable?
telnet <host> <port>, or nc -zv <host> <port>
Use traceroute (Unix/Linux) or tracert (Windows), specifying the protocol allowed by the host and intermediary routers.
Are the host and port correct?
Is the server running?
lt_ctl status
Does the server-side firewall allow communication through the port?
Does the client-side firewall allow communication to the server port?
Does lt_hba.conf have any entry that allows the combination of SSL/non-SSL, client host, database, and user?
Is the listen_addresses parameter configured to allow connection through the desired IP addresses, including IPv4 and/or IPv6?
Are the database, user name, and password correct?
Does the user have permission to connect to the database?
Check privileges with ltsql's \l or pg_database.datacl.
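For example, connect privileges can be inspected and granted like this (mydb and myuser are placeholder names):
ex.
SELECT datname, datacl FROM pg_database;   -- NULL datacl means default privileges
GRANT CONNECT ON DATABASE mydb TO myuser;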
Does the database server have enough CPU and memory resources?
Has the maximum connection limit been reached?
max_connections and superuser_reserved_connections parameters (at instance level)
CREATE/ALTER DATABASE CONNECTION LIMIT (at database level)
CREATE/ALTER ROLE CONNECTION LIMIT (at user level)
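A sketch of checking and adjusting these limits (names and values are illustrative):
ex.
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend';
ALTER DATABASE mydb CONNECTION LIMIT 100;
ALTER ROLE myuser CONNECTION LIMIT 10;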
Connection termination and query cancellation
When the connection is closed, any incomplete transaction is rolled back.
Terminating a connection (pg_terminate_backend()) and canceling a query (pg_cancel_backend()) do not always work. For example, they don't work while the backend process is running in an uninterruptible section, such as waiting to acquire a lightweight lock, a read/write system call against a network storage device, or a loop without a cancellation point.
Set statement_timeout at appropriate levels (statement, user, database, instance). A short timeout at a wide scope is not recommended because it cancels intentionally long-running queries.
Set client-side timeouts appropriately.
Set server-side timeouts appropriately.
tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count
TCP keep-alive works only while the TCP connection is idle. It does not help while the socket connection is being established, or after some data has been sent and is awaiting its ACK.
The effective timeout is tcp_keepalives_idle + tcp_keepalives_interval * tcp_keepalives_count.
tcp_user_timeout
Sets the TCP retransmission timeout. A relatively new option, available on Linux.
Comes to the rescue when TCP keep-alive doesn't help, i.e., when the socket connection is being established, or some data has been sent and is awaiting its ACK.
Confusing when used together with TCP keep-alive, because it changes when TCP keep-alive times out. It is safest to set tcp_user_timeout to tcp_keepalives_idle + tcp_keepalives_interval * tcp_keepalives_count (see the example after this list).
authentication_timeout
idle_in_transaction_session_timeout
idle_session_timeout
client_connection_check_interval
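As a sketch, these timeouts can be set at various levels like the following (values are illustrative, not recommendations):
ex.
ALTER ROLE reporting_user SET statement_timeout = '60min';  -- per user
ALTER DATABASE mydb SET statement_timeout = '30s';          -- per database
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 6;
ALTER SYSTEM SET tcp_user_timeout = '120s';  -- 60 + 10 * 6 seconds, per the guidance above
SELECT pg_reload_conf();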
Connection failover
Client drivers allow multiple hosts to be specified in the connection string.
Connection timeout is applied to each host in the connection string. Therefore, making a connection may take unexpectedly long if there are many failed hosts before the running one in the list.
Restore session state after failover: session variables, prepared statements, temporary tables, holdable cursors (created with DECLARE CURSOR WITH HOLD), advisory locks, session user (set with SET SESSION AUTHORIZATION), current user (set with SET ROLE).
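As an illustration, a multi-host connection string might look like this (host names are placeholders; target_session_attrs assumes a libpq-compatible driver):
ex.
host=primary.example.com,standby.example.com port=5432,5432 connect_timeout=5 target_session_attrs=read-write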
What WAL is for
Crash recovery, archive recovery (Point-In-Time-Recovery: PITR), and replication
Updates are redone regardless of whether the transaction was committed.
There is no undo log (before image) or operation, unlike other popular DBMSs.
Therefore, transaction rollback and crash recovery are fast.
Changes by an aborted transaction are left in memory and on disk, but they are invisible to other transactions thanks to MVCC.
WAL structure
A sequence of 8 KB blocks.
Each block can contain multiple WAL records. Also, each WAL record can span multiple blocks.
The content is the same both in memory and on storage.
The WAL buffer is a contiguous array of blocks in memory. It's used in a circular fashion.
The WAL on storage is divided into WAL segment files. Each WAL segment file is 16 MB by default, which is configurable with lt_initdb's --wal-segsize=size.
Writing WAL
In memory, modify the data pages in shared buffers, and then write the change into the WAL buffer.
WAL buffer is always written to WAL files sequentially (no random write).
Before writing a dirty data page in shared buffers out to disk, first write out all WAL records up to the latest one that affected the data page. This rule is the essence of WAL (Write-Ahead Logging).
Each data page has, in its page header, the location (LSN) of the WAL record that represents the latest update to it. This is the page LSN.
LSN (Log Sequence Number): an unsigned 8-byte integer. It represents the WAL segment, block, and an offset in that block.
If writing to the WAL file fails, the instance will crash with a PANIC message.
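For illustration, the current WAL insert position, its containing segment file, and the distance between two LSNs can be inspected with the standard functions (assuming PostgreSQL-compatible names):
ex.
SELECT pg_current_wal_lsn();                                 -- e.g., 0/3000148
SELECT pg_walfile_name(pg_current_wal_lsn());                -- containing WAL segment file
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000000');   -- difference in bytes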
WAL volume can grow beyond max_wal_size for reasons such as:
Heavy writes like loading data with COPY
failure to archive WAL files
a large value of wal_keep_size
an unused replication slot (see the query after this list)
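To spot an unused replication slot that is retaining WAL, a query along these lines can help:
ex.
SELECT slot_name, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;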
SELECT can modify data pages and write WAL when:
acquiring row locks, e.g., SELECT FOR UPDATE. They set xmax in the tuple header, and could possibly update MultiXact data structures.
pruning line pointers and defragmenting the page.
setting hint bits in tuple headers. WAL is emitted when page checksums are enabled.
ACID: what they are attributed to
Atomicity: transaction rollback and database recovery
Consistency: integrity constraints and triggers, such as non-NULL, check, primary key/unique/foreign key constraints
Isolation: MVCC and locks
Durability: WAL
Transaction ID (XID)
A transaction is assigned an XID when it first modifies data, such as in INSERT, UPDATE, DELETE, and SELECT FOR SHARE/UPDATE.
XID assignments are serialized with XidGen LWLock.
XID assignment is usually very fast, but it might sometimes experience hiccups. Every 32K transactions, it allocates and zeroes a new commit log (clog) page through the SLRU cache, and that clog page allocation may have to flush a dirty page for page replacement.
This can cause an unpredictable spike in response time.
Read-only transactions do not assign an XID. They are free from the LWLock contention for assigning a new XID.
XIDs are stored in tuple headers and are visible as the xmin and xmax system columns (SELECT xmin, xmax, * FROM mytable).
xmin is the XID of a transaction that created the tuple (INSERT, UPDATE, COPY).
xmax is the XID of a transaction that either:
deleted the tuple (DELETE, UPDATE).
locked the tuple (e.g., SELECT FOR SHARE/UPDATE)
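A small illustration of lazy XID assignment and the xmin/xmax columns (pg_current_xact_id_if_assigned() assumes PostgreSQL 13-compatible naming; mytable is a placeholder table with one integer column):
ex.
BEGIN;
SELECT pg_current_xact_id_if_assigned();  -- NULL: a read-only statement assigns no XID
INSERT INTO mytable VALUES (1);
SELECT pg_current_xact_id_if_assigned();  -- now an XID has been assigned
SELECT xmin, xmax, * FROM mytable;        -- xmin = our XID, xmax = 0 for the new tuple
COMMIT;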
Special XID values
0: invalid XID
1: bootstrap XID. Used by bootstrap processing during lt_initdb.
2: frozen XID. Recent versions of Lightdb only use this for sequence tuples, not for tables.
MVCC: Multi-Version Concurrency Control
The major advantage of MVCC is that "reading never blocks writing and writing never blocks reading," i.e., UPDATE/DELETE and SELECT on the same row do not block each other.
Writes to the same row block each other.
In traditional lock-based concurrency control, a read and a write on the same row conflict.
How it works overall:
Insert and update to a row create a new version of the row. Update leaves the old row version for other running transactions. (Multi-version)
XID of the creating transaction is set to xmin field in the new row version.
The new row version is visible only to its creating transaction until it commits.
Once the creating transaction commits, all new subsequent transactions will be able to see the new row version. Other existing transactions continue to see the old row version. The old row version is a "dead tuple" now.
Deleting a row does not remove the row version. It sets the XID of the deleting transaction in the xmax field of the row version.
The deleted row version is invisible only to its deleting transaction until it commits.
Once the deleting transaction commits, all new subsequent transactions won't be able to see the row version. Other existing transactions continue to see it. The row version is a "dead tuple" now.
Finally, when there are no transactions remaining that can see the dead tuple, vacuum removes it.
How tuple visibility works:
Each transaction uses its own snapshot, commit log (clog), and the xmin and/or xmax in the target tuple header, to determine whether it can see a given row version.
What snapshot is:
A picture of what transactions are running at a certain point of time.
You can run "SELECT pg_current_snapshot();" to see the snapshot of the current transaction.
The snapshot's textual representation is xmin:xmax:xip_list. e.g., 10:20:10,14,15.
xmin: Lowest transaction ID that was still active. All transaction IDs less than xmin are either committed and visible, or rolled back and dead.
xmax: One past the highest completed transaction ID. All transaction IDs greater than or equal to xmax had not yet completed as of the time of the snapshot, and thus are invisible.
xip_list: Transactions in progress at the time of the snapshot. A transaction ID that is xmin <= X < xmax and not in this list was already completed at the time of the snapshot, and thus is either visible or dead according to its commit status.
In a READ COMMITTED transaction, a snapshot is taken at the beginning of every SQL statement.
In a REPEATABLE READ or SERIALIZABLE transaction, a snapshot is obtained at the start of the first SQL statement and used throughout the transaction.
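The parts of a snapshot can be extracted with the pg_snapshot_* functions, assuming PostgreSQL 13-compatible naming:
ex.
SELECT pg_current_snapshot();                    -- e.g., 10:20:10,14,15
SELECT pg_snapshot_xmin(pg_current_snapshot());  -- 10
SELECT pg_snapshot_xmax(pg_current_snapshot());  -- 20
SELECT pg_snapshot_xip(pg_current_snapshot());   -- set of in-progress XIDs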
What commit log (clog) is:
An array of bits representing the transaction status.
Two bits per transaction indicate its outcome: IN PROGRESS, COMMITTED, ABORTED, or SUBCOMMITTED.
Stored in a set of files in $LTDATA/lt_xact/.
Cached in memory in 128 buffers of 8 KB pages.
Clog is consulted when the snapshot shows that the target transaction has been completed.
Based on the snapshot and clog, the change by a committed transaction is visible, and that by an aborted or running transaction is invisible.
The actual tuple visibility is much more complex...
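As a small illustration, the clog status of a given XID can be queried directly, assuming the PostgreSQL-compatible pg_xact_status() function is available:
ex.
SELECT pg_xact_status('1234'::xid8);  -- 'committed', 'aborted', or 'in progress'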
Hint bit
Hint bits are the bits in the tuple header's infomask field that help determine tuple visibility.
They are for performance optimization. Not essential to data correctness.
They represent whether the transaction indicated by xmin or xmax was committed or aborted. There are four flag bits:
HEAP_XMIN_COMMITTED: xmin transaction was committed
HEAP_XMIN_INVALID: xmin transaction was aborted
HEAP_XMAX_COMMITTED: xmax transaction was committed
HEAP_XMAX_INVALID: xmax transaction was aborted
How hint bits are used:
A transaction checks the hint bits to see if the xmin and/or xmax transaction was committed or aborted.
If the hint bits are set, done.
Otherwise, examine the commit log ($LTDATA/lt_xact/), and possibly the subtransaction hierarchy ($LTDATA/lt_subtrans/) to determine the transaction outcome. This is an expensive operation.
Set the hint bits. They will be persisted to disk later.
Setting hint bits writes a data page, and can also write WAL if page checksums are enabled.
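If the pageinspect module is available, the raw hint bits can be observed directly; a sketch (the mask values follow the PostgreSQL source, and mytable is a placeholder):
ex.
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT t_xmin, t_xmax,
       (t_infomask & 256) > 0 AS xmin_committed,  -- HEAP_XMIN_COMMITTED = 0x0100
       (t_infomask & 512) > 0 AS xmin_invalid     -- HEAP_XMIN_INVALID  = 0x0200
FROM heap_page_items(get_raw_page('mytable', 0));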
A lock request can wait, even when the requested mode is compatible with held locks.
Q: Do you think Transaction 3 goes on to run the query?
Transaction 1: A long-running SELECT is still running against mytable.
Transaction 2: Runs "ALTER TABLE mytable ADD COLUMN new_col int;". It gets blocked because ALTER TABLE's Access Exclusive lock request conflicts with the Access Share lock held by Transaction 1's SELECT.
Transaction 3: Runs a short SELECT query against mytable.
A: Transaction 3 waits until Transaction 2 completes, because Transaction 2 came earlier and is waiting.
Later requestors respect earlier waiters in the wait queue and do not overtake them. Otherwise, earlier requestors might wait for an unduly long time.
Therefore, execute even DDL that is expected to run fast:
during off-peak hours, and/or
with a lock timeout, e.g., run "SET lock_timeout = '5s';" before the DDL and retry the DDL if it times out (see the sketch below).
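A sketch of the recommended pattern, plus a way to see who blocks whom (pg_blocking_pids() assumes a PostgreSQL 9.6-compatible function):
ex.
SET lock_timeout = '5s';
ALTER TABLE mytable ADD COLUMN new_col int;  -- the application retries if this times out
-- From another session, inspect waiters and their blockers:
SELECT pid, pg_blocking_pids(pid), query
FROM pg_stat_activity WHERE wait_event_type = 'Lock';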
This is not true for lightweight locks. In extreme cases, an Exclusive mode request on an LWLock could wait for dozens of seconds because later Share mode requests keep arriving one after another.
Prepared transactions lurk holding locks
A prepared transaction continues to hold locks, but it does not appear in pg_stat_activity because it has no associated session.
pg_locks shows the prepared transaction as an entry with a NULL pid. Check pg_prepared_xacts.
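For example, lingering prepared transactions can be found and resolved like this (the GID is application-specific and shown as a placeholder):
ex.
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
ROLLBACK PREPARED 'some_gid';  -- or COMMIT PREPARED 'some_gid'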
Data checksums
Purpose and usage
The data page of every relation, including all forks, has a 16-bit checksum in its page header.
Designed to detect corruption by the I/O system (e.g., volume manager, file system, disk driver, storage firmware, storage device, etc.). Early detection prevents the propagation of corruption.
Not designed to detect memory errors.
Enabled at the full cluster level, either with lt_initdb's -k/--data-checksums or with lt_checksums while the database server is shut down. Disabled by default due to its performance overhead.
Run "SHOW data_checksums"
to
know if data checksums are enabled. It returns on or off.
How it works
The checksum is calculated from the page content and set when the data page is about to be written out to disk.
Just after reading the page from disk, the checksum is verified by comparing the value set in the page header and the newly calculated value.
If the verification fails, WARNING and ERROR messages are emitted, resulting in a query failure.
If the page header fails a basic sanity check before performing checksum verification, the query will fail with the same ERROR message, without the WARNING that indicates the checksum failure.
WAL CRC
WAL uses a 32-bit CRC in each WAL record header.
The CRC is set when the WAL record is put in the WAL buffer, and verified when the WAL record is read.
Utilities to detect, bypass, or repair data corruption (some could be dangerous!)
Additional modules
amcheck: Detects logical corruption of heaps (table, sequence) and B-tree indexes.
pg_visibility_map: pg_check_frozen() and pg_check_visible() detect visibility map corruption.
Configuration parameters
ignore_checksum_failure
zero_damaged_pages
ignore_system_indexes
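A rough amcheck usage sketch (object names are placeholders; verify_heapam() is only available in newer versions of the module):
ex.
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('my_index');        -- B-tree index checks
SELECT * FROM verify_heapam('my_table');  -- heap checks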
Backup and recovery methods
File system level backup (binary format)
SQL Dump with lt_dump/lt_dumpall (text format)
Continuous archiving (binary format)
Characteristics of backup and recovery methods
SQL dump and continuous archiving can be performed online. The file system level backup requires the database server to be shut down.
SQL dump can selectively back up and restore individual tables. The other methods cannot back up or restore only certain individual tables or tablespaces.
The SQL dump will typically be smaller, because the SQL script needs to contain just the index creation command, not the index data.
The SQL dump can be loaded into a database of a newer major version.
The SQL dump can transfer a database to a different machine architecture, such as going from a 32-bit to a 64-bit server.
Continuous archiving can perform PITR. The database cluster can be recovered up-to-date or to a certain point of time.
Dumps created by lt_dump are consistent; the dump of each database is a snapshot of the database when lt_dump started. lt_dumpall calls lt_dump for each database in turn, so the database cluster-wide consistency is not guaranteed.
lt_dump dumps all data in a database within a single transaction, issuing many SELECT commands. That long-running transaction could:
block other operations that require strong lock modes, such as ALTER TABLE, TRUNCATE, CLUSTER, REINDEX.
cause table and index bloat, because vacuum cannot remove dead tuples.
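Such long-running dump transactions can be watched with a query like:
ex.
SELECT pid, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE backend_type = 'client backend' AND xact_start IS NOT NULL
ORDER BY xact_start;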
pg_start_backup() and pg_basebackup of continuous archiving perform a checkpoint at the beginning. The user can choose the checkpoint speed between "fast" and "spread".
Archive recovery, as well as crash recovery, empties the content of unlogged relations. SQL dump outputs the contents of unlogged tables.
lt_dumpall's --no-role-passwords option uses pg_roles instead of pg_authid to dump database roles. This allows the use of lt_dumpall in restricted environments like DBaaS where users are not permitted to read pg_authid to protect passwords. The restored roles will have NULL passwords.
Architecture
Topology
Only the entire database cluster is replicated. Partial replication is not possible.
One primary server replicates to one or more standby servers.
Each standby replicates from one primary.
The standby can cascade changes to other standbys.
The primary is unaware of the locations of standbys. The standby connects to the primary specified by primary_conninfo parameter.
ex.
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
Primary and standby versions
Different major versions don't work.
Different minor versions will work because the disk format is the same, but no formal support is offered. It's advised to keep primary and standby servers at the same minor version.
It's safest to update the standby servers first during minor version upgrade.
Processes and data flow
At server startup, the standby first reads and applies WAL from archive, next from $LTDATA/lt_wal/, and then launches one walreceiver, which connects to the primary and streams WAL from there. If the replication connection is terminated, it repeats this cycle at 5 second intervals, which can be configured by wal_retrieve_retry_interval.
The primary spawns the walsender when it accepts the connection request from walreceiver.
walsender reads and sends WAL to the walreceiver.
walreceiver writes and flushes the streamed WAL to $LTDATA/lt_wal/, and notifies the startup process.
A single startup process reads and applies the WAL.
walreceiver periodically notifies the walsender of replication progress -- how far it has written, flushed, and applied the WAL.
A cascading standby has walsenders as well as a walreceiver running.
Replication user
The replication user needs REPLICATION role attribute.
REPLICATION enables the user to read all data for replication, but not for consumption by SELECT queries.
General administration
The standby is read-only. No objects, including roles, can be created on the standby alone.
max_wal_senders should be slightly higher than the number of standby servers, so that a standby can reconnect after a temporary unexpected disconnection while its disconnected walsender still lingers.
Backups can be taken on the standby.
archive_timeout is not required to reduce the data loss window.
Cascading replication reduces the load on the primary.
WAL on the primary
Without any measure, the primary does not care about the standby and removes/recycles old WAL files that the standby still needs.
If the standby requests WAL that has already been removed, the primary emits a message like "ERROR: requested WAL segment 000000020000000300000041 has already been removed".
To make the primary preserve WAL files, either use a replication slot (preferred) or set wal_keep_size (the old method).
max_slot_wal_keep_size caps the WAL volume preserved by the replication slot.
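A sketch of the slot-based setup (the slot name is a placeholder):
ex.
-- On the primary:
SELECT pg_create_physical_replication_slot('standby1_slot');
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
-- On the standby, in its configuration file:
-- primary_slot_name = 'standby1_slot'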
Synchronous replication
The transaction hangs during its commit if no synchronous standby is available.
To resume the hanging transaction, remove synchronous_standby_names setting and reload the configuration. That makes the replication asynchronous.
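For example, assuming the setting is managed with ALTER SYSTEM:
ex.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();  -- hanging commits resume; replication is now asynchronous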
Causes of replication lag
Hardware configuration: Server, storage, and network
Heavy workload on the primary: The amount of WAL generated on the primary is so large that the solo startup process cannot keep up.
Set wal_compression = on to reduce the amount of WAL.
Retrieving WAL from slow archive: The standby could not get WAL from the primary, so it has to fetch it from the WAL archive.
Recovery conflicts: The replay of some operations can be blocked by queries running on the standby.
This is relevant when hot standby is used.
Reduce max_standby_archive_delay and max_standby_streaming_delay to cancel conflicting queries and resume WAL replay early.
Hot standby
The ability to run read-only queries while the server is in archive recovery or standby mode.
Recovery conflicts
Conflicts between the WAL replay and queries on the standby.
Either delay WAL replay or cancel queries.
Actions on the primary that cause recovery conflicts include:
Operations that take Access Exclusive locks: DDL, LOCK, file truncation by vacuum (including autovacuum)
Access Exclusive lock requests are WAL-logged and replayed by the standby.
Dropping a tablespace where queries put temporary files on the standby
Dropping a database to which clients are connected on the standby
Vacuum cleanup of dead tuples that standby transactions can still see according to their snapshots
Vacuum cleanup of a page on which standby transactions have a buffer pin (e.g., the cursor is positioned on the page.)
What happens upon recovery conflicts
WAL application waits for at most the period specified by max_standby_archive_delay and max_standby_streaming_delay (except for the replay of DROP DATABASE and ALTER DATABASE SET TABLESPACE.)
Then, conflicting sessions are terminated in the case of replaying DROP DATABASE, or conflicting queries are canceled in other cases.
If an idle session holds a lock, the session is also terminated.
Monitoring recovery conflicts
pg_stat_database_conflicts on the standby shows the number of canceled queries due to each type of recovery conflict.
"log_recovery_conflict_waits = on" logs messages that the WAL application has waited longer than deadlock_timeout and the wait finished.
LOG: recovery still waiting after 1.023 ms: recovery conflict on snapshot
DETAIL: Conflicting processes: 1234, 1235
LOG: recovery finished waiting after 3.234 ms: recovery conflict on snapshot
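The per-database conflict counters can be read with:
ex.
SELECT datname, confl_tablespace, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;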
Minimizing the number of queries cancelled due to recovery conflicts
Avoid operations that require Access Exclusive locks. e.g., ALTER TABLE, VACUUM FULL, CLUSTER, REINDEX, TRUNCATE
Disable file truncation by vacuum by setting the vacuum_truncate storage parameter on the primary.
ex.
ALTER TABLE some_table SET (vacuum_truncate = off);
Set hot_standby_feedback = on on the standby.
sends the oldest XID to the primary, reflected in pg_stat_replication.backend_xmin, which is taken into account when vacuum decides to remove a dead tuple.
can incur table bloat because dead tuple removal is delayed.
cannot prevent all conflicts.
Adjust max_standby_streaming_delay/max_standby_archive_delay on the standby.
Adjust vacuum_defer_cleanup_age on the primary.
It's ideal to have separate standbys, some for high availability and others for read workloads that tolerate stale data.
Monitoring replication lag
pg_stat_replication is available not only on the primary but also on cascading standbys.
Large differences between pg_current_wal_lsn on the primary and pg_stat_replication's sent_lsn field might indicate that the primary server is under heavy load.
Differences between sent_lsn and pg_last_wal_receive_lsn on the standby might indicate network delay, or that the standby is under heavy load.
A large difference between pg_last_wal_replay_lsn() and pg_stat_wal_receiver's flushed_lsn indicates that WAL is being received faster than it can be replayed.
ex.
SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), flushed_lsn) FROM pg_stat_wal_receiver;
Storage write latency, IOPs, and throughput to check for heavy write activity.
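On the primary, the lag columns of pg_stat_replication give a per-standby, time-based view (assuming PostgreSQL 10-compatible columns):
ex.
SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;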
Architecture
Uses a publish and subscribe model
A publication is a collection of tables whose changes are to be replicated.
A subscription represents a connection to the publisher and its publications to subscribe to.
One publisher publishes one or more publications.
One subscriber has one or more subscriptions.
A publication can have multiple subscribers.
A subscription can subscribe to multiple publications.
Publications can choose to limit the changes they produce to any combination of INSERT, UPDATE, DELETE, and TRUNCATE.
Publications can restrict the rows and columns to be replicated.
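A minimal publish/subscribe sketch (table, publication, subscription, and connection details are placeholders):
ex.
-- On the publisher:
CREATE PUBLICATION my_pub FOR TABLE t1, t2 WITH (publish = 'insert, update, delete');
-- On the subscriber:
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=pub.example.com dbname=mydb user=repuser'
  PUBLICATION my_pub;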
Processes and data flow
Processes involved: walsender on the publisher, subscription workers (apply worker, tablesync worker) on the subscriber.
walreceiver does not appear, even though some walreceiver-related parameters are used.
At server startup on the subscriber, the logical replication launcher is started unless max_logical_replication_workers is 0.
logical replication launcher starts an apply worker for each enabled subscription.
The apply worker connects to the publisher.
The apply worker launches tablesync workers for tables that have not completed initial synchronization. Those tablesync workers each connect to the publisher.
The publisher spawns a walsender for each connection request from the subscription workers.
The walsender for a tablesync worker sends the initial copy of a table to the tablesync worker. (Initial data synchronization/copy)
walsender reads WAL, decodes changes into the logical replication protocol format, and stores them in logical decoding work memory, possibly spilling to files. When a transaction commits, walsender sends its decoded changes to the subscription workers.
The subscription workers apply the received changes.
General administration
Major restrictions
Publications can only contain tables.
DDL is not replicated.
Add table columns on the subscriber first, then on the publisher. Reverse the order when dropping table columns.
Sequence data is not replicated.
Replication identity
A published table must have a replica identity to replicate UPDATE and DELETE operations.
Used as a key to identify rows to update or delete on the subscriber.
UPDATE and DELETE fail on the publisher if the published table has no replica identity. INSERT succeeds.
Can be the primary key (the default), a unique index, or the full row.
Can be configured by ALTER TABLE ... REPLICA IDENTITY.
The old values of replica identity columns are WAL-logged.
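For example (mytable and the index name are placeholders; the index must be unique):
ex.
ALTER TABLE mytable REPLICA IDENTITY USING INDEX mytable_uq_idx;
ALTER TABLE mytable REPLICA IDENTITY FULL;  -- or fall back to the full row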
Tuning performance
max_sync_workers_per_subscription
Multiple tablesync workers (one for each table) will run in parallel based on the max_sync_workers_per_subscription configuration.
This may be effective to speed up initial table synchronization when there are many tables in a subscription.
logical_decoding_work_mem
Check pg_stat_replication for transactions spilled to disk. If spill_txns, spill_count, and spill_bytes are high, consider increasing this parameter's value.
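For example (the location of the spill_* columns varies across versions; pg_stat_replication is assumed per the text above):
ex.
SELECT application_name, spill_txns, spill_count, spill_bytes FROM pg_stat_replication;
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();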
Replication conflicts
The application of incoming changes on the subscriber may fail due to a constraint violation or lack of permission. This is a conflict.
Resolving conflicts:
Disable the subscription, if it is not already disabled, by running "ALTER SUBSCRIPTION name DISABLE;".
The subscription can be configured to be automatically disabled when the apply worker detects an error. Run "ALTER SUBSCRIPTION ... SET (disable_on_error = on);".
Look up the replication origin name and the end LSN of a conflicting transaction in the server log.
Do either of:
skip the transaction from the publisher by running "ALTER SUBSCRIPTION ... SKIP (end LSN of the conflicting transaction)" or "SELECT pg_replication_origin_advance(rep_origin_name, next LSN of the end of the conflicting transaction)"
fix the table data on the subscriber.
Enable the subscription by running "ALTER SUBSCRIPTION name ENABLE;".
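A sketch of the skip step, assuming PostgreSQL 15-compatible syntax (the subscription name and LSN are placeholders; the LSN is the conflicting transaction's end LSN from the server log):
ex.
ALTER SUBSCRIPTION my_sub DISABLE;
ALTER SUBSCRIPTION my_sub SKIP (lsn = '0/14C0378');
ALTER SUBSCRIPTION my_sub ENABLE;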