3.5. Manageability

3.5.1. Memory
3.5.2. Storage
3.5.3. Logging and debugging
3.5.4. Vacuum
3.5.5. Upgrade

3.5.1. Memory

A large result set causes out of memory on the client

  • When running SELECT, ltsql retrieves the entire result set and stores all rows in the client memory.

  • With the FETCH_COUNT variable like "ltsql -v FETCH_COUNT=100 ...", ltsql uses a cursor and issues DECLARE, FETCH, and CLOSE to retrieve the result set piecemeal.

  • Client drivers have similar facilities, such as psqlODBC's UseDeclareFetch and ltJDBC's defaultRowFetchSize connection parameters.
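
  • For illustration, the cursor-based retrieval triggered by FETCH_COUNT is roughly equivalent to the following SQL (the cursor and table names here are hypothetical):

    BEGIN;
    DECLARE _ltsql_cursor NO SCROLL CURSOR FOR SELECT * FROM big_table;
    FETCH FORWARD 100 FROM _ltsql_cursor;  -- repeated until fewer than 100 rows are returned
    CLOSE _ltsql_cursor;
    COMMIT;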

Common causes of server-side out of memory (OOM) issues

  • A high number of connections

    • Even idle connections can continue to hold a lot of memory: Lightdb keeps database object metadata in memory for the duration of the session, for performance. A bloated CacheMemoryContext is the telltale sign (a query to inspect it follows this list).

    • If connection pooling is used, many connections may consume large amounts of memory over time. This is because connections are picked from the pool at random, used to access some relations, and released back to the pool, so many sessions end up accumulating the metadata of many relations.

  • A high value of work_mem

    • It is advisable not to set a high value for work_mem at the instance level (lightdb.conf) or the database level (ALTER DATABASE): many sessions could allocate that amount of memory simultaneously. Worse, a single SQL statement can run several sort and/or hash operations, possibly in parallel, each of which can allocate up to work_mem.

    • For a hash-based operation, work_mem * hash_mem_multiplier bytes of work memory will be allocated at maximum.

  • Low max_locks_per_transaction

    • Each lockable object (e.g., table, index, sequence, XID, but not row) is allocated an entry in the lock table when it is locked. The entry records the lockable object, the lock holders and waiters, and the granted/requested lock modes.

    • The lock table is allocated in shared memory. Its size is fixed at server startup.

    • The default value of max_locks_per_transaction is 64. This means that each transaction is expected to lock 64 or fewer objects.

    • The number of entries in the lock table is (max_connections + max_prepared_transactions + alpha) * max_locks_per_transaction.

    • One transaction can use more than max_locks_per_transaction entries, if those are available.

    • If many concurrent transactions each access a large number of objects, say, touch hundreds or thousands of partitions, increase max_locks_per_transaction.
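
  • A minimal sketch for checking both conditions from SQL; it assumes a server version that provides the pg_backend_memory_contexts view:

    -- session's catalog metadata cache (the CacheMemoryContext mentioned above)
    SELECT name, total_bytes, used_bytes
    FROM pg_backend_memory_contexts
    WHERE name = 'CacheMemoryContext';

    -- lock table: capacity parameters vs. entries currently in use
    SHOW max_locks_per_transaction;
    SHOW max_connections;
    SHOW max_prepared_transactions;
    SELECT count(*) AS lock_entries_in_use FROM pg_locks;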

Cannot retrieve a large bytea value

  • For example, after successfully inserting a 550 MB bytea column value, fetching it fails with an error message like "invalid memory alloc request size 1277232195".

  • Why?

    • When the Lightdb server sends query results to the client, it either converts the data to text format or returns it in binary format.

    • ltsql and client drivers instruct the server to use text format.

    • Lightdb uses either hex or escape format when it converts bytea data to text format. The default format is hex. hex and escape formats use 2 and 4 bytes respectively to represent each original byte in text format.

      • ex. SELECT 'abc'::bytea; returns \x616263

    • The Lightdb server allocates one contiguous memory area to convert each column value to text format. This allocation is limited to 1 GB - 1 bytes, a restriction that stems from the handling of variable-length data types in TOAST.

    • Due to this limitation, Lightdb cannot return bytea data over 500 MB in text format.
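
  • A workaround is to fetch the value in chunks so that no single text conversion hits the limit. A sketch, assuming a hypothetical table large_blobs(id, payload bytea):

    -- check the stored size first (computed server-side; nothing large is transferred)
    SELECT octet_length(payload) FROM large_blobs WHERE id = 1;

    -- fetch 100 MB at a time; each chunk stays well below 1 GB - 1 even after hex doubling
    SELECT substring(payload FROM 1 FOR 100*1024*1024) FROM large_blobs WHERE id = 1;
    SELECT substring(payload FROM 1 + 100*1024*1024 FOR 100*1024*1024) FROM large_blobs WHERE id = 1;
    -- ...continue with increasing offsets until the returned chunk is empty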


3.5.2. Storage

Common causes of full storage

  • Table bloat because vacuum cannot remove dead tuples: the reasons dead tuples remain are described separately.

  • WAL accumulation: the reasons WAL volume continues to grow are described separately.

  • Server log files

    • Excessive logging because of ltAudit, auto_explain, and other logging parameters such as log_statement and log_min_duration_statement

    • Log rotation and purge are not configured properly: log_rotation_age, log_rotation_size, log_truncate_on_rotation

  • Temporary files are created

    • work_mem is too small and/or the query plan is bad.

    • A holdable cursor is kept open.

      • e.g. DECLARE cur CURSOR WITH HOLD FOR SELECT * FROM mytable; COMMIT;

      • During the commit, the result set of the holdable cursor is materialized: up to work_mem is kept in a work memory area, and content beyond that is spilled to a temporary file.

    • Check temporary file usage with:

      • temp_files and temp_bytes of pg_stat_database

      • log_temp_files = 0, which logs the path and size of every temporary file when it is deleted

      • query plans obtained by EXPLAIN ANALYZE or auto_explain
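
    • For example, the statistics check could look like this (the database name is hypothetical):

      SELECT datname, temp_files, temp_bytes
      FROM pg_stat_database
      WHERE datname = 'mydb';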

Storage quota

  • Lightdb cannot constrain storage usage except for temporary files.

  • temp_file_limit can limit the total size of temporary files used by each session at any instant. A transaction exceeding this limit will be aborted.

  • If you want to limit the size of a database, table, or WAL ($LTDATA/lt_wal/), put it in a tablespace on a file system with limited size.

    • The tablespace of a database/table can be specified explicitly by CREATE/ALTER DATABASE/TABLE ... TABLESPACE, or implicitly by the default_tablespace parameter.

    • temp_tablespaces can be used to specify where temporary files are created for temporary tables/indexes and sort/hash operations.

    • WAL directory can be specified by lt_initdb's --waldir option. Also, after the database cluster has been created, it can be moved outside the data directory and linked with a symbolic link.
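
    • A sketch of confining a database and temporary files to size-limited file systems (paths and names are hypothetical):

      CREATE TABLESPACE quota_ts LOCATION '/mnt/limited_fs/data_ts';
      CREATE TABLESPACE temp_ts  LOCATION '/mnt/limited_fs/temp_ts';

      ALTER DATABASE mydb SET TABLESPACE quota_ts;   -- move an existing database
      ALTER SYSTEM SET temp_tablespaces = 'temp_ts'; -- instance-wide temporary files
      SELECT pg_reload_conf();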


3.5.3. Logging and debugging

FATAL: database is starting up

  • In old major versions, this message could be output at 1-second intervals during server startup.

  • This may look startling, but it's not an actual problem.

  • "Why? lt_ctl start" launches postmaster in the background, and tries to connect to the database at 1 second intervals. If the connection is successful, lt_ctl returns success. Otherwise, if the server is still performing recovery and unable to accept connections, the above message is reported.

  • In newer major versions, you won't see the message any more: lt_ctl no longer attempts a connection. Instead, postmaster writes "ready" to postmaster.pid when it can accept connections, and lt_ctl checks for it.

Avoid excessive logging by restricting targets

  • Many parameters, not just logging ones, can be configured per user, per database, or per user-database combination. For example:

    • ALTER USER oltp_user SET log_min_duration_statement = '3s';

    • ALTER DATABASE analytics_db SET log_min_duration_statement = '60s';

    • ALTER USER batch_user IN DATABASE oltp_db SET log_min_duration_statement = '30s';

Debug-logging can be enabled for a session without cluttering the server log

  • It may not be acceptable to set log_min_messages to DEBUG1 - DEBUG5 globally, because that would produce voluminous server logs.

  • You can obtain debug messages for a particular operation only on the client like this:

    • export LTOPTIONS="-c client_min_messages=DEBUG5"
      ltsql -d postgres -c "select 1"

Find out what ltsql's backslash commands do

  • Use ltsql's -E/--echo-hidden option: it reveals the queries issued behind the scenes.
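
  • For example, ltsql -E -d postgres -c '\dt' first echoes the catalog query (a SELECT over pg_class and related catalogs) that implements \dt, then prints the table list.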

Deleting duplicate rows

  • The following query deletes duplicate rows, keeping the one with the minimum ctid, and displays the content of the deleted rows.

  • ctid is a system column that represents the physical location of a row version within its table: (block number, item ID). ctids can change due to UPDATE and VACUUM FULL, so it is probably safest to lock the table in Share or a stronger mode during this operation (see the sketch after the query).

WITH x AS (
    SELECT some_table AS dup, min(ctid) AS min_ctid
    FROM   some_table
    GROUP BY 1
    HAVING count(*) > 1
)
DELETE FROM some_table
USING  x
WHERE  (some_table) = (x.dup)
  AND  some_table.ctid <> x.min_ctid
RETURNING some_table.*;
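
  • Given the ctid caveat above, a minimal sketch of running the deletion under a table lock:

    BEGIN;
    LOCK TABLE some_table IN SHARE MODE;  -- blocks concurrent UPDATE/DELETE, which could move ctids
    -- ... run the DELETE ... RETURNING statement above ...
    COMMIT;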

3.5.4. Vacuum

Purposes of vacuum

  • To recover or reuse disk space occupied by updated or deleted rows.

  • To update data statistics used by the Lightdb query planner.

  • To update the visibility map, which speeds up index-only scans.

  • To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.

Vacuum types

  • Concurrent (lazy, or regular) vacuum

    • Acquires a Share Update Exclusive lock on the target relation. Does not prevent SELECT and DML commands.

    • Keeps the original data files and modifies them. TIDs do not change.

    • The data file shrinks only when there is more than a certain number of contiguous empty blocks at its end. Unused space in the middle of the file is kept for reuse.

    • Reports its progress in the pg_stat_progress_vacuum view.

  • FULL vacuum

    • Acquires an Access Exclusive lock on the target relation. Prevents SELECT and DML commands.

    • Copies live tuples from the old data files to new data files, then removes the old files. Rebuilds indexes. TIDs change.

    • The resulting data files are fully packed and minimal in size.

    • Can use up to twice the disk space: one copy for the existing relation and another for the new one.

    • Always aggressively freezes tuples.

    • The actual processing is the same as CLUSTER.

    • Reports its progress in the pg_stat_progress_cluster view.

  • Autovacuum never runs FULL vacuum.
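
  • Either vacuum type can be monitored while it runs; for example:

    -- concurrent (lazy) vacuum
    SELECT pid, relid::regclass, phase, heap_blks_scanned, heap_blks_total
    FROM pg_stat_progress_vacuum;

    -- VACUUM FULL (reported through the CLUSTER progress view)
    SELECT pid, relid::regclass, phase, heap_tuples_scanned
    FROM pg_stat_progress_cluster;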

Main steps of vacuum

  1. Starts a transaction.

    • When there are multiple target relations, vacuum starts and commits a transaction for each relation to release locks as soon as possible.

  2. Gets a Share Update Exclusive lock on a heap and opens it. A non-wrap-around-prevention vacuum gives up vacuuming the relation if it cannot get the lock, emitting the following message.

    • LOG: skipping vacuum of "rel_name" --- lock not available

  3. Gets Row Exclusive locks for the indexes and opens them.

  4. Allocates the work memory to accumulate the TIDs of dead tuples.

  5. Repeats the following steps until the entire heap has been processed:

    • Scans the heap: accumulates dead tuple TIDs in the work memory until it fills up or the end of the heap is reached. The item IDs of the dead tuples are retained for now. Also prunes and defragments each page if required, and possibly freezes live tuples.

    • Vacuums the indexes: deletes index entries that contain dead tuple TIDs.

    • Vacuums the heap: reclaims the item IDs of the dead tuples. This is done here, not while scanning the heap, because an item ID cannot be freed until the index entries pointing to it have been deleted.

    • Updates the FSM and VM during the above processing.

  6. Cleans up the indexes.

    • Updates each index's statistics in pg_class (relpages and reltuples).

    • Closes the indexes but retains their locks until transaction end.

  7. Truncates the heap to return empty pages at the end of the relation to the operating system.

    • The data file is truncated if the heap has at least the lesser of 1,000 blocks and (relation_size / 16) contiguous empty blocks at its end.

    • Takes an Access Exclusive lock on the heap. If another transaction holds a conflicting lock, waits for at most 5 seconds; if the lock cannot be obtained, gives up truncating.

    • Scans the heap backwards to verify that the pages at the end are still empty. Periodically checks whether another transaction is waiting for a conflicting lock; if someone else is waiting, releases the Access Exclusive lock and gives up truncating.

  8. Updates relation stats.

    • Updates pg_class's relpages, reltuples, relallvisible, relhasindex, relhasrules, relhastriggers, relfrozenxid, and relminmxid.

  9. Closes the relation.

  10. Commits the transaction.

  11. Vacuums the relation's TOAST table.

  12. Repeats the above processing for each relation.

  13. Updates database stats.

    • Updates pg_database.datfrozenxid to be the minimum of pg_class.relfrozenxid values, and truncates commit log in lt_xact/.

    • Updates pg_database.datminmxid to be the minimum of pg_class.relminmxid values, and truncates MultiXact data in lt_multixact/.

Autovacuum is designed to be non-intrusive

  • Autovacuum takes a rest (sleep) every time it has done a certain amount of work. Therefore, it does not continuously consume resources.

    • "A certain amount of work" and the sleep time can be configured by autovacuum_vacuum_cost_limit and autovacuum_vacuum_cost_delay respectively. autovacuum_vacuum_cost_delay is 2 ms by default.

  • Autovacuum skips the relation if it cannot get a lock on it due to some conflicting lock. Wrap-around-prevention autovacuum does not do this.

  • A concurrent transaction cancels a non-aggressive autovacuum if it has waited on a conflicting relation lock for deadlock_timeout and found that the lock is held by autovacuum. These messages can be seen:

    • ERROR: canceling autovacuum task

    • DETAIL: automatic vacuum of table "mytable"

  • Vacuum skips reading data pages if VM shows that they have only tuples visible to all transactions (all-visible bit is set in VM). Aggressive vacuum reads even such pages to freeze tuples.

  • Vacuum skips reading data pages if VM shows that they have only frozen tuples (all-frozen bit is set in VM).

  • Autovacuum performs reduced work on a data page when it cannot immediately get an exclusive LWLock on it. Wrap-around-prevention autovacuum does not do this.

  • Vacuum gives up truncating the relation if another transaction holds or is waiting for a lock on the target relation.

Autovacuum is not run against a relation

  • Check the following to see whether that is the case (a sample query follows this list).

    • last_autovacuum and autovacuum_count columns of pg_stat_all_tables

    • Server logs after setting log_autovacuum_min_duration to 0

  • Common reasons

    • The relation is not eligible for autovacuum. A relation becomes a target only when:

      • UPDATE/DELETE-mostly relations: updated/deleted tuples >= autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * pg_class.reltuples

      • INSERT-mostly relations: inserted tuples >= autovacuum_vacuum_insert_threshold + autovacuum_vacuum_insert_scale_factor * pg_class.reltuples

    • Autovacuum workers are busy with many other and/or large relations.

    • Some transactions continuously request or hold a conflicting relation lock for a long time. A non-wrap-around-prevention vacuum gives up on such relations.

    • Autovacuum cannot vacuum temporary tables; manual vacuum needs to be run. Neglecting this might lead to XID wrap-around and a forced database shutdown.

    • Statistics stored in lt_stat/ were lost due to a crash or archive recovery, including failover; those statistics are always reset during recovery. Autovacuum depends on the statistics, visible via pg_stat_all_tables, to determine whether vacuuming is needed.
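
  • A sketch of the pg_stat_all_tables check mentioned at the top of this list (the table name is hypothetical):

    SELECT relname, n_live_tup, n_dead_tup,
           last_vacuum, last_autovacuum, autovacuum_count
    FROM pg_stat_all_tables
    WHERE relname = 'mytable';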

Why vacuum does not remove dead tuples

  • Slow autovacuum

  • Long-running transactions

  • Physical standbys with hot_standby_feedback = on

  • Unused replication slots

  • Orphaned prepared transactions
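
  • Each of these blockers can be spotted in the statistics views and system catalogs; a sketch:

    -- long-running transactions, oldest first
    SELECT pid, state, xact_start, backend_xmin
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY xact_start;

    -- replication slots still pinning old tuples
    SELECT slot_name, active, xmin, restart_lsn FROM pg_replication_slots;

    -- orphaned prepared transactions
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;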

Reducing the risk of XID wrap-around

  • Reduce XID consumption.

    • Each subtransaction allocates its own XID. A subtransaction is started by SAVEPOINT and PL/lightdb's exception block (BEGIN ... EXCEPTION).

    • Some client drivers offer statement-level rollback, which encloses each SQL statement with SAVEPOINT and RELEASE SAVEPOINT.

  • Make autovacuum run smoothly (see above).

  • Lower autovacuum_vacuum_insert_scale_factor or autovacuum_freeze_max_age so that autovacuum processes the table more frequently.

  • Schedule regular VACUUM FREEZE runs.
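
  • Wrap-around risk can be watched via the age of the oldest XID; autovacuum forces an aggressive vacuum when it approaches autovacuum_freeze_max_age (200 million by default):

    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database ORDER BY 2 DESC;

    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind IN ('r', 't', 'm')
    ORDER BY 2 DESC LIMIT 10;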

Configuration to speed up autovacuum

  • Lower autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, autovacuum_vacuum_insert_threshold, autovacuum_vacuum_insert_scale_factor for large tables.

  • Decrease autovacuum_naptime

    • Even 1s is practical if the write workload is heavy and the host has many CPU cores.

  • Increase autovacuum_max_workers

    • Effective when there are many relations. Each relation is handled by only one autovacuum worker.

    • Increase autovacuum_vacuum_cost_limit as well. Otherwise, each autovacuum worker would sleep more frequently, because cost limits are shared among all active autovacuum workers.

  • Increase maintenance_work_mem/autovacuum_work_mem

    • The work memory stores an array of dead tuple TIDs. A TID is (block no, item no), which is 6 bytes.

    • Setting a large value reduces the number of index scans.

    • The maximum allocated size is 1 GB - 1 no matter how large the parameter values are.

    • It does not always allocate the specified size: the actual allocation is just large enough to accommodate all possible TIDs, so it will be small for small tables.

    • For a table with no indexes, less than 2 KB is allocated: vacuum accumulates TIDs for only one table block at a time because it does not need to scan indexes.

  • Increase vacuum_buffer_usage_limit

    • Vacuum uses 256 KB of ring buffers by default to cache data pages, so that it does not evict pages that are likely to be used by applications.

    • Vacuum also benefits from caching pages: heap pages are read twice, and index pages may be read more than once.

    • Setting this to 0 allows vacuum to use shared buffers without limit.

  • Decrease autovacuum_vacuum_cost_delay, increase autovacuum_vacuum_cost_limit

    • Setting autovacuum_vacuum_cost_delay to 0 keeps autovacuum running without pauses, like a manual vacuum.

  • Partition a large table so that multiple autovacuum workers can work on its partitions concurrently.

  • Delete unnecessary indexes.

    • Autovacuum processes indexes one at a time. (Manual vacuum can process them in parallel with its PARALLEL option.)
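
  • Most of the above can also be applied per table via storage parameters, avoiding instance-wide side effects. A sketch with hypothetical names:

    -- vacuum one large, frequently updated table more eagerly than the defaults
    ALTER TABLE hot_big_table SET (
        autovacuum_vacuum_scale_factor = 0.01,
        autovacuum_vacuum_cost_delay   = 0
    );

    -- instance-wide changes can go in lightdb.conf or through ALTER SYSTEM, e.g.:
    ALTER SYSTEM SET autovacuum_naptime = '10s';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;
    SELECT pg_reload_conf();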


3.5.5. Upgrade

Characteristics of versions

  • Major version

    • Contains new features and incompatibilities.

    • Released once a year.

    • Sensitive bug fixes are incorporated only into the latest major version. "Sensitive" includes fixes that could introduce incompatibilities or adverse effects (such as instability or security issues), or that would require extensive code changes not worth the benefit.

    • The upgrade can skip intervening major versions.

    • Always requires careful planning and testing to deal with incompatible changes.

  • Minor version

    • Contains fixes only for frequently encountered bugs, security issues, and data corruption problems, to reduce the risk associated with upgrading.

    • Running the latest minor version is always recommended. The community considers not upgrading to be riskier than upgrading.

    • Released at least once every three months, on the second Thursday of February, May, August, and November. Additional minor versions may be released to address urgent issues.

    • The upgrade can skip intervening minor versions.

    • Does not normally require a dump and restore; you can stop the database server, install the updated binaries, and restart the server.

    • Additional manual steps may be required for some minor versions to remedy the effects of fixed bugs, such as rebuilding affected indexes. See the section "Migration to Version <major>.<minor>" in the release notes.

Major upgrade methods

  1. lt_dumpall/lt_dump and ltsql/lt_restore: easy, long downtime

  2. lt_upgrade: relatively easy, shorter downtime

  3. Logical replication: complex setup and operation, minimal downtime

Overview of lt_upgrade

  • Upgrades a database cluster to a later major version without dump/restore of user data.

  • Not an in-place upgrade: migrates data from an old database cluster to a new database cluster freshly created with lt_initdb.

  • The basic idea: because the relation data storage format rarely changes and only the layout of system catalogs changes, lt_upgrade just dumps and restores the database schema and uses the relation data files as-is.

  • Downgrade is not possible.

  • Does not migrate database statistics in pg_statistic. The user needs to run ANALYZE in every database after lt_upgrade completes.
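
  • A sketch of that post-upgrade step, run once per database (database names are hypothetical):

    ltsql -d mydb -c "ANALYZE;"
    ltsql -d analytics_db -c "ANALYZE;"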

Main steps of lt_upgrade

  1. Creates output directory for log and intermediate files.

  2. Checks that the major version of the target cluster is newer, and that the old and new clusters are binary-compatible by comparing information in lt_control.

  3. Gets the list of databases and relations (table, index, TOAST table) of the old cluster.

  4. Gets the list of library names that contain C-language functions.

  5. Performs various checks to find blockers of upgrade, such as the inability to connect to databases and the presence of prepared transactions.

  6. Creates a dump of global objects by running lt_dumpall --globals-only.

  7. Creates dumps of each database by running lt_dump --schema-only. This is parallelized by spawning one process or thread for each database when --jobs is specified.

  8. Checks that the loadable libraries containing C-language functions, extracted earlier, exist in the new cluster by running LOAD.

  9. Copies commit log files in lt_xact/ and MultiXact files in lt_multixact/ from the old cluster to the new cluster.

  10. Sets the next XID and MultiXact ID of the new cluster to carry on from the old cluster.

  11. Restores global objects in the new cluster by running ltsql.

  12. Restores database schemas in the new cluster by running lt_restore. This is parallelized by spawning one process or thread for each database when --jobs is specified.

  13. Gets the list of databases and relations (table, index, TOAST table) of the new cluster.

  14. Links or copies user relation files from the old cluster to the new cluster. This is parallelized by spawning one process or thread for each tablespace when --jobs is specified.

  15. Sets the next OID for the new cluster.

  16. Creates a script to delete the old cluster (delete_old_cluster.sh). This script removes the data directory and tablespace version directories.

  17. Reports extensions that should be updated and creates update_extensions.sql. This script contains a list of ALTER EXTENSION ... UPDATE commands.

Log files for troubleshooting lt_upgrade

  • Stored in $NEWPGDATA/pg_upgrade_output.d/<timestamp>/

  • Removed when lt_upgrade completes successfully.

  • Files:

    • pg_upgrade_server.log: The Lightdb server log, passed via lt_ctl's -l option.

    • pg_upgrade_dump_<DB-OID>.log: Logs of lt_dump and lt_restore.

    • pg_upgrade_utility.log: Logs of miscellaneous commands run by lt_upgrade, such as ltsql, lt_resetwal. This includes lt_dumpall/ltsql to dump and restore global objects.

    • pg_upgrade_internal.log: Other lt_upgrade logs.

    • loadable_libraries.txt: List of C-language function libraries that exist in the old cluster but are not found in the new cluster.