Want to handle many concurrent clients? Then do:
Set up connection pooling on each application server as well as on a central server.
Limit max_connections and the actual number of connections to a few times the number of CPU cores, or at most a few hundred.
Performance tends to drop above this limit, mainly because of:
high memory usage by client backends, possibly leading to swapping.
CPU context switches
CPU cache line contention
locks, particularly spinlocks: if one process holds a spinlock and other processes come to the same protected section, those latecomers spin-wait on the lock and continue to consume CPU.
processing of Lightdb internal data structures: some data structures and their processing depend on the number of connections; creating a snapshot stands out here
Even idle connections are not innocent. They contribute to high resource usage.
Increase track_activity_query_size for ORMs (Object Relational Mappers)
Some views such as pg_stat_activity and pg_stat_statements show query strings.
Because those query strings are stored in fixed-size shared memory, the maximum length of each stored query string is fixed at track_activity_query_size. Longer queries are truncated at this limit.
Hibernate and some other ORMs produce very long queries. It may be useful to set track_activity_query_size to 32 KB or so.
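A minimal sketch of such a change (the 32 KB value is only an example; this parameter requires a server restart to take effect):
ALTER SYSTEM SET track_activity_query_size = 32768;
-- restart the server afterwards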
Server logging could block but not show waits
The logging collector (logger) is the sole process to write logs to server log files.
Every backend process writes logs to its standard error, which is connected through a Unix pipe to the read endpoint of the logger. The logger reads messages from the pipe and writes them to files.
If the pipe gets full, the backend can block when writing to it. This happens when the logger falls behind due to an overwhelming amount of logging, for example, when some combination of ltAudit, auto_explain, and log_min_duration_statement is used and many concurrent sessions are running short queries.
The block on the pipe is not treated as a wait event (possibly by mistake), so the backend appears to be consuming CPU.
ALTER TABLE SET UNLOGGED/LOGGED is heavy
This rewrites the entire table into new data files and WAL-logs those writes.
Hence, you cannot use it for efficient data loading - switching the table to UNLOGGED, loading data into it, and setting it back to LOGGED.
Queries or the first vacuum are slow after loading data with COPY
This is because those commands have to set hint bits for rows they want to see. Setting hint bits modifies shared buffers and WAL-logs the changes if data checksums are enabled. These could generate massive writes.
COPY (FREEZE) comes to the rescue. The FREEZE option freezes the loaded rows and sets their hint bits.
The table must have been created or truncated in the current subtransaction. This is to prevent other transactions from seeing the frozen rows before the COPY transaction commits.
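A hedged sketch of such a load (the table and file names are hypothetical); the TRUNCATE and COPY must run in the same transaction:
BEGIN;
TRUNCATE loaded_table;
COPY loaded_table FROM '/path/to/data.csv' WITH (FORMAT csv, FREEZE);
COMMIT;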
Speeding up queries via postgres_fdw
Run ANALYZE manually on foreign tables
autovacuum does not execute ANALYZE on foreign tables. Hence, the local statistics may get stale and lead to poor query plans.
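For example (remote_orders is a hypothetical foreign table name):
ANALYZE remote_orders;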
Enable use_remote_estimate for long-running queries
i.e.,
ALTER SERVER/FOREIGN TABLE ... OPTIONS (use_remote_estimate 'true');
This makes postgres_fdw issue EXPLAIN to perform the cost estimate on the remote server.
The query planning time will get longer due to the round-trip for EXPLAIN. So, this may not be worth the cost for short queries. You can use different foreign servers/tables with different settings for OLTP, batch, and analytics workloads.
Increase fetch_size
i.e.,
ALTER SERVER/FOREIGN TABLE ... OPTIONS (fetch_size '1000');
postgres_fdw uses a cursor to fetch rows from a foreign table. fetch_size determines the number of rows to fetch at a time. The default is 100.
If the network latency is high, reducing the round-trips by increasing this setting may help. Be aware that higher values require more memory to store fetched rows.
Increase batch_size
i.e.,
ALTER SERVER/FOREIGN TABLE ... OPTIONS (batch_size '1000');
By default, postgres_fdw inserts one row at a time into a foreign table during multi-row inserts (INSERT ... SELECT, INSERT ... VALUES (row1), (row2), ..., and COPY FROM).
Raising this setting will dramatically increase the throughput, particularly where the network latency is high.
List in the extensions option the names of extensions whose behavior is compatible between the local and remote servers
i.e.,
ALTER SERVER ... OPTIONS (extensions 'extension1,extension2');
The immutable functions and operators in those extensions are assumed to produce the same results on the local and remote servers. As a result, their execution can be shipped to the remote server.
This is particularly beneficial when those functions and operators are used in the WHERE clause. Those filters will be executed on the remote server, and thus fewer rows are transferred.
Full-text search queries got much slower after inserting many new documents
When inserting data into a GIN index that has fastupdate enabled, the new index entries are not put into the main index structure. Instead, they are placed in the index's pending list, whose maximum size is set by gin_pending_list_limit. Later, when the pending-list area becomes full, those pending-list entries are moved to the main index structure.
This is done for insertion performance: adding one document would otherwise require many insertions into the main index structure, roughly one for each distinct word in the document.
Full-text search queries scan the pending-list before the main index structure. Therefore, they are slow if the pending-list contains many pending entries.
Vacuum, including autovacuum, also moves the pending-list entries into the main index. So, the full-text search query will be faster after vacuum.
It is advised to tune autovacuum so that it runs reasonably frequently after inserting or updating documents.
The number of pending-list pages and tuples can be seen with this query (pgstatginindex is provided by the ltstattuple extension):
SELECT * FROM pgstatginindex('some_gin_index');
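Assuming Lightdb also provides PostgreSQL's built-in gin_clean_pending_list() function, the pending list can be flushed into the main index structure manually (the index name is hypothetical):
SELECT gin_clean_pending_list('some_gin_index');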
Fast random sampling of table rows
The traditional method is slow, because it scans and sorts the entire table.
SELECT * FROM mytable ORDER BY random() LIMIT 1;
Using the TABLESAMPLE clause returns rows very quickly, almost independently of the table size.
TABLESAMPLE fetches a sample portion of a table. Some built-in sampling methods are provided.
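For example, the built-in SYSTEM and BERNOULLI methods take a sampling percentage (a sketch; mytable is a hypothetical table):
SELECT * FROM mytable TABLESAMPLE SYSTEM (1);     -- samples about 1% of the blocks
SELECT * FROM mytable TABLESAMPLE BERNOULLI (1);  -- samples about 1% of the rows, scanning all blocks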
Also, the sampling method can be customized by adding an extension. For example, tsm_system_rows retrieves a specified number of random rows:
CREATE EXTENSION tsm_system_rows;
SELECT * FROM mytable TABLESAMPLE SYSTEM_ROWS(1);
SYSTEM_ROWS picks a random block in the table's data file and then fetches rows sequentially from it. If more rows are necessary, additional blocks are chosen.
Use huge pages
Set huge_pages = on.
This can reduce memory usage dramatically because the page tables get much smaller.
Also, improved performance can be expected thanks to fewer CPU TLB misses.
"on" should be preferred for huge_pages to "try", considering the reduced memory usage and improved performance as a part of stable operation.
When huge_pages is set to "on" and the OS cannot allocate enough huge pages, Lightdb refuses to start, emitting the following messages:
FATAL: could not map anonymous shared memory: Cannot allocate memory
HINT: This error usually means that Lightdb's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 1234567890 bytes), reduce Lightdb's shared memory usage, perhaps by reducing shared_buffers or max_connections.
In this case, reboot the OS or perform failover.
Tips for shared buffers
Avoid disk writes by client backends.
If there is no free shared buffer when the server process wants a new page, it has to evict a used buffer. If the evicted page is dirty, the server process needs to write the page to disk. This adds to the response time.
This undesirable situation can be detected by checking whether buffers_backend in pg_stat_bgwriter is high. If buffers_backend_fsync is also high, the situation is worse.
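A quick check might look like this (columns as described in pg_stat_bgwriter):
SELECT buffers_backend, buffers_backend_fsync, buffers_clean, maxwritten_clean FROM pg_stat_bgwriter;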
To alleviate this:
Make more free buffers: increase shared_buffers.
Make more clean buffers: increase bgwriter_lru_multiplier so that the background writer writes dirty buffers more aggressively. If the maxwritten_clean of pg_stat_bgwriter rises frequently, try increasing bgwriter_lru_maxpages.
Large shared buffers may be counterproductive.
The benefits of shared buffers are diminished on a host with high-performance storage.
That's because Lightdb uses the OS's filesystem cache: the data is cached in the filesystem cache as well as in shared buffers (double buffering).
Therefore, start with 25% of RAM for shared buffers, and then increase it up to around 40% as long as you can see some improvement.
However, some benchmarks have demonstrated that 64 GB or more can do harm.
Leverage lt_prewarm to quickly regain performance after failover.
After a database server restart or failover, the contents of shared buffers are empty or can be quite different from what they were before. Thus, application response times get worse until the shared buffers are warmed up.
Add lt_prewarm in shared_preload_libraries, and set lt_prewarm.autoprewarm to on.
This launches the autoprewarm worker, which periodically saves to a file the list of relations and block numbers cached in shared buffers. At server startup, the worker reads the file to refill shared buffers.
"SELECT * FROM some_table;"
does
not necessarily cache the entire table.
You may want to do this for performance test or application warmup, but it doesn't work. Also, it does not cache indexes at all.
If the size of a relation is larger than a quarter of shared buffers, its sequential scan only uses 256 KB of shared buffers.
The idea behind this is that a page that has been touched only by such a scan is unlikely to be needed again soon, so Lightdb tries to prevent such large sequential scans from evicting many useful pages out of the shared buffers.
Likewise, bulk writes, e.g., COPY FROM and CREATE TABLE AS SELECT, use only 16 MB of shared buffers.
To cache the entire relation, run SELECT pg_prewarm('relation_name').
This works for indexes as well.
Tips for local memory
Setting work_mem appropriately requires trial and error.
Unfortunately, there is no easy way to estimate a work_mem setting to avoid disk spilling.
The temporary file size reported by log_temp_files is not sufficient as an estimate. Additional overhead for buffering temporary data in memory must be added.
One way to estimate work_mem is to multiply the width by the number of plan rows that are sorted or hashed, as found in the query plan. Add some extra for overhead, say, by further multiplying by 1.1 or so.
If parallel query is used, divide the result by (number of parallel workers used + 1). "+1" is for the parallel leader process.
Run EXPLAIN ANALYZE to see if an external file is used. Try increasing work_mem until the use of external files disappears.
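A rough worked example, with purely illustrative numbers from a hypothetical plan node "Sort (... rows=2000000 width=64)":
-- 2,000,000 rows * 64 bytes = 128 MB; * 1.1 for overhead = about 141 MB
-- with 2 parallel workers: 141 MB / (2 + 1) = about 47 MB per process
SET work_mem = '47MB';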
effective_cache_size does not allocate any memory.
This is only used to estimate the costs of index scans. The planner assumes this amount of memory is available for caching query data.
A higher value makes it more likely that index scans will be used; a lower value makes sequential scans more likely.
Watch out for network latency when running lots of short SQL commands
Did your batch application, which issues a lot of small SQL statements in succession, get many times slower when you migrated it to a different environment?
That may be because the network latency is higher. Check to see if the network communication is slow.
Measure the round-trip time of a simple SQL, e.g.,
\timing on
SELECT 1;
Check if the wait events ClientRead and ClientWrite are increasing.
DECLARE CURSOR is fast. It creates a query plan, but does not calculate the result set. FETCH starts the calculation.
A cursor query is planned differently from a non-cursor query. You can see different query plans for the same SELECT statement.
Non-cursor queries are optimized for total runtime. The optimizer assumes that the client will consume the entire result set.
A sequential scan and sort is more likely to be chosen because the index scan is considered expensive.
Cursor queries are optimized for the runtime of startup and initial data retrieval. The optimizer assumes that the client will fetch only a fraction of the result set.
The optimizer goes for an index scan to speed up the creation of the first 10% of the data.
The "10%" can be configured with cursor_tuple_fraction parameter.
Utilize fast-path locks for high performance
If lots of concurrent short transactions each touch many relations, the lwlocks to protect the lock table can become a contention bottleneck. That contention is visible as the LWLock:LockManager wait event.
Although the lock table is divided into 16 partitions and they are covered by different lwlocks, hundreds of concurrent transactions can lead to waits on those lwlocks.
Fast-path locks come to the rescue:
Weak locks (Access Share, Row Share, and Row Exclusive modes) are taken using the fast-path lock mechanism. It doesn't use the lock table. Instead, those locks are recorded in the per-backend area in shared memory.
SELECT and DMLs take those weak locks, so they don't suffer from the lock manager lwlock contention.
However, fast-path locks cannot be used if either of the following is true:
The transaction already holds 16 fast-path relation locks. The per-backend recording area is limited to 16 entries, so queries that access tables with many partitions and indexes, or that join many tables, lose out.
Some transaction tries to acquire a strong lock (Share, Share Row Exclusive, Exclusive, and Access Exclusive modes).
Existing fast-path locks on the same relation are transferred to the lock table.
If someone holds or is requesting a strong lock, subsequent transactions that acquire a lock even on a different relation may not be able to use the fast path. This is because the presence of strong locks is tracked in an array of 1024 integer counters, which are in effect a 1024-way partitioning of the lock space. If the requested weak lock falls in the same partition as an existing strong lock, it cannot be fast-path.
The fast-path lock shows up in pg_locks as the fastpath column being true.
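For example, to inspect the locks held by the current session:
SELECT locktype, relation::regclass, mode, fastpath FROM pg_locks WHERE pid = pg_backend_pid();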
Take advantage of HOT (Heap-Only Tuple)
HOT speeds up UPDATEs.
What's wrong if HOT is not used?
Indexes will be bigger, because each row version has an index entry in every index. Index scans using those indexes will be slower, too.
WAL volume will be larger and updates will be slower, because an update of any column inserts new entries into all indexes.
For HOT to work, both of the following conditions must be met:
The block containing the updated row has enough free space to accommodate the new row version.
The update does not modify any indexed column.
Then, what should I do?
Set fillfactor on the table to make room for new row versions.
e.g.,
CREATE TABLE mytable ... WITH (fillfactor = 90);
ALTER TABLE mytable SET (fillfactor = 90);
A lower fillfactor makes the table bigger, which results in more shared buffer misses and longer sequential scans.
Maybe you should start with fillfactor = 90, and lower the setting if HOT is not working well.
Drop unnecessary indexes.
How do I know if HOT is working?
Check pg_stat_all_tables to make sure n_tup_hot_upd is high compared to n_tup_upd.
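A possible query for that check:
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_rate
FROM pg_stat_all_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC;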
For best storage efficiency and performance, declare table columns from the largest fixed-length types (e.g., bigint, timestamp) to the smallest fixed-length types (e.g., smallint, bool), then variable-length types (e.g., numeric, text, bytea)
Storage efficiency comes from the data alignment requirements.
For example, bigint is aligned on an 8-byte boundary, while bool is aligned on a 1-byte boundary.
In the following example, the former query returns 48 and the latter returns 39.
SELECT pg_column_size(ROW('true'::bool, '1'::bigint, '1'::smallint, '1'::int));
SELECT pg_column_size(ROW('1'::bigint, '1'::int, '1'::smallint, 'true'::bool));
The alignment requirements can be seen with:
SELECT typalign, typname FROM pg_type ORDER BY 1, 2;
Better performance comes from the aforementioned smaller data size, and direct column access.
If fixed-length columns are placed at the front of the row, Lightdb can calculate and cache their positions within the row. So, a requested fixed-length column of any row can be accessed directly using its offset.
Once a variable-length column appears, the positions of subsequent columns must be calculated for each row by adding up the actual lengths of the preceding columns. As a result, access to columns at the end of the row will be slow.
Use functions returning a composite type in FROM clause instead of SELECT column list
Suppose sample_func()'s return type is a composite type (a int, b int, c int).
Bad: SELECT (sample_func()).*;
Good: SELECT * FROM sample_func();
In the bad case, "(sample_func()).*" is expanded to "(sample_func()).a", "(sample_func()).b", "(sample_func()).c". Thus, the function is called three times.
TOAST (The Oversized-Attribute Storage Technique)
This is a mechanism to store large values of up to 1 GB minus 1 byte.
A tuple cannot span multiple pages. Then, how is a column value stored that is larger than the page size (commonly 8 KB)?
Large column values of TOAST-able data types are compressed and/or broken up into chunks. Each chunk is stored as a separate row in the table's associated TOAST table. The chunk size is chosen so that four chunk rows will fit on a page. That is about 2,000 bytes for 8 KB page size.
A TOAST-able data type is one that has a variable-length (varlena) representation, that is, a 1- or 4-byte varlena header followed by the column value. char(n) looks fixed-length, but it uses the varlena format.
TOAST table
Each table has 0 or 1 TOAST table and TOAST index.
The TOAST table and its index are created in CREATE/ALTER TABLE if needed.
The TOAST table is pg_toast.pg_toast_<main_table_OID>.
The TOAST index is pg_toast.pg_toast_<main_table_OID>_index.
The TOAST table's OID is stored in the table's pg_class.reltoastrelid.
The TOAST index's OID is stored in the TOAST table's pg_class.reltoastidxid.
Every TOAST table has these columns:
chunk_id OID: an OID identifying the particular TOASTed value
chunk_seq int: a sequence number for the chunk within its value
chunk_data bytea: the actual data of the chunk
Primary key (chunk_id, chunk_seq)
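To find a table's TOAST table (mytable is a hypothetical name):
SELECT reltoastrelid::regclass AS toast_table FROM pg_class WHERE relname = 'mytable' AND reltoastrelid <> 0;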
How TOAST works
It's triggered only when a row value to be stored in a table is wider than 2 KB (when the page size is 8 KB).
Compresses and/or moves column values to the TOAST table, until the row value is shorter than 2 KB (when the page size is 8 KB) or no more gains can be had. This 2 KB threshold can be adjusted for each table using the storage parameter toast_tuple_target in CREATE/ALTER TABLE.
Stores the TOASTed value's chunk_id in the main table's column. This is called a TOAST pointer.
The value storage strategy - whether it should be compressed or moved to the TOAST table - can be chosen from four options using:
ALTER TABLE ... ALTER COLUMN column_name SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
The compression method can be chosen between pglz and lz4. It can be set for each column by using the COMPRESSION column option in CREATE/ALTER TABLE or otherwise the default_toast_compression parameter.
Insertion of a TOASTed value could become surprisingly slow.
This tends to be seen when the target table already has millions of TOASTed values, particularly after inserting lots of rows in succession.
Why?
Each TOASTed value is identified by an OID.
OID is an unsigned 4-byte value, which is generated from a cluster-wide counter that wraps around every 4 billion values. Therefore, a single table cannot have more than 2^32 (4 billion) TOASTed values.
When inserting a TOASTed value, Lightdb generates a new OID for it and checks whether an existing TOASTed value in the target table already uses the same OID. If it is used, Lightdb generates the next OID and performs the check again. This is repeated until a free OID is found.
If long runs of consecutive OIDs are already in use in the target table, this retry can take a long time.
The remedy is to partition the table. Each partition has its own TOAST table, so the likelihood of OID collisions in each partition is reduced.
Killed (dead) index tuples can give mysterious query speedup
If you encounter varying execution times for the same execution plan of the same query, that may be thanks to killed index tuples.
Whenever an index scan fetches a heap tuple only to find that it is dead, it marks the index tuple as killed (dead). Then future index scans will ignore it. This avoids its index key comparison as well as its heap tuple fetch.
Subtransactions can be harmful
A subtransaction is a part of a transaction that can be rolled back without rolling back the main (top-level) transaction.
A subtransaction is started explicitly by a SAVEPOINT command, or implicitly when you enter a block with an EXCEPTION clause in PL/lightdb.
Some client drivers provide an option to start and end a subtransaction for every SQL statement, such as ltJDBC's connection parameter "autosave". Watch out for their default values.
Each subtransaction allocates its own XID when it performs an operation that needs an XID, such as modifying data or locking a row.
The tuple header's xmin and xmax fields record the XID of the subtransaction that updated it. For checking tuple visibility, a transaction that sees the xmin/xmax needs to know whether the main transaction, not the subtransaction, has ended.
How to know the main transaction of a subtransaction:
When a subtransaction assigns its XID, it records its direct parent's XID in $LTDATA/lt_subtrans/.
The structure of lt_subtrans is an array of XIDs. For example, the parent XID of XID 100 is stored in the 101st element of the array. The array is divided into 8 KB pages.
The lt_subtrans data is cached in a memory area of 32 pages, managed as SLRU (simple least-recently-used) buffers. So, the cache can hold 32 pages * 8 KB / 4 bytes per XID = 65,536 transactions.
Therefore, to get the main transaction's XID, as many entries as the subtransaction nesting depth need to be traversed.
Is lt_subtrans always examined for tuple visibility?
No. A snapshot stores not only main transactions' XIDs but also subtransactions' XIDs. If the checker's snapshot contains all subtransactions, it can get the job done without consulting lt_subtrans.
However, that's not always the case. Each backend can have at most 64 subtransaction XIDs in its ProcArray entry in shared memory. If the main transaction has more than 64 subtransactions, its ProcArray entry is marked overflowed.
When creating a snapshot, the ProcArray entries of all running transactions are scanned to collect the XIDs of main and sub transactions. If any entry is marked overflowed, the snapshot is marked suboverflowed.
A suboverflowed snapshot does not contain all data required to determine visibility, so the tuple's xmin/xmax must be traced back to their top-level transaction XID using lt_subtrans.
So, what's the problem?
Readers of lt_subtrans contend with writers, who register their parent XIDs, for the lwlocks that protect the SLRU buffers. Readers and writers take the locks in Share and Exclusive mode respectively.
The lt_subtrans cache is not very big. With many concurrent subtransactions, disk I/O arises.
How can I know the possibility of this happening?
The wait events LWLock:SubtransBuffer, LWLock:SubtransSLRU, IO:SLRURead, and IO:SLRUWrite keep growing.
pg_stat_slru shows increasing blks_read and blks_hit in its row for Subtrans.
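For example (the SLRU name may differ across versions; 'Subtrans' is the name used in recent PostgreSQL releases):
SELECT name, blks_hit, blks_read FROM pg_stat_slru WHERE name = 'Subtrans';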
MultiXact can harm performance under the hood
What is MultiXact?
A mechanism to record the XIDs of multiple lockers on a tuple. (Multi-transaction)
The xmax field in the tuple header records the XID that locks the tuple.
Then, what happens when multiple transactions acquire locks on the same tuple?
For example, the first transaction with XID 100 runs SELECT ... FOR SHARE. The xmax becomes 100. Next, the second transaction with XID 101 runs the same SELECT ... FOR SHARE on the same tuple. Then, a new MultiXact ID, say 1, is allocated and set in the xmax field.
The mapping from MultiXact ID 1 to the actual lockers' XIDs (100, 101) is added in $LTDATA/lt_multixact/.
The foreign key constraint is implemented as a constraint trigger that executes "SELECT ... FOR KEY SHARE".
Therefore, MultiXact may be used without your knowledge.
What could be the problem?
Like lt_subtrans, lt_multixact is cached through the SLRU. So, it can suffer from the lwlock contention and disk I/O.
When an XID is added as a new member of an existing MultiXact ID, a new MultiXact ID is allocated and existing member XIDs are copied to a new location. In the above example, when XID 102 joins MultiXact ID 1 with members (100, 101), MultiXact 2 is newly allocated, (100, 101) are copied there, and 102 is added. If many transactions lock the same row concurrently, this copy gets heavier.
How can I know the possibility of this happening?
The wait events LWLock:MultiXact*, IO:SLRURead, and IO:SLRUWrite keep growing.
pg_stat_slru shows increasing blks_read and blks_hit in its rows for MultiXactOffset and MultiXactMember.
pg_get_multixact_members('<MultiXact ID>') returns a set of member XIDs and their lock modes.
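For example (the MultiXact ID 12345 is purely illustrative):
SELECT name, blks_hit, blks_read FROM pg_stat_slru WHERE name IN ('MultiXactOffset', 'MultiXactMember');
SELECT * FROM pg_get_multixact_members('12345');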
Overview of checkpoint
An operation that synchronizes data in memory and on storage by flushing unwritten (dirty) cached data.
When is it performed?
The time specified by checkpoint_timeout has passed since the last checkpoint.
A certain amount of WAL has accumulated, which is based on max_wal_size.
At the start of a base backup (pg_basebackup, pg_start_backup()).
Shutting down the database instance.
Completing any form of recovery.
Other operations that require it, such as CREATE DATABASE, so that data files can be copied or moved without going through shared buffers.
The checkpoint caused by checkpoint_timeout is called a scheduled checkpoint, while the others are called requested checkpoints.
When a checkpoint finishes, old WAL segment files are removed or recycled as new WAL segment files for future reuse, based on min_wal_size.
Here, "old" means "no longer necessary for crash recovery because all the changes in those WAL segments have been persisted to data files."
However, old WAL segment files are kept until they are archived and no longer needed by wal_keep_size or any replication slots.
Checkpoints are intrusive because they cause:
Storage I/O contention, for both data and WAL.
Buffer content lock lwlock contention: While the checkpointer is flushing a shared buffer with its buffer content lock held in Share mode, a transaction that modifies the same buffer, which requires an Exclusive lock, needs to wait for the lwlock to be released.
WAL volume increase due to full page writes.
What are full page writes?
During the first modification to each data page after a checkpoint, the entire page content is WAL-logged instead of just the change. This is necessary to recover a torn page during recovery.
A torn page can result if the host crashes while Lightdb is writing a page. Because the atomic unit of I/O is usually smaller (say, a 512-byte disk sector) than the Lightdb page size (commonly 8 KB), part of a page could end up new and the rest old.
Reducing the impact of checkpoints
Monitor the frequency of checkpoints
The number of scheduled and requested checkpoints can be seen by checkpoints_timed and checkpoints_req respectively in pg_stat_bgwriter.
The vast majority of checkpoints should be scheduled rather than requested. Scheduled checkpoints allow the load to be evenly spread throughout the normal operation of the system. Frequent requested checkpoints are likely to cause variations in performance.
The server log shows the following messages if the elapsed time between two successive checkpoints is shorter than checkpoint_warning and the newer one was requested due to WAL accumulation.
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".
Lower the frequency of checkpoints by increasing max_wal_size and/or checkpoint_timeout.
Note that this can increase the amount of time needed for crash recovery.
Set wal_compression to on. This reduces the WAL for full page writes.
Increase min_wal_size. This reduces the need for transactions to create new WAL segment files.
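A sketch of applying the tuning knobs mentioned above (the values are illustrative, not recommendations):
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET wal_compression = 'on';
SELECT pg_reload_conf();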
Disadvantages of indexes
Indexes consume disk space.
Larger disk space increases the size and duration of physical backups.
Indexes slow down INSERT/UPDATE/COPY statements because those statements have to add entries to all indexes.
Indexes prevent HOT updates. HOT works only for modifications to non-indexed columns.
Benefits of indexes you might not notice
B-tree indexes can speed up the max() and min() aggregates. They can just read the index entries at the end of the index.
Indexes on expressions also gather statistics on the calculated values of the expression.
ex.
CREATE INDEX myindex1 ON mytable ((col1 + col2 * 3));
You can see the statistics of indexed expressions. For example, in the above case, the statistics appear in pg_stats as tablename=myindex1 and attname=expr.
The statistics target can be set for indexed expressions.
e.g.,
ALTER INDEX index_name ALTER COLUMN expr SET STATISTICS 1000;
Indexes on foreign keys speed up constraint processing.
ex.
CREATE TABLE orders (..., product_id int REFERENCES products ON DELETE CASCADE);
You can see the time taken for the constraint cascade processing with EXPLAIN ANALYZE and auto_explain. Foreign key constraints are implemented using triggers internally.
ex.
EXPLAIN ANALYZE DELETE FROM products WHERE product_id = 2;
...
Trigger for constraint orders_product_id_fkey: time=0.322 calls=1
Making an index-only scan work
Use EXPLAIN ANALYZE to see how many times the index-only scan had to read the heap. For example, it shows something like "Heap Fetches: 0". 0 is the best.
Make autovacuum more aggressive or run VACUUM to update the visibility map. That would reduce the heap fetches.
Pitfalls of ANALYZE
Autovacuum does not run ANALYZE on temporary tables or foreign tables. Manually ANALYZE them.
The query plan can change after ANALYZE even when the table content hasn't changed.
ANALYZE takes a random sample of the table contents (300 x default_statistics_target rows). Hence, the collected statistics can vary depending on which rows are read.
To avoid or reduce this query plan variance, do one of the following:
Fix the query plan using third-party software like lt_hint_plan.
Raise the amount of statistics collected by ANALYZE, i.e., ALTER TABLE ... ALTER COLUMN ... SET STATISTICS. The more rows are used, the less the statistics fluctuate. However, this makes ANALYZE and query planning slower because more statistics are written and read.
Set the table's storage parameters autovacuum_analyze_threshold and autovacuum_analyze_scale_factor to large values, so that autovacuum won't practically ANALYZE it. Then, do manual ANALYZE if needed.
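For example (the values are illustrative; they effectively disable autovacuum's ANALYZE on this table):
ALTER TABLE mytable SET (autovacuum_analyze_threshold = 2000000000, autovacuum_analyze_scale_factor = 100);
ANALYZE mytable;  -- run manually when needed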
Use of a set-returning function could lead to a poor query plan
This is likely to be observed when the function is used to filter rows in a WHERE clause or join condition.
That's because the planner does not have reasonably accurate information about selectivity. Thus, its cost estimate would be inaccurate.
CREATE/ALTER FUNCTION can set a fixed cost and the number of rows it returns. The planner support function given by the SUPPORT clause, which needs to be written in C, can change the cost and rows dynamically.
CREATE FUNCTION ... RETURNS {SETOF ... | TABLE(...)} COST execution_cost ROWS result_rows SUPPORT support_function
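A hedged sketch of a set-returning function with explicit COST and ROWS (split_words is a hypothetical function):
CREATE FUNCTION split_words(text) RETURNS SETOF text
    LANGUAGE sql IMMUTABLE
    COST 100 ROWS 10
    AS $$ SELECT unnest(string_to_array($1, ' ')) $$;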
Custom plan and generic plan
PREPARE performs parsing, analysis, and rewriting to generate a prepared statement.
ex.
PREPARE stmt(int) AS SELECT * FROM mytable WHERE col = $1;
EXECUTE makes a query plan and executes it.
ex. EXECUTE stmt(123);
A query plan that takes specific parameter values into account is best. Such a plan is called a custom plan. On the other hand, a query plan that does not consider parameter values is called a generic plan.
You can tell a custom plan from a generic plan by the presence of a placeholder. For instance,
Custom plan: Filter: (col = 123)
Generic plan: Filter: (col = $1)
But planning is costly. If the generic plan is good enough, Lightdb uses it to avoid making custom plans.
Lightdb uses a custom plan for the first five executions of a prepared statement.
On the sixth execution, a generic plan is generated, and its cost is compared with the average cost of the past five executions.
If the generic plan is cheaper, Lightdb adopts it and keeps using it; custom plans won't be considered anymore.
Otherwise, a new custom plan is created and used. On subsequent executions, the cost of the generic plan is compared with the average cost of all past executions of custom plans, and whichever is cheaper is chosen.
You can force a generic or custom plan by setting plan_cache_mode to force_generic_plan or force_custom_plan respectively. This might be necessary to force custom plans, if the cost estimate of the generic plan is underestimated.
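A small sketch of observing the two plan kinds (mytable and col are hypothetical, as in the PREPARE example above):
PREPARE stmt2(int) AS SELECT * FROM mytable WHERE col = $1;
EXPLAIN EXECUTE stmt2(123);                -- Filter: (col = 123) indicates a custom plan
SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE stmt2(123);                -- Filter: (col = $1) indicates a generic plan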