Server
Database server: Lightdb
Server application: ex. lt_initdb, lt_ctl, lt_upgrade
Client
Client interface: ex. libpq, ECPG, ltJDBC, psqlODBC, Nlightdb
Client application: ex. ltsql, ltbench, lt_dump, lt_restore
Frontend/Backend protocol
Frontend=client, Backend=server
message-based protocol for communication between frontends and backends over TCP/IP and Unix-domain sockets
Compatibility between client and server
ltsql works best with servers of the same or an older major version. It also works with servers of a newer major version, although backslash commands are likely to fail.
Driver is server-version agnostic: Always use the latest driver version
Database cluster is a collection of databases, roles, and tablespaces
The database cluster initially contains some databases after lt_initdb:
template1: a new database is cloned from this, unless another template database is specified
template0: a pristine copy of the original contents of template1
postgres: default database used by utilities and users
Each database contains its own system catalogs that store the metadata of its local database objects
The database cluster contains some shared system catalogs that store the metadata of the cluster-wide global objects
Shared system catalogs can be accessed from inside each database
Query to get shared system catalogs:
SELECT relname FROM pg_class WHERE relisshared AND relkind = 'r';
ex. pg_authid, pg_database, pg_tablespace
Tablespace
pg_global ($LTDATA/global/): store shared system catalogs
pg_default ($LTDATA/base/): store template0, template1, postgres. The default tablespace for other databases.
User tablespaces: created with
CREATE TABLESPACE name LOCATION 'dir_path';
Database
Access method
Cast
Event trigger
Extension
Foreign-data wrapper
Foreign server
Procedural language
Publication
Row-level security policy (the name must be distinct from that of any other policy for the table)
Rule (the name must be distinct from that of any other rule for the same table)
Schema
Aggregate function
Collation
Conversion
Data type
Domain
Extended statistics
Foreign table
Function
Index
Operator
Operator class
Operator family
Procedure
Sequence
Table
Text search configuration
Text search dictionary
Text search parser
Text search template
Trigger (inherits the schema of its table)
View
Subscription
Transform
User mapping
Role
Tablespace
OIDs are used internally by Lightdb as primary keys for various system tables.
ex.
SELECT oid, relname FROM pg_class WHERE relname = 'mytable';
Type oid represents an OID.
oid is an unsigned four-byte integer.
An OID is allocated from a single cluster-wide counter, so it is not large enough to provide database-wide uniqueness in large databases.
A specific object is identified by two OIDs (classid and objid) in pg_depend and pg_shdepend.
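The counter behavior can be sketched in Python. This is a toy model, not Lightdb code; the constant FirstNormalObjectId = 16384 is PostgreSQL's, on the assumption that Lightdb inherits it.

```python
# Toy sketch of cluster-wide OID allocation.
# Assumption: Lightdb keeps PostgreSQL's FirstNormalObjectId = 16384,
# the lowest OID ever handed out for user objects.
FIRST_NORMAL_OBJECT_ID = 16384
OID_MAX = 2**32  # oid is an unsigned four-byte integer

class OidCounter:
    def __init__(self, start=FIRST_NORMAL_OBJECT_ID):
        self.next_oid = start

    def allocate(self):
        oid = self.next_oid
        self.next_oid = (self.next_oid + 1) % OID_MAX
        # After wraparound, skip the range reserved for system objects.
        if self.next_oid < FIRST_NORMAL_OBJECT_ID:
            self.next_oid = FIRST_NORMAL_OBJECT_ID
        return oid

# Because the counter is shared by the whole cluster, OIDs alone are not
# unique identifiers; pg_depend therefore uses the pair (classid, objid):
# the OID of the catalog plus the OID of the row within it.
counter = OidCounter(start=OID_MAX - 1)
a = counter.allocate()  # the last value before wraparound
b = counter.allocate()  # wraps around, skipping the reserved range
```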
Directories
Data directory ($LTDATA)
base/: Subdirectory containing per-database subdirectories
global/: Subdirectory containing cluster-wide tables, such as pg_database
lt_multixact/: Subdirectory containing multitransaction status data (used for shared row locks)
lt_subtrans/: Subdirectory containing subtransaction status data
lt_tblspc/: Subdirectory containing symbolic links to tablespaces
lt_wal/: Subdirectory containing WAL (Write-Ahead Log) files
lt_xact/: Subdirectory containing transaction commit status data
Configuration file directories (optional)
Tablespace directories (optional)
WAL directory (optional)
Files in data directory
Configuration files (lightdb.conf, lt_hba.conf, lt_ident.conf): Can be stored in other directories
Control file (global/lt_control): Stores control info such as the cluster state, checkpoint log location, next OID, next XID
Regular relation data file
A relation is a set of tuples: table, index, sequence, etc.
Each relation has its own set of files.
Each file consists of 8 KB blocks.
Lazy allocation: A new heap table file contains 0 blocks, while a new B-tree index file contains 1 block (the metapage).
There are some types of data files (forks): main, FSM, VM, initialization
Main fork
(base/<database_OID>/<relation_filenode_no>)
ex. SELECT pg_relation_filepath('mytable'); returns base/17354/32185, where 17354 is the database's OID and 32185 is mytable's filenode number
Stores tuple data.
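Assuming the default-tablespace layout above, the path that pg_relation_filepath() reports can be reconstructed with a small helper (a sketch; the fork suffixes follow the fork names listed in this section):

```python
def relation_filepath(database_oid: int, filenode: int, fork: str = "main") -> str:
    """Path of a relation data file relative to the data directory,
    for relations in the default tablespace (pg_default)."""
    suffix = {"main": "", "fsm": "_fsm", "vm": "_vm", "init": "_init"}[fork]
    return f"base/{database_oid}/{filenode}{suffix}"

relation_filepath(17354, 32185)        # the example above: base/17354/32185
relation_filepath(17354, 32185, "vm")  # the VM fork of the same relation
```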
FSM (free space map) fork
(base/<database_OID>/<relation_filenode_no>_fsm)
Keeps track of free space in the relation.
Entries are organized as a tree, where each leaf node entry stores free space in one relation block.
lt_freespacemap and pageinspect can be used to examine its contents.
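A sketch of how a leaf entry quantizes free space, assuming Lightdb keeps PostgreSQL's FSM granularity: one byte per heap block, 256 categories, so 32 bytes per category for an 8 KB page.

```python
# Toy model of an FSM leaf entry (assumed to match PostgreSQL's FSM).
BLCKSZ = 8192
FSM_CATEGORIES = 256
CAT_STEP = BLCKSZ // FSM_CATEGORIES  # 32 bytes of free space per category

def fsm_category(free_bytes: int) -> int:
    """One-byte category for a block's free space, rounded down
    to a 32-byte granularity and capped at 255."""
    return min(free_bytes // CAT_STEP, FSM_CATEGORIES - 1)

fsm_category(100)   # a block with 100 free bytes
fsm_category(8192)  # a completely empty block
```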
VM (visibility map) fork
(base/<database_OID>/<relation_filenode_no>_vm)
Keeps track of:
which pages contain only tuples that are known to be visible to all active transactions
which pages contain only frozen tuples
Each heap relation has a Visibility Map; an index does not have one.
Stores two bits per heap page:
All-visible bit: if set, the page does not contain any tuples that need to be vacuumed. Also used by index-only scans to answer queries using only the index tuple.
All-frozen bit: if set, all tuples on the page have been frozen, therefore vacuum can skip the page.
lt_visibility can be used to examine its contents.
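With two bits per heap page, one VM byte covers four heap pages. A toy helper computing where a heap block's bits live (a sketch that ignores the page header at the start of each VM page):

```python
# Toy addressing into the visibility map fork.
BITS_PER_HEAPBLOCK = 2                          # all-visible + all-frozen
HEAPBLOCKS_PER_BYTE = 8 // BITS_PER_HEAPBLOCK   # 4 heap pages per VM byte

def vm_position(heap_blkno: int):
    """(byte index, bit shift of the all-visible bit) for a heap block.
    The all-frozen bit would be at shift + 1."""
    byte_index = heap_blkno // HEAPBLOCKS_PER_BYTE
    shift = (heap_blkno % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK
    return byte_index, shift
```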
Initialization fork (base/<database_OID>/<relation_filenode_no>_init)
Each unlogged table and index has an initialization fork.
The content is minimal: a table's initialization fork contains 0 blocks, an index's contains 1 block.
Unlogged relations are reset during recovery: the initialization fork is copied over the main fork, and other forks are erased.
Temporary relation data file
base/<database_OID>/tBBB_FFF
BBB is the backend ID of the backend which created the file, and FFF is the filenode number
ex. base/5/t3_16450
Has main, FSM, and VM forks, but not the initialization fork.
A large relation is divided into 1 GB segment files.
e.g., 12345, 12345.1, 12345.2, ...
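The block-to-segment mapping can be sketched as follows (assuming 8 KB blocks and 1 GB segments, giving 131072 blocks per segment):

```python
# Toy mapping from a block number to its segment file and byte offset.
BLCKSZ = 8192
SEGMENT_SIZE = 1 << 30                         # 1 GB per segment file
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLCKSZ    # 131072 blocks

def block_location(filenode: int, blkno: int):
    """(segment file name, byte offset within that file) for a block.
    Segment 0 has no suffix; later segments append .1, .2, ..."""
    segno, seg_block = divmod(blkno, BLOCKS_PER_SEGMENT)
    name = str(filenode) if segno == 0 else f"{filenode}.{segno}"
    return name, seg_block * BLCKSZ

block_location(12345, 0)       # first block of the first segment
block_location(12345, 131072)  # first block of the second segment
```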
Page (= block)
Each page is 8 KB. Configurable when building Lightdb.
All relation types (tables, indexes, etc.) share the same page format.
The content is the same in memory and on storage.
Each page stores multiple data values called items. In a table, an item is a row; in an index, an item is an index entry.
pageinspect can be used to examine the content.
Layout of a page:
Page header: 24 bytes
An array of item identifiers pointing to the actual items: Each entry is an (offset,length) pair. 4 bytes per item.
Free space
Items
Special space: 0 bytes for tables; size varies by index access method (btree, GIN, etc.)
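The 24-byte header and the 4-byte item identifiers can be modeled with Python's struct module. This assumes Lightdb keeps PostgreSQL's layout: an 8-byte LSN, then checksum, flags, lower, upper, special, and pagesize/version as 2-byte fields, then a 4-byte prune XID; an item identifier packs a 15-bit offset, 2-bit flags, and a 15-bit length.

```python
import struct

# Assumed page header layout (pd_lsn, pd_checksum, pd_flags, pd_lower,
# pd_upper, pd_special, pd_pagesize_version, pd_prune_xid): 24 bytes.
PAGE_HEADER = struct.Struct("<QHHHHHHI")

def pack_item_id(offset: int, flags: int, length: int) -> int:
    """4-byte item identifier: 15-bit offset | 2-bit flags | 15-bit length."""
    return offset | (flags << 15) | (length << 17)

def unpack_item_id(word: int):
    return word & 0x7FFF, (word >> 15) & 0x3, (word >> 17) & 0x7FFF
```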
Table row
pageinspect can be used to examine the content.
Layout of a row
Header: 23 bytes
Null bitmap (optional): 1 bit for each column
User data: columns of the row
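The offset at which user data starts can be computed from the 23-byte header and the optional null bitmap, assuming the PostgreSQL rule that the data area begins on an 8-byte (MAXALIGN) boundary:

```python
def heap_tuple_data_offset(ncols: int, has_nulls: bool) -> int:
    """Offset of user data in a heap row: 23-byte fixed header,
    optional null bitmap (1 bit per column, rounded up to whole bytes),
    padded to the next 8-byte boundary (assumed MAXALIGN = 8)."""
    MAXALIGN = 8
    header = 23
    if has_nulls:
        header += (ncols + 7) // 8  # null bitmap bytes
    return (header + MAXALIGN - 1) // MAXALIGN * MAXALIGN

heap_tuple_data_offset(2, False)  # no nulls: header padded 23 -> 24
heap_tuple_data_offset(9, True)   # 9 columns need a 2-byte bitmap
```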
The instance is a group of server-side processes, their local memory, and the shared memory.
Processes
Single-threaded: postmaster launches a single backend process for each client connection. Thus, each SQL execution only uses a single CPU core. Parallel query, index build, VACUUM etc. can utilize multiple CPU cores by running multiple server processes.
postmaster: The parent of all server processes. Controls server startup and shutdown. Creates shared memory and semaphores. Launches other server processes and reaps dead ones. Opens and listens on TCP ports and/or Unix-domain sockets, accepts connection requests, and spawns client backends to handle them.
(Client) backend: Acts on behalf of a client session and handles its requests, i.e., executes SQL commands.
Background processes
logger: Catches all stderr output from other processes through pipes, and writes them to log files.
checkpointer: Handles all checkpoints.
background writer: Periodically wakes up and writes out dirty shared buffers so that other processes don't have to write them when they need to free a shared buffer to read in another page.
startup: Performs crash and point-in-time recovery. Ends as soon as the recovery is complete.
walwriter: Periodically wakes up and writes out WAL buffers to reduce the amount of WAL that other processes have to write. Also ensures that commit WAL records from asynchronously committed transactions get written to disk.
archiver: Archives WAL files.
autovacuum launcher: Always running when autovacuum is enabled. Schedules autovacuum workers to run.
autovacuum worker: Connects to the database chosen by the launcher, examines the system catalogs to select tables, and vacuums them.
parallel worker: Executes part of a parallel query plan.
walreceiver: Runs on the standby server. Receives WAL from the walsender, stores it on disk, and tells the startup process to continue recovery based on it.
walsender: Runs on the primary server. Sends WAL to a single walreceiver.
logical replication launcher: Runs on the subscriber. Coordinates logical replication workers to run.
logical replication worker: Runs on the subscriber. An apply worker per subscription receives logical changes from walsender on the publisher and applies them. One or more tablesync workers perform initial table copy for each table.
Background worker: Runs system-supplied or user-supplied code. e.g., used for parallel query and logical replication.
Memory
Shared memory
Shared buffers: Stores the cached copy of data files (main, FSM, and VM forks).
WAL buffers: Transactions put WAL records here before writing them out to disk.
Other various areas: One large shared memory segment is divided into areas for specific uses.
The allocations of areas can be examined with pg_shmem_allocations.
Local memory
Work memory: Allocated for a query operation such as sort and hash. Configured with work_mem and hash_mem_multiplier parameters.
Maintenance work memory: Allocated for maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE. Configured with maintenance_work_mem parameter.
Temporary buffers: Allocated for caching temporary table blocks. Configured with temp_buffers parameter.
Other various areas: A memory context is allocated for a specific usage (e.g., a message from client, transaction, query plan, execution state). Hundreds of memory contexts per session are possible.
Read:
First, search the shared buffers for a buffer containing the target block. If found, it's returned to the requester.
Otherwise, allocate a buffer from a free buffer list, and read the target block from the data file into the buffer.
If there's no free buffer, evict a used buffer and reuse it; if the evicted buffer is dirty, write it out to disk first.
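The steps above can be sketched as a toy buffer pool. This is illustration only: the real buffer manager uses a shared hash table, pins/reference counts, and a clock-sweep replacement strategy rather than this naive eviction.

```python
class BufferPool:
    """Toy shared-buffers read path: hit returns the cached page;
    a miss reads from 'disk', evicting (with dirty write-back) when full."""
    def __init__(self, disk, capacity=2):
        self.disk = disk       # dict: block number -> page contents
        self.capacity = capacity
        self.buffers = {}      # cached pages, in insertion order
        self.dirty = set()

    def read(self, blkno):
        if blkno in self.buffers:                # 1. found in shared buffers
            return self.buffers[blkno]
        if len(self.buffers) >= self.capacity:   # 3. no free buffer: evict
            victim = next(iter(self.buffers))
            if victim in self.dirty:             # write out dirty page first
                self.disk[victim] = self.buffers[victim]
                self.dirty.discard(victim)
            del self.buffers[victim]
        page = self.disk[blkno]                  # 2. read block from file
        self.buffers[blkno] = page
        return page
```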
Write
Find the target shared buffer, modify its contents, and write the changes to the WAL buffers.
The modifying transaction writes out its WAL records from the WAL buffers to disk, including the commit WAL record.
The modified dirty shared buffers are flushed to disk by background writer, checkpointer, or any other process. This is asynchronous with the transaction completion.
Any backend can read and write shared buffers, WAL buffers, data and WAL files. Unlike some other DBMSs, writes are not exclusively performed by a particular background process.
The database data file is read and written one block at a time. There's no multiblock I/O.
Some operations bypass shared buffers: the write of an index during index creation, CREATE DATABASE, ALTER TABLE ... SET TABLESPACE
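The write ordering described above (WAL flushed at commit, dirty data pages flushed later and independently) can be modeled with a toy class; all names here are invented for illustration.

```python
class MiniWal:
    """Toy write path: changes go to the shared buffer and the WAL
    buffers; commit flushes WAL to disk; dirty data pages are flushed
    later, asynchronously (as checkpointer/background writer would)."""
    def __init__(self):
        self.wal_buffers = []
        self.wal_on_disk = []
        self.shared_buffers = {}
        self.data_on_disk = {}
        self.dirty = set()

    def update(self, blkno, value):
        self.shared_buffers[blkno] = value            # modify cached page
        self.wal_buffers.append(("update", blkno, value))
        self.dirty.add(blkno)

    def commit(self):
        self.wal_buffers.append(("commit",))
        self.wal_on_disk += self.wal_buffers          # flush WAL at commit
        self.wal_buffers = []

    def flush_dirty_pages(self):                      # done asynchronously
        for blkno in sorted(self.dirty):
            self.data_on_disk[blkno] = self.shared_buffers[blkno]
        self.dirty.clear()
```

Note the invariant: the commit record reaches disk before the data page does, which is what makes crash recovery from WAL possible.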
A client connects to a database, sends a query (SQL command) to the server, and receives the result.
The parser first checks the query for correct syntax. Then, it interprets the semantics of the query to understand which tables, views, functions, data types, etc. are referenced.
The rewrite system (rewriter) transforms the query based on the rules stored in the system catalog pg_rewrite. One example is the view: a query that accesses a view is rewritten to use the base table.
The planner/optimizer creates a query plan.
The executor executes the query plan and returns the result set to the client.
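The four stages can be illustrated with stand-in functions; the rules dictionary here plays the role that pg_rewrite rules play for views, and every name is hypothetical.

```python
def parse(sql):
    # "Parser": a stand-in that just tokenizes the command
    return sql.strip().rstrip(";").split()

def rewrite(tree, views):
    # "Rewriter": expand view names into their base tables,
    # as pg_rewrite rules would for a real query tree
    return [views.get(tok, tok) for tok in tree]

def plan(tree):
    # "Planner": pick an access path; here, always a sequential scan
    return ("seqscan", tree[-1])

def execute(plan_node, tables):
    # "Executor": run the plan and produce the result set
    op, table = plan_node
    return tables[table]

views = {"active_users": "users"}          # view -> base table
tables = {"users": [("alice",), ("bob",)]}
result = execute(plan(rewrite(parse("SELECT * FROM active_users;"), views)), tables)
```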
Notes
Each session runs on a connection to a single database, so it cannot directly access tables in a different database. However, a foreign data wrapper such as postgres_fdw lets a session open a connection to another database and access tables there; for example, an SQL command can join a table in the local database with a table in a remote database.
Each SQL command basically uses only one CPU core. A parallel query and some utility commands such as CREATE INDEX and VACUUM can use multiple CPU cores by running background workers.
Lightdb Documentation