3.2. Architecture

3.2.1. Client-Server architecture
3.2.2. Logical database structures
3.2.3. Database object hierarchy
3.2.4. Object identifier (OID)
3.2.5. Physical database structures
3.2.6. Instance
3.2.7. Reading and writing database data
3.2.8. Query processing
3.2.9. References

3.2.1. Client-Server architecture

  • Server

    • Database server: Lightdb

    • Server application: ex. lt_initdb, lt_ctl, lt_upgrade

  • Client

    • Client interface: ex. libpq, ECPG, ltJDBC, psqlODBC, Nlightdb

    • Client application: ex. ltsql, ltbench, lt_dump, lt_restore

  • Frontend/Backend protocol

    • Frontend=client, Backend=server

    • Message-based protocol for communication between frontends and backends over TCP/IP and Unix-domain sockets

  • Compatibility between client and server

    • ltsql works best with servers of the same or an older major version. It also works with servers of a newer major version, although backslash commands are likely to fail.

    • Drivers are server-version agnostic: always use the latest driver version
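
A quick way to check compatibility from a client is to ask the server for its version. A minimal sketch, assuming the core version() function and the server_version_num parameter are available unchanged in Lightdb:

  -- run from ltsql (or any other client) after connecting
  SELECT version();          -- full server version string
  SHOW server_version_num;   -- numeric server version, easy to compare in scripts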

3.2.2. Logical database structures

  • Database cluster is a collection of databases, roles, and tablespaces

  • Right after lt_initdb, the database cluster contains the following databases:

    • template1: a new database is cloned from this, unless another template database is specified

    • template0: a pristine copy of the original contents of template1

    • postgres: default database used by utilities and users

  • Each database contains its own system catalogs that store the metadata of its local database objects

  • The database cluster contains some shared system catalogs that store the metadata of the cluster-wide global objects

    • Shared system catalogs can be accessed from inside each database

    • Query to get shared system catalogs: SELECT relname FROM pg_class WHERE relisshared AND relkind = 'r';

    • ex. pg_authid, pg_database, pg_tablespace

  • Tablespace

    • pg_global ($LTDATA/global/): stores the shared system catalogs

    • pg_default ($LTDATA/base/): stores template0, template1, and postgres. The default tablespace for other databases.

    • User tablespaces: created with CREATE TABLESPACE name LOCATION 'dir_path';
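
A minimal sketch of creating and listing tablespaces, assuming the pg_tablespace catalog and the pg_tablespace_location() function behave as in PostgreSQL; the directory path is illustrative and must already exist, owned by the server's OS user:

  CREATE TABLESPACE mytbs LOCATION '/data/mytbs';

  -- list all tablespaces; pg_default and pg_global report empty locations
  -- because they live inside $LTDATA
  SELECT spcname, pg_tablespace_location(oid) AS location
  FROM pg_tablespace;

  -- place new objects in the tablespace
  CREATE TABLE t (id int) TABLESPACE mytbs;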

3.2.3. Database object hierarchy

  • Database

    • Access method

    • Cast

    • Event trigger

    • Extension

    • Foreign-data wrapper

    • Foreign server

    • Procedural language

    • Publication

    • Row-level security policy (the name must be distinct from that of any other policy for the table)

    • Rule (the name must be distinct from that of any other rule for the same table)

    • Schema

      • Aggregate function

      • Collation

      • Conversion

      • Data type

      • Domain

      • Extended statistics

      • Foreign table

      • Function

      • Index

      • Operator

      • Operator class

      • Operator family

      • Procedure

      • Sequence

      • Table

      • Text search configuration

      • Text search dictionary

      • Text search parser

      • Text search template

      • Trigger (inherits the schema of its table)

      • View

    • Subscription

    • Transform

    • User mapping

  • Role

  • Tablespace
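
The database > schema > object part of this hierarchy can be seen in the system catalogs. A sketch, assuming the standard pg_class and pg_namespace catalogs:

  -- relations in the current database, grouped by schema
  SELECT n.nspname AS schema,
         c.relname AS relation,
         c.relkind            -- r = table, i = index, S = sequence, v = view, ...
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  ORDER BY n.nspname, c.relname;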

3.2.4. Object identifier (OID)

  • OIDs are used internally by Lightdb as primary keys for various system tables.

    • ex. SELECT oid, relname FROM pg_class WHERE relname = 'mytable';

  • Type oid represents an OID.

  • oid is an unsigned four-byte integer.

  • An OID is allocated from a single cluster-wide counter, so it is not large enough to provide database-wide uniqueness.

    • A specific object is identified by two OIDs (classid and objid) in pg_depend and pg_shdepend.
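
For example, pg_depend records each object as a (classid, objid) pair: classid is the OID of the system catalog the object belongs to (e.g., pg_class), and objid is the object's OID within that catalog. A sketch using standard regclass casts; the table name is illustrative:

  -- objects that depend on mytable (e.g., its indexes and constraints)
  SELECT classid::regclass    AS dependent_catalog,
         objid,
         refclassid::regclass AS referenced_catalog,
         refobjid,
         deptype
  FROM pg_depend
  WHERE refclassid = 'pg_class'::regclass
    AND refobjid   = 'mytable'::regclass;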

3.2.5. Physical database structures

Directories

  • Data directory ($LTDATA)

    • base/: Subdirectory containing per-database subdirectories

    • global/: Subdirectory containing cluster-wide tables, such as pg_database

    • lt_multixact/: Subdirectory containing multitransaction status data (used for shared row locks)

    • lt_subtrans/: Subdirectory containing subtransaction status data

    • lt_tblspc/: Subdirectory containing symbolic links to tablespaces

    • lt_wal/: Subdirectory containing WAL (Write-Ahead Log) files

    • lt_xact/: Subdirectory containing transaction commit status data

  • Configuration file directories (optional)

  • Tablespace directories (optional)

  • WAL directory (optional)

Files in data directory

  • Configuration files (lightdb.conf, lt_hba.conf, lt_ident.conf): Can be stored in other directories

  • Control file (global/lt_control): Stores control info such as the cluster state, the latest checkpoint's WAL location, next OID, and next XID

  • Regular relation data file

    • A relation is a set of tuples: table, index, sequence, etc.

    • Each relation has its own set of files.

    • Each file consists of 8 KB blocks.

    • Lazy allocation: A new heap table file contains 0 blocks, while a new B-tree index file contains 1 block.

    • A relation can have several types of data files (forks): main, FSM, VM, and initialization (see the sketch after this list)

    • Main fork (base/<database_OID>/<relation_filenode_no>)

      • ex. "SELECT pg_relation_filepath('mytable');" returns base/17354/32185, where 17354 is the database's OID and 32185 is the mytable's filenode number

      • Stores tuple data.

    • FSM (free space map) fork (base/<database_OID>/<relation_filenode_no>_fsm)

      • Keeps track of free space in the relation.

      • Entries are organized as a tree, where each leaf node entry stores free space in one relation block.

      • lt_freespacemap and pageinspect can be used to examine its contents.

    • VM (visibility map) fork (base/<database_OID>/<relation_filenode_no>_vm)

      • Keeps track of:

        • which pages contain only tuples that are known to be visible to all active transactions

        • which pages contain only frozen tuples

      • Each heap relation has a Visibility Map; an index does not have one.

      • Stores two bits per heap page:

        • All-visible bit: if set, the page does not contain any tuples that need to be vacuumed. Also used by index-only scans to answer queries using only the index tuple.

        • All-frozen bit: if set, all tuples on the page have been frozen, therefore vacuum can skip the page.

      • lt_visibility can be used to examine its contents.

    • Initialization fork (base/<database_OID>/<relation_filenode_no>_init)

      • Each unlogged table and index has an initialization fork.

      • The content is nearly empty: 0 blocks for a table, 1 block for a B-tree index.

      • Unlogged relations are reset during recovery: the initialization fork is copied over the main fork, and other forks are erased.

  • Temporary relation data file

    • base/<database_OID>/tBBB_FFF

      • BBB is the backend ID of the backend which created the file, and FFF is the filenode number

      • ex. base/5/t3_16450

    • Has main, FSM, and VM forks, but not the initialization fork.

  • A large relation is divided into 1 GB segment files.

    • e.g., 12345, 12345.1, 12345.2, ...
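
The sketch referenced above shows how to locate a relation's files and measure its forks, assuming the pg_relation_filepath() and pg_relation_size() functions keep their PostgreSQL behavior (fork names: 'main', 'fsm', 'vm', 'init'):

  CREATE TABLE mytable (id int, val text);

  -- path of the main fork, relative to the data directory
  SELECT pg_relation_filepath('mytable');   -- ex. base/17354/32185

  -- size of each fork in bytes; a brand-new heap table reports 0 for all of
  -- them (lazy allocation), and the fsm/vm forks appear only after enough
  -- activity and VACUUM
  SELECT pg_relation_size('mytable', 'main') AS main_fork,
         pg_relation_size('mytable', 'fsm')  AS fsm_fork,
         pg_relation_size('mytable', 'vm')   AS vm_fork;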

Page (= block)

  • Each page is 8 KB. Configurable when building Lightdb.

  • All relation types (tables, indexes, etc.) use the same page format.

  • The content is the same in memory and on storage.

  • Each page stores multiple data values called items. In a table, an item is a row; in an index, an item is an index entry.

  • pageinspect can be used to examine the content.

  • Layout of a page:

    1. Page header: 24 bytes

    2. An array of item identifiers pointing to the actual items: each entry is an (offset, length) pair, 4 bytes per item

    3. Free space

    4. Items

    5. Special space: 0 bytes for tables; the size differs among index types (B-tree, GIN, etc.)
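
A sketch of examining a heap page header with pageinspect, assuming Lightdb ships the extension with the standard get_raw_page() and page_header() functions; the table must already contain at least one page of data:

  CREATE EXTENSION IF NOT EXISTS pageinspect;

  -- header of block 0: lower and upper delimit the free space,
  -- special is the offset of the special space (= pagesize for heap pages)
  SELECT lsn, lower, upper, special, pagesize
  FROM page_header(get_raw_page('mytable', 0));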

Table row

  • pageinspect can be used to examine the content.

  • Layout of a row

    1. Header: 23 bytes

    2. Null bitmap (optional): 1 bit for each column

    3. User data: columns of the row
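
The item identifiers and row headers of the same page can be listed with pageinspect's heap_page_items(), again assuming the standard function names:

  -- one output row per item identifier in block 0:
  --   lp_off, lp_len = the (offset, length) pair of the item identifier
  --   t_xmin, t_xmax = row header fields (inserting / deleting transaction IDs)
  --   t_data         = the user data (column values) in raw form
  SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_data
  FROM heap_page_items(get_raw_page('mytable', 0));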

3.2.6. Instance

The instance is a group of server-side processes, their local memory, and the shared memory.

Processes

  • Single-threaded processes: the postmaster launches a single backend process for each client connection, so each SQL statement normally uses only one CPU core. Parallel query, index builds, VACUUM, etc. can utilize multiple CPU cores by running multiple server processes.

  • postmaster: The parent of all server processes. Controls the server startup and shutdown. Creates shared memory and semaphores. Launches other server processes and reaps dead ones. Opens and listens on TCP ports and/or Unix-domain sockets, accepts connection requests, and spawns client backends to handle the accepted connections.

  • (Client) backend: Acts on behalf of a client session and handles its requests, i.e., executes SQL commands.

  • Background processes

    • logger: Catches all stderr output from other processes through pipes and writes it to log files.

    • checkpointer: Handles all checkpoints.

    • background writer: Periodically wakes up and writes out dirty shared buffers so that other processes don't have to write them when they need to free a shared buffer to read in another page.

    • startup: Performs crash and point-in-time recovery. Ends as soon as the recovery is complete.

    • walwriter: Periodically wakes up and writes out WAL buffers to reduce the amount of WAL that other processes have to write. Also ensures that commit WAL records of asynchronously committed transactions get written to disk.

    • archiver: Archives WAL files.

    • autovacuum launcher: Always running when autovacuum is enabled. Schedules autovacuum workers to run.

    • autovacuum worker: Connects to the database chosen by the launcher, examines the system catalogs to select tables, and vacuums them.

    • parallel worker: Executes part of a parallel query plan.

    • walreceiver: Runs on the standby server. Receives WAL from the walsender, stores it on disk, and tells the startup process to continue recovery based on it.

    • walsender: Runs on the primary server. Sends WAL to a single walreceiver.

    • logical replication launcher: Runs on the subscriber. Starts and coordinates logical replication workers.

    • logical replication worker: Runs on the subscriber. An apply worker per subscription receives logical changes from walsender on the publisher and applies them. One or more tablesync workers perform initial table copy for each table.

  • Background worker: Runs system-supplied or user-supplied code. e.g., used for parallel query and logical replication.
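
Most of these processes can be listed from SQL. A sketch, assuming the pg_stat_activity view keeps its PostgreSQL columns; backend_type distinguishes client backends from the background processes described above:

  SELECT pid, backend_type, state
  FROM pg_stat_activity
  ORDER BY backend_type;
  -- typical backend_type values: client backend, checkpointer, background writer,
  -- walwriter, autovacuum launcher, logical replication launcher, ...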

Memory

  • Shared memory

    • Shared buffers: Stores cached copies of data file blocks (main, FSM, and VM forks).

    • WAL buffers: Transactions put WAL records here before writing them out to disk.

    • Other various areas: One large shared memory segment is divided into areas for specific uses.

    • The allocations of areas can be examined with pg_shmem_allocations.

  • Local memory

    • Work memory: Allocated for a query operation such as sort and hash. Configured with work_mem and hash_mem_multiplier parameters.

    • Maintenance work memory: Allocated for maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE. Configured with maintenance_work_mem parameter.

    • Temporary buffers: Allocated for caching temporary table blocks. Configured with temp_buffers parameter.

    • Other various areas: A memory context is allocated for a specific use (e.g., a message from the client, a transaction, a query plan, an executor state). Hundreds of memory contexts per session are possible.
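
A sketch of inspecting both kinds of memory, assuming pg_shmem_allocations (mentioned above) and the standard parameter names are available:

  -- largest named areas inside the shared memory segment
  SELECT name, size
  FROM pg_shmem_allocations
  ORDER BY size DESC
  LIMIT 10;

  -- shared and local memory sizing parameters as seen by the current session
  SHOW shared_buffers;
  SHOW work_mem;
  SHOW maintenance_work_mem;
  SHOW temp_buffers;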

3.2.7. Reading and writing database data

  • Read:

    • First, search the shared buffers for a buffer containing the target block. If found, it's returned to the requester.

    • Otherwise, allocate a buffer from a free buffer list, and read the target block from the data file into the buffer.

    • If there is no free buffer, evict a buffer that is in use and reuse it; if the evicted buffer is dirty, write it out to disk first.

  • Write:

    • Find the target shared buffer, modify its contents, and write the changes to the WAL buffers.

    • The modifying transaction writes out its WAL records from the WAL buffers to disk, including the commit WAL record.

    • The modified dirty shared buffers are flushed to disk by background writer, checkpointer, or any other process. This is asynchronous with the transaction completion.

  • Any backend can read and write shared buffers, WAL buffers, data and WAL files. Unlike some other DBMSs, writes are not exclusively performed by a particular background process.

  • The database data file is read and written one block at a time. There's no multiblock I/O.

  • Some operations bypass shared buffers: the write of an index during index creation, CREATE DATABASE, ALTER TABLE ... SET TABLESPACE
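
The buffer traffic described above can be observed per query with EXPLAIN's BUFFERS option, assumed here to behave as in PostgreSQL: "shared hit" counts blocks found in shared buffers, "read" counts blocks fetched from the data files.

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM mytable;
  -- illustrative output fragment:
  --   Buffers: shared hit=120 read=35
  --   hit  = block was already in shared buffers
  --   read = block was read from the data file into a shared buffer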

3.2.8. Query processing

  1. A client connects to a database, sends a query (SQL command) to the server, and receives the result.

  2. The parser first checks the query for correct syntax. Then, it interprets the semantics of the query to understand which tables, views, functions, data types, etc. are referenced.

  3. The rewrite system (rewriter) transforms the query based on the rules stored in the system catalog pg_rewrite. One example is views: a query that accesses a view is rewritten to use its base tables (see the sketch after this list).

  4. The planner/optimizer creates a query plan.

  5. The executor executes the query plan and returns the result set to the client.
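
Steps 3 and 4 can be observed with a small view, as noted above: the rewriter expands the view using the rule stored in pg_rewrite, so the plan produced by the planner scans the base table directly. A sketch using core commands only:

  CREATE TABLE items (id int, price numeric);
  CREATE VIEW cheap_items AS
    SELECT id, price FROM items WHERE price < 10;

  EXPLAIN SELECT * FROM cheap_items;
  -- illustrative output: the view is gone, only the base table remains
  --   Seq Scan on items
  --     Filter: (price < '10'::numeric)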

Notes

  • Each session runs on a connection to a single database, so it cannot directly access tables in a different database. However, a session can reach tables in another database through a foreign data wrapper such as postgres_fdw, which opens a separate connection (and thus a separate session) to that database. For example, a single SQL command can join a table in the local database with a table in a remote database (a sketch follows below).

  • Each SQL command basically uses only one CPU core. A parallel query and some utility commands such as CREATE INDEX and VACUUM can use multiple CPU cores by running background workers.
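
The sketch below illustrates the foreign-data-wrapper case mentioned in the first note, assuming postgres_fdw works as in PostgreSQL; the server name, connection options, and table names are illustrative:

  CREATE EXTENSION postgres_fdw;

  CREATE SERVER remote_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remotehost', port '5432', dbname 'otherdb');

  CREATE USER MAPPING FOR CURRENT_USER SERVER remote_srv
    OPTIONS (user 'remote_user', password 'secret');

  -- expose the remote tables locally as foreign tables
  IMPORT FOREIGN SCHEMA public FROM SERVER remote_srv INTO public;

  -- a single SQL command joining a local table with a remote one
  SELECT l.id, r.val
  FROM local_table l
  JOIN remote_table r ON r.id = l.id;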

3.2.9. References

Lightdb Documentation