lt_distributed_probackup.py

lt_distributed_probackup.py — backup and restore a LightDB distributed cluster

lt_distributed_probackup

lt_distributed_probackup.py is a utility to manage backup and recovery of LightDB Distributed database clusters. It is designed to perform periodic backups of the LightDB instance that enable you to restore the server in case of a failure. Python 3 is required to run it.

Like lt_probackup, lt_distributed_probackup.py offers the following benefits that can help you implement different backup strategies and deal with large amounts of data:

  • Incremental backup: page-level incremental backup allows you to save disk space and speed up backup and restore. With three different incremental modes, you can plan the backup strategy in accordance with your data flow.

  • Incremental restore: page-level incremental restore allows you to dramatically speed up restore by reusing valid unchanged pages in the destination directory.

  • Merge: using this feature allows you to implement "incrementally updated backups" strategy, eliminating the need to do periodical full backups.

  • Validation: automatic data consistency checks and on-demand backup validation without actual data recovery.

  • Verification: on-demand verification of a LightDB distributed instance with the checkdb command.

  • Retention: managing WAL archive and backups in accordance with retention policy. You can configure retention policy based on recovery time or the number of backups to keep, as well as specify time to live (TTL) for a particular backup. Expired backups can be merged or deleted.

  • Parallelization: running backup, restore, merge, delete, verification and validation processes on multiple parallel threads.

  • Compression: storing backup data in a compressed state to save disk space

  • Deduplication: saving disk space by not copying unchanged non-data files, such as _vm or _fsm

  • Remote operations: backing up a LightDB distributed instance located on a remote system or restoring a backup remotely.

  • Backup from standby: avoid extra load on the master by taking backups from a standby server.

  • External directories: backing up files and directories located outside of the LightDB data directory (LTDATA), such as scripts, configuration files, logs, or SQL dump files.

  • Backup catalog: getting the list of backups and the corresponding meta information in plain text or JSON formats.

  • Archive catalog: getting the list of all WAL timelines and the corresponding meta information in plain text or JSON formats

  • Partial Restore: restore only the specified databases or exclude the specified databases from restore.

To manage backup data, lt_distributed_probackup.py creates a backup catalog. This directory stores all backup files with additional meta information, as well as WAL archives required for point-in-time recovery. You can store backups for different instances in separate subdirectories of a single backup catalog.

Using lt_distributed_probackup.py, you can take full or incremental backups:

  • Full backups contain all the data files required to restore the database instance from scratch.

  • Incremental backups only store the data that has changed since the previous backup. This allows you to decrease the backup size and speed up backup operations. lt_distributed_probackup.py supports the following modes of incremental backups:

    • PAGE backup. In this mode, lt_distributed_probackup.py scans all WAL files in the archive from the moment the previous full or incremental backup was taken. Newly created backups contain only the pages that were mentioned in WAL records. This requires all the WAL files since the previous backup to be present in the WAL archive. If the size of these files is comparable to the total size of the database cluster files, speedup is smaller, but the backup still takes less space.

    • DELTA backup. In this mode, lt_distributed_probackup.py reads all data files in the LTDATA directory and copies only those pages that have changed since the previous backup. Continuous archiving is not necessary for it to operate. Note that this mode can impose read I/O pressure equal to a full backup.

    • PTRACK backup. In this mode, LightDB tracks page changes on the fly. Continuous archiving is not necessary for it to operate. Each time a relation page is updated, this page is marked in a special PTRACK bitmap for this relation. As one page requires just one bit in the PTRACK fork, such bitmaps are quite small. Tracking implies some minor overhead on the database server operation, but speeds up incremental backups significantly.
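
For illustration, the incremental mode is selected with the backup command's '-b' option, assuming the same mode names as lt_probackup; the catalog, instance, host and port follow the Usage example later in this document:

    lt_distributed_probackup.py backup -B /home/lightdb/backup --instance cn -h192.168.247.127 -p54332 -b page
    lt_distributed_probackup.py backup -B /home/lightdb/backup --instance cn -h192.168.247.127 -p54332 -b delta
    lt_distributed_probackup.py backup -B /home/lightdb/backup --instance cn -h192.168.247.127 -p54332 -b ptrack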

Regardless of the chosen backup type, all backups taken with lt_distributed_probackup.py support the following strategies of WAL delivery:

  • Autonomous backups stream, via the replication protocol, all the WAL files required to restore the cluster to a consistent state at the time the backup was taken. Even if continuous archiving is not set up, the required WAL segments are included in the backup.

  • Archive backups rely on continuous archiving.
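
A minimal sketch of choosing the WAL delivery strategy, assuming lt_probackup's '--stream' flag is passed through unchanged; catalog, instance, host and port follow the Usage example below:

    # autonomous (stream) backup: the required WAL is streamed into the backup itself
    lt_distributed_probackup.py backup -B /home/lightdb/backup --instance cn -h192.168.247.127 -p54332 -b full --stream
    # archive backup (default): relies on continuous archiving via archive_command
    lt_distributed_probackup.py backup -B /home/lightdb/backup --instance cn -h192.168.247.127 -p54332 -b full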

lt_distributed_probackup.py uses lt_probackup to manage backup and recovery of LightDB Distributed database clusters. Its usage is almost the same as lt_probackup's, and it fully supports the functions that lt_probackup supports. The specific differences are as follows.

Differences from lt_probackup

lt_distributed_probackup.py uses the connection options to get datanode information from the coordinator when running add-instance and backup.

When running restore, set-config, show-config, validate, checkdb, show, delete and del-instance, lt_distributed_probackup.py finds the matching datanode instance names in the backup path based on the coordinator's instance name specified by '--instance'.

To set different options for the coordinator and the datanodes, some options can be set to a list, separated by semicolons. If an option is not set to a list, all nodes use the same value.
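
For example, the restore command's '--remote-host' can carry one value per node; this is the same command used in the "restore to new LightDB distributed cluster" step of the Usage section below:

    lt_distributed_probackup.py restore -B /home/lightdb/backup --instance cn --parallel-num=1 --remote-host='192.168.247.130;192.168.247.131;192.168.247.132'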

'-i --backup-id' can be a distributed backup ID; you can see distributed backup IDs with the show command. By specifying it, you can operate on the datanodes' backups at the same time. This is supported from version 23.1 onwards. Older backups will all have status 'ERROR' in the show command output without '--detail'.
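
For example (the backup ID below is a placeholder), list the distributed backup IDs with the show command, then validate all nodes' backups at once by that ID:

    lt_distributed_probackup.py show -B /home/lightdb/backup --instance cn
    lt_distributed_probackup.py validate -B /home/lightdb/backup --instance cn -i RMICH6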

log options

  • Log options control lt_distributed_probackup.py's logs, not lt_probackup's logs.

  • The option '--lt-probackup-log-level' is added to change lt_probackup's console log level (see the example after this list).

  • By default, file logging is enabled. The log level is verbose and the log file name is 'lt_distributed_probackup-%Y-%m-%d.log'. The default log directory is 'BACKUP_PATH/log', but when the backup command is used with '--with-init', 'BACKUP_PATH' may not exist yet, in which case the log directory is '/tmp/ltAdminLogs'.

  • '--error-log-filename', '--log-rotation-size' and '--log-rotation-age' are not supported. '--log-rotation-size' and '--log-rotation-age' apply to an individual log file, but by default the log is not an individual log file.
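
For example, to quiet the console output of the underlying lt_probackup calls while keeping the script's own file logging (catalog, instance, host and port follow the Usage example below; the level value is illustrative):

    lt_distributed_probackup.py backup -B /home/lightdb/backup --instance cn -h192.168.247.127 -p54332 -b full --lt-probackup-log-level=warning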

add-instance

  • Connection options are added to connect to the coordinator node. If it is a primary, add-instance is run for all primary nodes; otherwise it is run for all standby nodes.

  • '--remote-host' will be ignored, because it will be the same as '-h'.

  • '--remote-path' can be a list.

  • The option '--no_distribution' is added to execute like lt_probackup.

  • After the add-instance command executes, it outputs the generated datanode instance names. In a datanode's archive_command, '--instance' must be the same as this name. You can also derive it as 'cn_instance_name_dn_id' (for example: 'cn_dn_1'), where 'cn_instance_name' is the coordinator instance name specified by '--instance' and 'id' is the node_id in pg_dist_node.

backup

  • Connection options are used to connect to the coordinator node. If it is a primary, the backup is taken for all primary nodes; otherwise it is taken for all standby nodes.

  • '--remote-host' will be ignored, because it will be the same as '-h'.

  • '-D pgdata-path' will be ignored, because it has already been added to the config by add-instance.

  • The option '--with-init' is added; when set, backup will run init and add-instance if they have not been done yet (see the example after this list).

  • The option '--parallel-num' is added to execute lt_probackup concurrently. The default is 1.
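
For example (catalog, data directory, host and port follow the Usage example below), a single backup command that initializes the catalog and adds the instances if they are missing, running two lt_probackup processes in parallel:

    lt_distributed_probackup.py backup -B /home/lightdb/backup -D /home/lightdb/data --instance cn -h192.168.247.127 -p54332 -b full --with-init --parallel-num=2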

restore

  • By specifying the coordinator instance name, it gets all datanode instance names and restores them.

  • '--db-include' and '--db-exclude' are not supported yet.

  • '-D pgdata-path', '-i backup_id', '--recovery-target-xid', '--recovery-target-lsn', '--recovery-target-timeline', '--recovery-target-name', '--restore_command', '--primary_conninfo', '--primary-slot-name', '--tablespace-mapping', '--remote-host' and '--remote-path' can be a list.

  • After restore, you may need to call canopy_update_node to update the coordinator's and datanodes' metadata, such as host and port (see the sketch after this list).

  • The option '--parallel-num' is added to execute lt_probackup concurrently. The default is 1.

  • The option '--no_distribution' is added to execute like lt_probackup.
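
A minimal sketch of the metadata update mentioned above, assuming canopy_update_node has the same signature as Canopy's citus_update_node(nodeid, new_node_name, new_node_port) and that the ltsql client is available; node IDs, hosts, ports and the database name are illustrative:

    # run on the restored coordinator
    ltsql -p 54332 -d postgres -c "SELECT nodeid, nodename, nodeport FROM pg_dist_node;"
    ltsql -p 54332 -d postgres -c "SELECT canopy_update_node(1, '192.168.247.131', 54332);"
    ltsql -p 54332 -d postgres -c "SELECT canopy_update_node(2, '192.168.247.132', 54332);"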

show

  • By specifying the coordinator instance name, it gets all datanode instance names and executes the command for them.

  • The show command supports the option '--no_distribution' to execute like lt_probackup.

  • The show command shows the distributed backup status. Its status will not be OK if the backup status of any node is not OK.

  • With '--detail', the show command shows every node's backup info. This is the default behavior in LightDB version 22.4. In LightDB version 23.1 or later, you must specify '--detail' to show every node's backup info.
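
For example, show the distributed backup status first, then the per-node details:

    lt_distributed_probackup.py show -B /home/lightdb/backup --instance cn
    lt_distributed_probackup.py show -B /home/lightdb/backup --instance cn --detail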

set-config/show-config/checkdb/validate/delete/del-instance

  • By specifying the coordinator instance name, it gets all datanode instance names and executes the command for them.

  • The set-config, delete and del-instance commands support the option '--no_distribution' to execute like lt_probackup.

  • For checkdb, the connection options, if provided, must point to the coordinator. It is recommended to run it by specifying -B and --instance.

  • The delete command's option '-i' can be a list.

  • The validate command's options '-i', '--recovery-target-xid', '--recovery-target-lsn', '--recovery-target-timeline' and '--recovery-target-name' can be a list.
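
For example (the backup IDs below are placeholders), delete one backup per node by passing '-i' as a semicolon-separated list, one ID for the coordinator and one for each datanode:

    lt_distributed_probackup.py delete -B /home/lightdb/backup --instance cn -i 'RMICH6;RMI6YR;RMI6ZZ'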

set-backup/merge

  • Usage is the same as lt_probackup, only with more log output; you can use lt_probackup directly instead.

Example

Preprocess

  • Passwordless SSH (key-based authentication) between the backup server and all cluster nodes.
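
A minimal sketch of setting it up, assuming the 'lightdb' OS user and the hosts of the Usage example below:

    # run on the backup server (192.168.247.126) as the lightdb user
    ssh-keygen -t rsa
    ssh-copy-id lightdb@192.168.247.127   # coordinator
    ssh-copy-id lightdb@192.168.247.128   # datanode1
    ssh-copy-id lightdb@192.168.247.129   # datanode2

Because the archive_command in step 3 pushes WAL over SSH to the backup server, each database node also needs passwordless SSH back to 192.168.247.126.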

Usage

backup server: 192.168.247.126; backup dir: /home/lightdb/backup
coordinator info: 192.168.247.127:54332; coordinator data dir: /home/lightdb/data
datanode1 info: 192.168.247.128:54332
datanode2 info: 192.168.247.129:54332
new cluster (cn;dn1;dn2): 192.168.247.130;192.168.247.131;192.168.247.132

  1. init backup dir, dir path: /home/lightdb/backup

    lt_distributed_probackup.py init -B /home/lightdb/backup
                        
  2. add instance, coordinator instance name: cn; coordinator instance ip:port: 192.168.247.127:54332; coordinator data dir: /home/lightdb/data.

    lt_distributed_probackup.py add-instance -B /home/lightdb/backup -D /home/lightdb/data --instance cn -h192.168.247.127 -p54332
                        
  3. Set up continuous WAL archiving after adding the instance. The datanode instance names are generated when the add-instance command executes.

    # cn
    archive_mode=on
    archive_command="/home/lightdb/lightdb-x/13.8-24.2/bin/lt_probackup archive-push -B /home/lightdb/backup --instance=cn --wal-file-name=%f --remote-host=192.168.247.126"
    # dn1
    archive_mode=on
    archive_command="/home/lightdb/lightdb-x/13.8-24.2/bin/lt_probackup archive-push -B /home/lightdb/backup --instance=cn_dn_1 --wal-file-name=%f --remote-host=192.168.247.126"
                        
  4. take a full backup of the distributed cluster

    lt_distributed_probackup.py backup -B /home/lightdb/backup -D /home/lightdb/data --instance cn -h192.168.247.127 -p54332 -b full --parallel-num=1
                        
  5. restore LightDB distributed cluster

    restore to original LightDB distributed cluster

    lt_distributed_probackup.py restore  -B /home/lightdb/backup --instance cn --parallel-num=1
                        

    restore to new LightDB distributed cluster

    lt_distributed_probackup.py restore  -B /home/lightdb/backup --instance cn --parallel-num=1 --remote-host='192.168.247.130;192.168.247.131;192.168.247.132'
                        
  6. checkdb; the backup catalog must be on the local server

    lt_distributed_probackup.py checkdb -B /home/lightdb/backup  --instance cn
    lt_distributed_probackup.py checkdb -h192.168.247.127 -p54332
                        
  7. set config for instance

    lt_distributed_probackup.py set-config -B /home/lightdb/backup  --instance cn  --archive-timeout=6min
                        
  8. show-config for instance

    lt_distributed_probackup.py show-config -B /home/lightdb/backup  --instance cn
                        
  9. validate backup

    lt_distributed_probackup.py validate -B /home/lightdb/backup  --instance cn
                        
  10. show backup

    lt_distributed_probackup.py show -B /home/lightdb/backup  --instance cn
                        
  11. delete backup

    lt_distributed_probackup.py delete -B /home/lightdb/backup  --instance cn  --delete-expired --retention-redundancy=1
                        
  12. delete instance

    lt_distributed_probackup.py del-instance -B /home/lightdb/backup  --instance cn
                        
  13. merge and set-backup are the same as with lt_probackup

    lt_distributed_probackup.py merge -B /home/lightdb/backup  --instance cn -i RMICH6 -j 2 --progress --no-validate --no-sync
    lt_distributed_probackup.py set-backup -B /home/lightdb/backup  --instance cn_dn_2  -i RMI6YR --note='cn_dn_2'
                        

Limitations

lt_distributed_probackup.py currently has the following limitations:

  • The server from which the backup was taken and the restored server must be compatible in terms of block_size (reports the size of a disk block; determined by the value of BLCKSZ when building the server; the default value is 8192 bytes) and wal_block_size (reports the size of a WAL disk block; determined by the value of XLOG_BLCKSZ when building the server; the default value is 8192 bytes), and must have the same major release number.

  • If you do not have permission to create the backup directory, the directory is not actually created, although the init command reports success.

  • When running remote operations via ssh, remote and local lt_probackup versions must be the same.

  • Even if you only need to back up a local LightDB Distributed database cluster, you still need to configure passwordless SSH for 127.0.0.1 and the local IP.

  • The maximum parallelism, which depends on '-j' and '--parallel-num', may be limited by the sshd setting 'MaxStartups'. When 'MaxStartups' is small, running lt_distributed_probackup.py with a large parallel number may report the error: 'ERROR: Agent error: kex_exchange_identification: read: Connection reset by peer'. If you want more concurrency, increase 'MaxStartups' in /etc/ssh/sshd_config and restart the sshd service.
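
An illustrative sshd adjustment for higher concurrency; the value is only an example and should be sized for your environment:

    # /etc/ssh/sshd_config on every host that receives SSH connections from lt_probackup
    MaxStartups 100:30:200
    # apply the change
    systemctl restart sshd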