Recommended Monitoring and Maintenance Tasks
This section lists monitoring and maintenance activities recommended to ensure high availability and consistent performance of your LightDB-A Database cluster.
The tables in the following sections suggest activities that a LightDB-A System Administrator can perform periodically to ensure that all components of the system are operating optimally. Monitoring activities help you to detect and diagnose problems early. Maintenance activities help you to keep the system up-to-date and avoid deteriorating performance, for example, from bloated system tables or diminishing free disk space.
It is not necessary to implement all of these suggestions in every cluster; use the frequency and severity recommendations as a guide to implement measures according to your service requirements.
Database State Monitoring Activities
Activity | Procedure | Corrective Actions |
---|---|---|
List segments that are currently down. If any rows are returned, this should generate a warning or alert. Recommended frequency: run every 5 to 10 minutes. Severity: IMPORTANT | Run a query in the postgres database that lists segments marked down (see the sketch after this table). | If the query returns any rows, investigate the affected segment hosts and follow your segment recovery procedure to bring the down segments back up. |
Check for segments that are up and not in sync. If rows are returned, this should generate a warning or alert. Recommended frequency: run every 5 to 10 minutes. | Run a query in the postgres database that lists segments that are up but not in Synchronized mode (see the sketch after this table). | If the query returns rows, the segments might be in the process of moving from Not In Sync to Synchronized mode. Use gpstate -e to track progress. |
Check for segments that are not operating in their preferred role but are marked as up and Synchronized. If any segments are found, the cluster may not be balanced. If any rows are returned, this should generate a warning or alert. Recommended frequency: run every 5 to 10 minutes. Severity: IMPORTANT | Run a query in the postgres database that lists up, Synchronized segments whose current role differs from their preferred role (see the sketch after this table). | When segments are not running in their preferred role, processing might be skewed. Rebalance the cluster to return the segments to their preferred roles. |
Run a distributed query to test that it runs on all segments. One row should be returned for each primary segment. Recommended frequency: run every 5 to 10 minutes. Severity: CRITICAL | Run a query in the postgres database that dispatches to every primary segment (see the sketch after this table). | If this query fails, there is an issue dispatching to some segments in the cluster. This is a rare event. Check the hosts that cannot be dispatched to and ensure there is no hardware or networking issue. |
Test the state of coordinator mirroring on LightDB-A Database. If the value is not “STREAMING”, raise an alert or warning. Recommended frequency: run every 5 to 10 minutes. Severity: IMPORTANT | Run a psql command that reports the coordinator mirroring state (see the sketch after this table). | Check the log files on the coordinator and standby coordinator for errors. If there are no unexpected errors and the machines are up, re-initialize the standby coordinator to restore mirroring. |
Perform a basic check to see if the coordinator is up and functioning. Recommended frequency: run every 5 to 10 minutes. Severity: CRITICAL | Run a trivial query in the postgres database (see the sketch after this table). | If this query fails, the active coordinator may be down. Try to start the database on the original coordinator if the server is up and running. If that fails, activate the standby coordinator. |
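As a starting point for automating the checks in this table, the following is a minimal, hedged sketch. It assumes the Greenplum-style gp_segment_configuration catalog, the gp_dist_random function, and the standard pg_stat_replication view are available in LightDB-A; verify these names, and the values they return, against your release before wiring the script into an alerting job.

```bash
#!/usr/bin/env bash
# Sketch of the database state checks above (assumes Greenplum-compatible catalogs).
DB=postgres

# 1. Segments that are currently down: any row returned should raise an alert.
psql -d "$DB" -c "SELECT * FROM gp_segment_configuration WHERE status = 'd';"

# 2. Segments that are up but not yet in Synchronized mode.
psql -d "$DB" -c "SELECT * FROM gp_segment_configuration
                  WHERE status = 'u' AND mode != 's' AND content >= 0;"

# 3. Segments that are up and Synchronized but not in their preferred role.
psql -d "$DB" -c "SELECT * FROM gp_segment_configuration
                  WHERE status = 'u' AND mode = 's' AND role <> preferred_role;"

# 4. Distributed query test: expect one row per primary segment.
psql -d "$DB" -c "SELECT gp_segment_id, count(*)
                  FROM gp_dist_random('pg_class') GROUP BY 1;"

# 5. Coordinator mirroring state: alert if the reported state is not streaming.
psql -d "$DB" -c "SELECT state FROM pg_stat_replication;"

# 6. Basic coordinator liveness check: any trivial query confirms it accepts connections.
psql -d "$DB" -c "SELECT count(*) FROM gp_segment_configuration;"
```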
Hardware and Operating System Monitoring
Activity | Procedure | Corrective Actions |
---|---|---|
Check disk space usage on volumes used for LightDB-A Database data storage and the OS. Recommended frequency: every 5 to 30 minutes. Severity: CRITICAL | Set up a disk space check (see the sketch after this table). | Use VACUUM/VACUUM FULL on user tables to reclaim space occupied by dead rows. |
Check for errors or dropped packets on the network interfaces. Recommended frequency: hourly. Severity: IMPORTANT | Set up a network interface check. | Work with network and OS teams to resolve errors. |
Check for RAID errors or degraded RAID performance. Recommended frequency: every 5 minutes. Severity: CRITICAL | Set up a RAID check. | Replace failed disks as soon as possible and work with the system administration team to resolve degraded RAID performance. |
Check for adequate I/O bandwidth and I/O skew. Recommended frequency: when the cluster is created or when hardware issues are suspected. | Run the LightDB-A gpcheckperf utility. | The cluster may be under-specified if data transfer rates fall well below the baselines established when the cluster was accepted. If the machines on the cluster display an uneven performance profile, work with the system administration team to fix faulty machines. |
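A disk space check can be as simple as the sketch below. The gp_toolkit.gp_disk_free view and the /data mount point are assumptions borrowed from typical Greenplum-derived deployments; substitute the views and data directories your LightDB-A installation actually uses.

```bash
#!/usr/bin/env bash
# Hedged disk space check: report cluster-wide free space, then warn locally
# when the (assumed) /data volume exceeds a usage threshold.
DB=postgres
THRESHOLD_PCT=70   # example threshold; tune to your capacity plan

# Free space per segment host, as reported through the database.
psql -d "$DB" -c "SELECT * FROM gp_toolkit.gp_disk_free ORDER BY dfsegment;"

# OS-level check on the local data volume (field 5 of df -h output is Use%).
df -h /data | awk -v t="$THRESHOLD_PCT" \
  'NR > 1 { gsub("%", "", $5); if ($5 + 0 > t) print "WARNING: volume above threshold:", $0 }'
```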
Catalog Monitoring
Activity | Procedure | Corrective Actions |
---|---|---|
Run catalog consistency checks in each database to ensure the catalog on each host in the cluster is consistent and in a good state. You may run this command while the database is up and running. Recommended frequency: weekly. Severity: IMPORTANT | Run the LightDB-A gpcheckcat utility in each database (see the sketch after this table). Note: With the -O option, gpcheckcat runs just 10 of its usual 15 tests. | Run the repair scripts for any issues identified. |
Check for pg_class entries that have no corresponding pg_attribute entry. Recommended frequency: monthly. Severity: IMPORTANT | During a downtime, with no users on the system, run the LightDB-A gpcheckcat utility in each database (see the sketch after this table). | Run the repair scripts for any issues identified. |
Check for leaked temporary schemas and missing schema definitions. Recommended frequency: monthly. Severity: IMPORTANT | During a downtime, with no users on the system, run the LightDB-A gpcheckcat utility in each database (see the sketch after this table). | Run the repair scripts for any issues identified. |
Check constraints on randomly distributed tables. Recommended frequency: monthly. Severity: IMPORTANT | During a downtime, with no users on the system, run the LightDB-A gpcheckcat utility in each database (see the sketch after this table). | Run the repair scripts for any issues identified. |
Check for dependencies on non-existent objects. Recommended frequency: monthly. Severity: IMPORTANT | During a downtime, with no users on the system, run the LightDB-A gpcheckcat utility in each database (see the sketch after this table). | Run the repair scripts for any issues identified. |
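The gpcheckcat invocations below sketch how the checks in this table map to the utility's options in Greenplum-derived clusters (-O for the reduced online test set, -R to run a single named test). The option and test names are assumptions; confirm them against the gpcheckcat reference documentation for your LightDB-A release.

```bash
#!/usr/bin/env bash
# Hedged gpcheckcat sketch; run once per database.
DBNAME=mydb   # hypothetical database name

# Weekly, while the database is online: reduced test set.
gpcheckcat -O "$DBNAME"

# Monthly, during downtime with no users on the system:
gpcheckcat -R pgclass "$DBNAME"              # pg_class entries with no pg_attribute entry
gpcheckcat -R namespace "$DBNAME"            # leaked temporary schemas, missing schema definitions
gpcheckcat -R distribution_policy "$DBNAME"  # constraints on randomly distributed tables
gpcheckcat -R dependency "$DBNAME"           # dependencies on non-existent objects
```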
Data Maintenance
Activity | Procedure | Corrective Actions |
---|---|---|
Check for missing statistics on tables. | Check the gp_stats_missing view in each database (see the sketch after this table). | Run ANALYZE on tables that are missing statistics. |
Check for tables that have bloat (dead space) in data files that cannot be recovered by a regular VACUUM command. Recommended frequency: weekly or monthly. Severity: WARNING | Check the gp_bloat_diag view in each database (see the sketch after this table). | VACUUM FULL acquires an ACCESS EXCLUSIVE lock on tables. Run VACUUM FULL during a time when users and applications do not require access to the tables, such as during a time of low activity, or during a maintenance window. |
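Both views live in the gp_toolkit schema in Greenplum-derived clusters, so the checks can be scripted as in the sketch below (the database name is a placeholder).

```bash
#!/usr/bin/env bash
# Data maintenance checks; run in each database.
DBNAME=mydb   # hypothetical database name

# Tables that are missing optimizer statistics: run ANALYZE on each table listed.
psql -d "$DBNAME" -c "SELECT * FROM gp_toolkit.gp_stats_missing;"

# Tables with bloat that a regular VACUUM cannot reclaim: candidates for VACUUM FULL.
psql -d "$DBNAME" -c "SELECT * FROM gp_toolkit.gp_bloat_diag;"
```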
Database Maintenance
Activity | Procedure | Corrective Actions |
---|---|---|
Reclaim space occupied by deleted rows in heap tables so that the space they occupy can be reused. Recommended frequency: daily. Severity: CRITICAL | Vacuum user tables (see the sketch after this table). | Vacuum updated tables regularly to prevent bloating. |
Update table statistics. Recommended frequency: after loading data and before executing queries. Severity: CRITICAL | Analyze user tables. You can use the analyzedb management utility (see the sketch after this table). | Analyze updated tables regularly so that the optimizer can produce efficient query execution plans. |
Back up the database data. Recommended frequency: daily, or as required by your backup plan. Severity: CRITICAL | Run the gpbackup utility to create a backup of the coordinator and segment databases in parallel. | Best practice is to have a current backup ready in case the database must be restored. |
Vacuum, reindex, and analyze the system catalogs to maintain an efficient catalog. Recommended frequency: weekly, or more often if database objects are created and dropped frequently. | Reindex the system catalogs, vacuum the catalog tables, and then analyze them in each database (see the sketch after this table). | The optimizer retrieves information from the system tables to create query plans. If system tables and indexes are allowed to become bloated over time, scanning the system tables increases query execution time. It is important to run ANALYZE after reindexing, because REINDEX leaves indexes with no statistics. |
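A minimal sketch of the routine maintenance commands described in this table follows. The analyzedb, gpbackup, and reindexdb invocations and their options are illustrative and assume Greenplum-compatible tooling; check the LightDB-A utility references for the exact options in your release.

```bash
#!/usr/bin/env bash
# Hedged database maintenance sketch for one database.
DBNAME=mydb   # hypothetical database name

# Daily: reclaim space left by deleted and updated rows in heap tables.
psql -d "$DBNAME" -c "VACUUM;"

# After data loads, before querying: refresh optimizer statistics.
analyzedb -d "$DBNAME" -a

# Daily, or per your backup plan: parallel backup of coordinator and segments.
gpbackup --dbname "$DBNAME"

# Weekly: reindex the system catalogs, vacuum the catalog tables,
# then analyze them (REINDEX leaves indexes with no statistics).
reindexdb --system -d "$DBNAME"
psql -d "$DBNAME" -c "VACUUM pg_catalog.pg_class;"   # repeat for the other catalog tables
analyzedb -s pg_catalog -d "$DBNAME" -a
```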
Patching and Upgrading
Activity | Procedure | Corrective Actions |
---|---|---|
Ensure any bug fixes or enhancements are applied to the kernel. Recommended frequency: at least every 6 months. Severity: IMPORTANT | Follow the vendor’s instructions to update the Linux kernel. | Keep the kernel current to include bug fixes and security fixes, and to avoid difficult future upgrades. |
Install LightDB-A Database minor releases, for example 5.0.x. Recommended frequency: quarterly. Severity: IMPORTANT | Follow upgrade instructions in the LightDB-A Database Release Notes. Always upgrade to the latest in the series. | Keep the LightDB-A Database software current to incorporate bug fixes, performance enhancements, and feature enhancements into your LightDB-A cluster. |