PostgreSQL Cluster on Kubernetes

Usage

Database Version

The PostgreSQL version can be controlled via the pg_version string. The version MUST be one a valid tag of an image from the CNPG container repository.

The default value for pg_version is the one that we test in our clusters, so we recommend starting there.

At this time, major version upgrades are not supported for existing clusters (coming soon). Minor version upgrades will be applied automatically when pg_version is changed.

Custom PostgreSQL Images with Extensions

If you need PostgreSQL extensions (like pgvector, PostGIS, timescaledb, etc.), you can use a custom PostgreSQL container image instead of the default CNPG image.

Using Pre-built CNPG Images

CloudNativePG provides pre-built images with popular extensions:

module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  pg_custom_image = "ghcr.io/cloudnative-pg/postgis:17"
  # pg_version is ignored when pg_custom_image is set

  ...
}

Building Your Own Custom Image

For custom extension combinations, build your own image using Docker Bake:

Create a bake.hcl file:

extensions = [
  "pgvector",
  "pg-stat-kcache",
]

target "myimage" {
  dockerfile-inline = <<EOT
ARG BASE_IMAGE="ghcr.io/cloudnative-pg/postgresql:16.9-standard-bookworm"
FROM $BASE_IMAGE AS myimage
ARG EXTENSIONS
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends $EXTENSIONS && \
    apt-get purge -y --auto-remove && \
    rm -rf /var/lib/apt/lists/*
USER 26
EOT

  matrix = {
    pgVersion = ["16.9"]
  }
  # ... rest of config
}

Build and push the image:

registry=myregistry.io/myteam docker buildx bake \
  -f docker-bake.hcl \
  -f cwd://bake.hcl \
  "https://github.com/cloudnative-pg/postgres-containers.git" \
  myimage --push

Use the custom image:

module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  pg_custom_image = "myregistry.io/myteam/postgresql:16.9-custom"

  ...
}

For detailed instructions on building custom images, see the CNPG blog post on Docker Bake.

Custom images bypass the ECR pull through cache. Ensure:

Your image is accessible from your Kubernetes cluster
Proper image pull secrets are configured if using private registries
The image is compatible with CloudNativePG (includes required PostgreSQL executables)

Available Extensions

Common Debian package extensions you can install:

postgresql-NN-pgvector - Vector similarity search
postgresql-NN-postgis-3 - Geographic objects support
postgresql-NN-cron - Job scheduler
postgresql-NN-partman - Partition management
postgresql-NN-pg-stat-kcache - Kernel statistics
And many more from apt.postgresql.org

Replace NN with your PostgreSQL major version (e.g., 16, 17).

Credentials

For in-cluster applications, credentials can be sourced from the following Kubernetes Secrets named in the module’s outputs:

superuser_creds_secret: Complete access to the database
admin_creds_secret: Read and write access to the database (does not include the ability to preform sensitive operations like schema or permission manipulation)
reader_creds_secret: Read-only access to the database

Each of the above named Secrets contains the following values:

username: The username to use for authentication
password: The password to use for authentication

The credentials in each Secret are managed by Vault and rotated automatically before they expire. In the Panfactum Stack, credential rotation will automatically trigger a pod restart for pods that reference the credentials.

The credential lifetime is configured by the vault_credential_lifetime_hours input (defaults to 16 hours). Credentials are rotated 50% of the way through their lifetime. Thus, in the worst-case, credentials that a pod receives are valid for vault_credential_lifetime_hours / 2.

The module also supplies root_username and root_password outputs for the root user of the database. These credentials are unsafe to use as they are not automatically rotated without re-applying this module.

Connecting

The below example show how to connect to the PostgreSQL cluster through the read-write PgBouncer using dynamically rotated admin credentials by setting various environment variables in our kube_deployment module.

module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"
  ...
}

module "deployment" {
  source = "${var.pf_module_source}kube_deployment${var.pf_module_ref}"
  ...

  common_env_from_secrets = {
    POSTGRES_USERNAME = {
      secret_name = module.database.admin_creds_secret
      key = "username"
    }
    POSTGRES_PASSWORD = {
      secret_name = module.database.admin_creds_secret
      key = "password"
    }
  }
  common_env = {
    POSTGRES_HOST = module.database.pooler_rw_service_name
    POSTGRES_PORT = module.database.pooler_rw_service_port
  }
}

Storage

You must provide an initial storage amount for the database with pg_initial_storage_gb. This configures the size of the underlying EBS volumes.

Once the database is running, the PVC autoresizer (provided by kube_pvc_autoresizer) will automatically expand the EBS volumes once the free space drops below pg_storage_increase_threshold_percent of the current EBS volume size. The size of the EBS volume will grow by pg_storage_increase_gb on every scaling event until a maximum of pg_storage_limit_gb.

Note that a scaling event can trigger at most once every 6 hours due to an AWS limitation. As a result, ensure that pg_storage_increase_gb is large enough to satisfy your data growth rate.

Resource Allocation

The resources (CPU and memory) given to each PostgreSQL node is automatically scaled by the Vertical Pod Autoscaler when vpa_enabled is true (the default). You can control the ranges of the resources allocated to the pods via the following inputs:

pg_minimum_memory_mb
pg_maximum_memory_mb
pg_minimum_cpu_millicores
pg_maximum_cpu_millicores

Similarly, resources given to the PgBouncer instances are controlled via:

pgbouncer_minimum_memory_mb
pgbouncer_maximum_memory_mb
pgbouncer_minimum_cpu_millicores
pgbouncer_maximum_cpu_millicores

If vpa_enabled is false, the actual resource requests and limits will be set to the minimums.

You should take care to tune the memory minimums appropriately, especially for bursty workloads. If you are regularly issuing queries that take more than twice the 95th percentile memory utilization, you must manually set pg_minimum_memory_mb to a sensible value for your workloads as the VPA will under-provision resources.

Additionally, you should review our section on memory tuning below.

Memory Tuning

By default, we tune the PostgreSQL memory settings in accordance with the EDB recommendations since EDB is the original creator of CNPG.

However, PostgreSQL databases can be used to run such a wide array of workloads that you may want to tune the settings further for your particular use case.

These three settings are the most important:

pg_work_mem_percent: How much memory is set aside for intermediate calculations in a query (i.e., sorts, joins, etc.).
pg_max_connections: How many simultaneous connections / queries you can run in the database.
pg_shared_buffers_percent: How much of the memory given to the cluster is set aside for caching data so that it does not need to be read from disk.

pg_work_mem_percent is the most important and most likely to slow down your complex queries. This parameter represents the total memory set aside for all connections for their query calculations.

pg_max_connections is important as memory is allocated and limited on a per-operation basis. As a result, the actual memory available to each query is roughly pg_memory_mb * pg_work_mem_percent / pg_max_connections. This is true regardless of whether you are actually using the maximum number of connections. Therefore, if you are using the database to run large analytical queries, you may want to lower the pg_max_connections value in order to allow each query to use more of the working memory pool.

pg_shared_buffers_percent will not typically be a source of issues as the linux page cache will step in if this value is not large enough. However, the internal PostgreSQL cache controlled by this value will always be more performant than the generic page cache, so tuning this can help in some circumstances.

Shutdowns and Failovers

Postgres has three shutdown modes:

Smart Shutdown: Prevents new connections, but allows existing sessions to finish their work before shutting down.
Fast Shutdown: Prevents new connections and forces all existing sessions to abort safely before shutting down.
Immediate Shutdown: Immediately exits without doing normal database shutdown processing (including forcibly killing sessions w/ doing graceful aborts). Forces a database recovery operation on the next startup.

The default behavior of this module is to do a fast shutdown with a 30-second timeout (pg_switchover_delay) until an immediate shutdown is issued.

When running with a replica, this results in at most 30 seconds of downtime if the primary instance is terminated (the replicas are still readable); however, it will normally be around 5 seconds.

This is quite quick, but the downside is that any running queries on the primary will be immediately terminated and not allowed to complete. You can increase the amount of time that running queries are allowed to complete by increasing pg_smart_shutdown_timeout. However, this will increase the time that new sessions cannot be made with database proportionally (i.e., setting pg_smart_shutdown_timeout to 30 will allow 30 seconds for existing queries to complete but increase the amount of time that new queries cannot be made to about 35 seconds).

We generally recommend keeping pg_smart_shutdown_timeout set to the default 1 (minimum allowed by CNPG) in order to minimize downtime. Instead of trying to ensure queries will always complete, we recommend that you implement retry logic in your database client code. This will not only add resilience to this particular scenario, but will also be beneficial in other failure modes.

For more information about shutdowns, please see the CNPG documentation.

Synchronous Replication

By default, this module asynchronously replicates data written to the primary instance to each replica. This ensures that writes can be completed relatively quickly and that writes can continue even if the replicas are unavailable.

Generally, this is safe. On a shutdown or failover of the primary, writes are paused and replicas are given a chance to catch up with the primary before one of them is promoted.

However, there are a few drawbacks:

In the case of a catastrophic instance failure on the primary (e.g., power outage), there is a possibility of data loss (up to roughly 5 seconds).
If you are reading from the replicas, you may be reading stale data (up to roughly 5 seconds old).
About 5 seconds of additional downtime must be accounted for on failovers in order to allow the replicas to catch-up.

To avoid these drawbacks, you can set pg_sync_replication_enabled. However, this comes with its own tradeoffs:

Database mutations will take significantly longer because writes must be distributed across every replica before returning successfully.
Database mutations will not complete unless all replicas are available. This will increase the chance of downtime for write operations.

We advise you to stick with the default asynchronous replication unless you have a specific, demonstrated need for synchronous replication.

Recovery

In the case of an emergency, you can recover the database from the backups and WAL archives stored in S3.

Complete the following steps:

Run kubectl cnpg status -n <cluster_namespace> <cluster_name>. Verify that there exists a “First Point of Recoverability”.
If this is not available, that means that your logical PostgreSQL backups were not configured correctly and are not available. You will need to restore from the hourly EBS snapshots created by Velero instead.
Retrieve the backup_directory output from this module by running terragrunt output.
Delete the cluster resource manually via kubectl delete clusters.postgresql.cnpg.io -n <cluster_namespace> <cluster>.
Set the pg_recovery_mode_enabled module input to true and the pg_recovery_directory to the backup_directory output you retrieved in step 2.
Optionally, you can set the pg_recovery_target_time to an RFC 3339 timestamp (e.g., 2023-08-11T11:14:21.00000+00) to recover the database to a particular point in time. This must be after the “First Point of Recoverability” that was reported in step 1.
If pg_recovery_target_time is not provided, the database will be recovered to the latest data stored in S3 which should be within 5 seconds of the last database write.
Set the pg_backup_directory module input to anything other than the backup_directory output you retrieved in step 2. This ensures that the new cluster will not use overwrite the existing backup directory and instead create a new one.
Re-apply the module that contains this submodule and wait for the recovery to complete. The database should successfully come back online.
When the recovered database is back online, an initial backup of the new database will be performed. You can monitor it’s progress from the :backups.postgresql.cnpg.io in k9s. When this complete, you should see a “First Point of Recoverability” when running kubectl cnpg status -n <cluster_namespace> <cluster_name>. If an initial backup cannot be created, something has gone wrong, and you should restart the recovery process.
After the initial backup for the recovered database is created, you can optionally delete the pg_recovery_directory directory from the S3 bucket provided by the backup_bucket_name output. This can save space as that old backup is no longer needed.

Note that the “First Point of Recoverability” is determined by the backups_retention_days input; backups older than backups_retention_days (default 3) will be deleted, and you will no longer be able to recover to that point in time.

For more information on recovery procedures, see the CNPG recovery documentation.

Note that you can optionally restore from an alternate backup bucket by setting the pg_recovery_bucket input. This can be useful if you are trying to set up a new forked database from running system.

Recovering from an Alternate Backup Bucket

You can recover a PostgreSQL cluster from backups stored in a different S3 bucket by using the pg_recovery_bucket variable. This is particularly useful for:

Forking production data to staging/development environments
Cross-account disaster recovery scenarios
Migrating databases between clusters

Example Configuration:

module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  pg_recovery_mode_enabled = true
  pg_recovery_bucket      = "prod-pg-cluster-backup-a1b2c3d4"
  pg_recovery_directory   = "backups/cluster-prod"
  pg_backup_directory     = "backups-from-prod-recovery"  # Different from source

  ...
}

Required: Production Bucket Policy

The production S3 bucket must grant read access to the IAM role of the recovering cluster. Add the following policy to your production backup bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyAccessForDevelopment",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:ListBucketVersions"
      ],
      "Resource": [
        "arn:aws:s3:::prod-pg-cluster-backup-a1b2c3d4/*",
        "arn:aws:s3:::prod-pg-cluster-backup-a1b2c3d4"
      ]
    },
    {
      "Sid": "FullAccessForProduction",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::987654321098:root"
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::prod-pg-cluster-backup-a1b2c3d4/*",
        "arn:aws:s3:::prod-pg-cluster-backup-a1b2c3d4"
      ]
    },
    {
      "Sid": "DenyWriteFromDevelopment",
      "Effect": "Deny",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": [
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:AbortMultipartUpload"
      ],
      "Resource": "arn:aws:s3:::prod-pg-cluster-backup-a1b2c3d4/*"
    }
  ]
}

Replace:

123456789012 with your development/staging AWS account ID
987654321098 with your production AWS account ID
prod-pg-cluster-backup-a1b2c3d4 with your actual production bucket name

Recovery to a Specific Backup Point

Instead of recovering to a specific timestamp with pg_recovery_target_time, you can recover to the end of a specific base backup using pg_recovery_target_immediate. This stops recovery at the point where the backup ended, without replaying any additional WAL files.

When to use pg_recovery_target_immediate:

You want a clean snapshot from a known backup point
You want to avoid replaying WAL files (faster recovery)
You’re testing disaster recovery procedures with a specific backup

Example Configuration:

module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  pg_recovery_mode_enabled     = true
  pg_recovery_bucket           = "prod-pg-cluster-backup-a1b2c3d4"
  pg_recovery_directory        = "backups/cluster-prod"
  pg_recovery_target_immediate = "20251015T121455"  # Backup ID timestamp
  pg_backup_directory          = "backups-from-specific-point"

  ...
}

The pg_recovery_target_immediate value must be a backup ID in the format YYYYMMDDTHHmmss (e.g., 20251015T121455). You can find available backup IDs by running:

kubectl cnpg backup list -n <namespace> <cluster-name>

Or by inspecting the backup directory structure in your S3 bucket.

Disruptions

By default, failovers of PostgreSQL pods in this module can be initiated at any time. This enables the cluster to automatically perform maintenance operations such as instance resizing, AZ re-balancing, version upgrades, etc. However, every time a PostgreSQL pod is disrupted, running queries will be terminated prematurely and a short period of downtime might occur if the disrupted pod is the primary instance (see the Shutdowns and Failovers section).

You may want to provide more control over when these failovers can occur, so we provide the following options:

Disruption Windows

Disruption windows provide the ability to confine disruptions to specific time intervals (e.g., periods of low load) if this is needed to meet your stability goals. You can enable this feature by setting voluntary_disruption_window_enabled to true.

The disruption windows are scheduled via voluntary_disruption_window_cron_schedule and the length of time of each window via voluntary_disruption_window_seconds.

If you use this feature, we strongly recommend that you allow disruptions at least once per day, and ideally more frequently.

For more information on how this works, see the kube_disruption_window_controller submodule.

Custom PDBs

Rather than time-based disruption windows, you may want more granular control of when disruptions are allowed and disallowed.

You can do this by managing your own PodDisruptionBudgets. This module provides outputs that will allow you to match certain subsets of pods for both PostgreSQL and PgBouncer.

For example:

module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"
  ...
}

resource "kubectl_manifest" "pdb" {
  yaml_body = yamlencode({
    apiVersion = "policy/v1"
    kind       = "PodDisruptionBudget"
    metadata = {
      name      = "custom-pdb"
      namespace = module.database.namespace
    }
    spec = {
      unhealthyPodEvictionPolicy = "AlwaysAllow"
      selector = {
        matchLabels = module.database.cluster_match_labels # Selects all PostgreSQL pods
      }
      maxUnavailable = 0 # Prevents any disruptions
    }
  })
  force_conflicts   = true
  server_side_apply = true
}

While this example is constructed via IaC, you can also create / destroy these PDBs directly in your application logic via YAML manifests and the Kubernetes API. This would allow you to create a PDB prior to initiating a long-running query that you do not want disrupted and then delete it upon completion.

Completely Disabling Voluntary Disruptions

Allowing the cluster to periodically initiate failovers of PostgreSQL is critical to maintaining system health. However, there are rare cases where you want to override the safe behavior and disable voluntary disruptions altogether. Setting the voluntary_disruptions_enabled to false will set up PDBs that disallow any voluntary disruption of either PostgreSQL or PgBouncer pods.

This is strongly discouraged. If limiting any and all potential disruptions is of primary importance you should instead:

Create a one-hour weekly disruption window to allow some opportunity for automatic maintenance operations
Ensure that spot_instances_enabled and burstable_instances_enabled are both set to false
Connect through PgBouncer with pgbouncer_pool_mode set to transaction

Note that the above configuration will significantly increase the costs of running PostgreSQL (2.5-5x) versus more flexible settings. In the vast majority of cases, this is entirely unnecessary, so this should only be used as a last resort.

Enabling PDBs either manually or via disruption windows will not prevent all forms of disruption, only voluntary ones. A voluntary disruption is one that is done through the Eviction API and limited by the use of PDBs.

An example of a non-voluntary disruption would be via spot node termination or resource constraints. As a result, you should still implement defensive coding practices in your client code to account for potential disruptions.

PostgreSQL Parameters

PostgreSQL comes with hundreds of parameters that can be used to customize its behavior. You can see the full set of available values here, and you can provide them to this module via pg_parameters.

This can be used overwrite any default settings this module provides.

Extra Schemas

When initially created, the CNPG cluster has just one schema (public) in the app database. However, you may choose to add more in the future.

If you do, you will need to add those schemas to the extra_schemas input. This will ensure that users and roles created by the Vault auth system will have access to the objects in those schemas.

Note that this will NOT create the extra schemas; you should do that in your database migration scripts.