Edge Releases
Edge releases do not receive patches and make no backwards compatibility guarantees.
You should avoid using these releases in production environments. Learn more here.
To upgrade your Panfactum stack version, please follow the instructions in the upgrade guide.
Unreleased
Added
- Adds support for using a private git repository for first-party IaC modules by providing the `GIT_USERNAME` and `GIT_PASSWORD` environment variables. See the updated documentation.
Fixed
- DaemonSets in the cluster now update in constant time. Previously, the update time scaled with the number of nodes in the cluster, which led to timeouts.
- Resolves a bug that caused wf_tf_deploy workflows to fail.
- Resolves a bug that caused module deployment to fail if Kubernetes settings weren't set for the region, even if Kubernetes wasn't used.
edge.25-02-18
This release causes issues in the CI/CD pipelines for IaC deployments. This is resolved in the subsequent release.
Fixed
- The `pf` provider will now receive Kubernetes metadata regardless of whether the Kubernetes providers are enabled in the module tree.
- Pinning the `version` of first-party IaC modules should now work without error regardless of which version of the Panfactum modules is used (including when using a local copy).
- `ignore_replica_count` in kube_deployment and kube_stateful_set will now properly avoid resetting `spec.replicas` to the `replicas` input if `spec.replicas` has been mutated by an external process.
- `kube_cert_manager` now uses `{}` instead of `null` for `webhookConfigurations`.
edge.25-02-10
Breaking Changes
- This update requires that you apply kube_vpa before any other module. If you run into any issues, set `vpa_enabled` to `false` before you apply the module and re-enable it once the module is deployed.
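A minimal Terragrunt sketch of temporarily disabling the VPA for one module while working through this upgrade; the file shown is a hypothetical module deployment and omits everything else it would normally contain:

```hcl
# terragrunt.hcl for a hypothetical module deployment
inputs = {
  # Temporarily disable vertical pod autoscaling until kube_vpa has been applied
  vpa_enabled = false
}
```

Once kube_vpa is deployed, flip the value back to `true` (or remove the override) and re-apply.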
Added
- Most Kubernetes modules now have a `wait` input that can be set to `false` if you do not wish to wait for the resources to reach a ready state before proceeding with the deployment. This significantly improves the speed of deploying updates but disables automatic rollback if something goes wrong; manual intervention may be required if a deployment fails.
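A minimal sketch of the new `wait` input on a Panfactum Kubernetes submodule; the module name and surrounding configuration are illustrative, and other required inputs are omitted:

```hcl
module "example_deployment" {
  source = "${var.pf_module_source}kube_deployment${var.pf_module_ref}"

  # Do not block until resources are ready; trades automatic rollback for faster deploys
  wait = false

  # ... other required inputs ...
}
```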
Fixed
- kube_bastion now always uses two replicas to ensure tunnels can immediately reconnect if one bastion gets restarted.
- Due to a bug in how Helm manages CRDs, the CRDs included in kube_vpa were not appropriately updated in the previous release. This release resolves the issue.
- Adjusts the bootstrapping steps for Karpenter to include instructions for managing the `wait` input.
- Fixes an issue that prevented kube_policies from being deployed in the bootstrapping guide because it referenced the non-existent `node-image-cache` namespace.
edge.25-02-07
This release contains a VPA CRD bug that will make it difficult to upgrade to the following release without manual intervention. Please skip this release and proceed directly to the next.
Changed
- Enables the Access Token auth method for the Argo Workflows server to allow direct programmatic access to its API.
- When using a Panfactum module, the vertical pod autoscaler will only evict pods when resources need to be scaled up, not down. This should reduce unnecessary resource thrash and improve overall cluster stability. As pod lifetimes are generally capped at four hours, downscaling will still occur (just not as frequently).
Added
- Adds the ability to pass extra service annotations through the kube_deployment module.
Fixed
- Added the `pg_minimum_cpu_update_millicores` input to kube_pg_cluster in order to reduce autoscaling thrash caused by frequent small updates in the VPA's CPU recommendations. Before this was introduced, setting `vpa_enabled` to `true` would occasionally cause significant instability.
- Applied a fix for the argo-events write-hole issue in kube_argo.
- Fixes a bug that prevented kube_cert_manager from being deployed when `self_generated_certs_enabled` was set to `true`.
- Fixes the `aws_eks` subnet validation check that prevented module deployment in some valid scenarios.
edge.25-01-09
Added
- kube_policies now has `common_env` and `common_secrets` inputs that inject environment variables into all containers in the cluster.
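A minimal sketch of these inputs on a kube_policies deployment, assuming both accept a simple map of environment variable names to values; the values shown are illustrative:

```hcl
module "policies" {
  source = "${var.pf_module_source}kube_policies${var.pf_module_ref}"

  # Injected into every container in the cluster (illustrative values)
  common_env = {
    LOG_FORMAT = "json"
  }
  common_secrets = {
    EXAMPLE_API_TOKEN = var.example_api_token
  }
}
```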
Fixed
- Pins Bottlerocket OS AMIs to pre-tested versions, as AWS occasionally publishes breaking AMI changes that can crash nodes in the cluster.
- Fixes the pre- and post-condition checks for the `aws_eks` module when `sla_target` is set to 1.
edge.25-01-04
Breaking Changes
- This release adds additional functionality to Vault, which requires vault_auth_oidc to be upgraded before any other module.
- The `kube_rbac` and `kube_priority_classes` modules have been removed per the deprecation notice in edge.24-12-13.
Added
- Adds a module for deploying Grist, a next-generation spreadsheet system: kube_grist.
- Adds an alternative mechanism for creating dynamically-rotated AWS credentials for when IRSA is not an option: kube_aws_creds.
- kube_deployment and kube_stateful_set now provide native support for voluntary disruption windows.
Fixed
- Addressed an issue where pods could not be created if all Kyverno admission controllers were disrupted simultaneously. As the Kyverno admission controller is itself composed of pods, this would result in a cluster deadlock that required manual intervention. This degenerate behavior has been fully resolved in this release.
- Addressed an issue where the Kubernetes API server address was set incorrectly when deploying kube_cilium with wf_tf_deploy.
- Helm charts deployed by Panfactum modules will no longer be automatically rolled back on deployment failure, which should prevent several failure cases where manual intervention would otherwise have been necessary.
- The StatefulSets in kube_nats no longer need to be redeployed after each update of resource tags / labels.
- `pf-tunnel` now binds to `127.0.0.1` instead of `localhost` to resolve potential connectivity problems on diverse operating systems.
edge.24-12-19
Breaking Changes
- Introduces the concept of SLA Target Levels. This makes it easier to (a) know what uptime you can expect from Panfactum deployments, and (b) adjust the cost-to-availability tradeoff for entire subsections of the deployment.
  This feature comes with the following changes:
  - Provides a new Terragrunt variable, `sla_target`, that can be used to set the target level for a particular scope (e.g., environment, region, module). It defaults to `3`. (See the example after this list.)
  - The default behavior of Panfactum modules will now automatically adjust to the provided `sla_target`.
  - The `enhanced_ha_enabled` input has been removed from all modules. The previous behavior when `enhanced_ha_enabled` was set to `true` (the default) is now equivalent to setting `sla_target` to `3` (the default).
- This release upgrades the following terraform provider versions, which will need to be updated in first-party IaC:
  - `pf`: 0.0.5 -> 0.0.7
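A hypothetical sketch of pinning the SLA target for a single module, assuming it can be overridden through that module's Terragrunt inputs like other Panfactum variables (the exact mechanism for setting it at the environment or region scope may differ; consult the Panfactum docs):

```hcl
# terragrunt.hcl for a hypothetical module deployment
inputs = {
  # Trade some availability for lower cost in this scope (defaults to 3)
  sla_target = 2
}
```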
Added
- Adds support for arbitrary path rewriting in kube_ingress, kube_aws_cdn, aws_cdn, and aws_s3_public_website.
- wf_dockerfile_build now supports sourcing base images from private ECR repositories.
- Adds `not_found_path` to aws_s3_public_website to facilitate specifying the asset to load when no object exists at the requested path.
- Adds `custom_error_responses` to aws_cdn, which can be used to overwrite error responses from the upstream origin.
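A minimal sketch of the new `not_found_path` input on aws_s3_public_website; the asset path is illustrative and other required inputs are omitted:

```hcl
module "website" {
  source = "${var.pf_module_source}aws_s3_public_website${var.pf_module_ref}"

  # Object to serve when no object exists at the requested path (illustrative path)
  not_found_path = "/404.html"

  # ... other required inputs ...
}
```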
Fixed
- Addressed a conflicting PDB issue with the kube_redis_sentinel module that prevented vertical autoscaling from working.
- Standard Panfactum environment variables for Kubernetes workloads are now injected before user-defined environment variables so they are available for use in dependent variables.
- Standard Panfactum environment variables for Kubernetes workloads will no longer override user-defined environment variables.
- Addressed an issue where the CRDs in kube_aws_lb_controller were not automatically upgraded.
- Fixed incorrect AWS permissions in kube_aws_lb_controller.
edge.24-12-13
This Authentik upgrade contains a problem that will result in updates to group names not automatically synchronizing with AWS.
While we are working with Authentik to develop a workaround, it may be a few more releases until this is resolved. If that is a problem for you, defer upgrading to this version until it is fixed.
This release contains a bug that will cause Cilium to crash if deployed via wf_tf_deploy. Please ensure you upgrade to edge.25-01-04 locally before re-enabling CI/CD deployments for the core infrastructure.
Breaking Changes
- The `kube_rbac` module has been deprecated and will be removed in the next release. Please destroy any deployments of it after upgrading aws_eks.
  Kubernetes access control has now been moved to the aws_eks module using EKS access entries. This provides several benefits:
  - Kubernetes RBAC now works out-of-the-box, making cluster bootstrapping simpler.
  - Accidental lock-out is now fully prevented.
  - One fewer location where custom SSO roles need to be synchronized.
- The `kube_priority_classes` module has been consolidated with kube_policies in order to remove a superfluous bootstrapping step. Please destroy any deployments of it immediately before upgrading kube_policies.
- `eks_cluster_name` is no longer an input to most submodules as it is now dynamically resolved based on which cluster you are deploying to.
- This release upgrades the following terraform provider versions, which will need to be updated in first-party IaC:
  - `pf`: 0.0.4 -> 0.0.5
  - `authentik`: 2024.6.1 -> 2024.8.4
Changed
- Upgrades Authentik in kube_authentik to 2024.8.2 (release notes).
Fixed
- Adds correct permissions to allow users to retry specific Workflow nodes in Argo Workflows.
- Adds automatic NATS connection retries to Argo Events components.
- Addresses an issue in wf_dockerfile_build where the `git_ref` could not be a branch name.
edge.24-12-11
This release contains a bug that will cause Cilium to crash if deployed via wf_tf_deploy. Please ensure you upgrade to edge.25-01-04 locally before re-enabling CI/CD deployments for the core infrastructure.
AWS published an AMI update to their Bottlerocket OS on January 4, 2025 that breaks compatibility with all edge releases until edge.25-01-09. You should upgrade your aws_eks and karpenter_node_pools modules directly to edge.25-01-09 to avoid cluster disruption. You may need to manually tweak some inputs (e.g., `sla_target`) to ensure proper deployment.
Breaking Changes
- All terraform provider versions in Panfactum modules have been upgraded to new values, so any first-party IaC modules that utilize Panfactum submodules will need to have their provider versions upgraded as well.
- This release upgrades many components of the Panfactum Stack. Generally, none of these upgrades should require any action on your part. However, see the release notes for each component for more information:
  - Kubernetes: 1.29 -> 1.30
  - Authentik: 2024.4.2 -> 2024.6.4
  - Argo Workflows: 3.5 -> 3.6
  - Karpenter: 1.0 -> 1.1
  - Redis: 7.2 -> 7.4
  - Velero: 1.13 -> 1.15
  - VPA: 1.1 -> 1.2
  - PostgreSQL: 16.4 -> 16.6
Added
- aws_eks and kube_karpenter_node_pools can now configure each node's root volume size via `node_ebs_volume_size_gb`.
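A minimal sketch of the new input on aws_eks; the value is illustrative and other required inputs are omitted:

```hcl
module "eks" {
  source = "${var.pf_module_source}aws_eks${var.pf_module_ref}"

  # Root EBS volume size for each node, in GB (illustrative value)
  node_ebs_volume_size_gb = 50

  # ... other required inputs ...
}
```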
Fixed
- Addresses an issue where non-HA clusters could not recover when many nodes were disrupted at once.
edge.24-12-10
Breaking Changes
- This release changes the way that public ingress TLS certificates are provisioned in order to avoid hitting rate limits on large clusters. This architectural update requires that the modules be upgraded in the following order:
  1. kube_ingress_nginx. To avoid service disruptions, you MUST wait until all of the old NGINX pods have been fully terminated before proceeding.
  2. The remainder of the modules may be updated in any order.
Fixed
- Adds the `bootstrap_cluster_creator_admin_privileges` input to aws_eks to provide backwards compatibility with clusters that were created with this field set to `true`.
- Temporary Authentik disruptions caused by PostgreSQL database failovers have been mitigated.
edge.24-12-05
When upgrading aws_eks to this version, you may receive an error about attempting to recreate the cluster due to this change:

```
bootstrap_cluster_creator_admin_permissions = true -> false # forces replacement
```

To work around this issue, upgrade the aws_eks module directly to edge.24-12-10 and set the new `bootstrap_cluster_creator_admin_privileges` input to `true`.
kube_nats in this version contains a bug that forces redeployment of the underlying NATS StatefulSet on every tag / label update. This also impacts kube_argo_event_bus, which utilizes NATS under the hood.
This will cause complete loss of any pending NATS messages in any Jetstream streams. For most users, this should be OK as NATS is primarily used for temporary storage as an event bus. However, if you cannot afford to lose your stream data, you should delay upgrading those modules until your cluster reaches edge.24-12-22, which contains the fix.
Due to the default memory floor for kube_argo_event_bus introduced in this release, inbound webhook events for Argo EventSources may be rejected intermittently. edge.25-01-04 contains more sane defaults and includes more options for tuning the EventBus to handle different traffic load patterns.
Breaking Changes
- This release contains a major version upgrade to Linkerd.
  This upgrade removes the need for the privileged `proxy-init` initContainer to be injected into every pod, as the initialization logic is now completed once per node. This should reduce pod startup times by 5-20 seconds and improve overall security by removing the need to run a privileged container in each pod.
  To upgrade with no downtime, you MUST update the modules in the following order:
  - The remainder of the modules may be updated in any order.
- The NATS backend for kube_argo_event_bus has been replaced with our enhanced NATS module, kube_nats. This provides improved availability, security, observability, and performance.
  To apply this module, you will need to manually delete any existing `EventBus` resources in your cluster, or you will receive an error. You will also need to delete any associated `EventSource` or `Sensor` resources before deleting the `EventBus`, or the `EventBus` deletion will be blocked.
  Deleting an existing EventBus will cause any unprocessed events to be deleted. Make sure that you have no pending events before performing this upgrade.
- The `kube_fledged` and `kube_reflector` modules have been removed (they were deprecated in edge.24-11-13).
- The `images` input of kube_node_image_cache has been updated to take a list of image configuration options rather than a list of image strings. (See the sketch below.)
  Additionally, `node_image_cached_enabled` has been removed as a top-level input from Panfactum submodules (e.g., kube_deployment) as image cache settings can now be configured on a per-container basis.
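A hypothetical sketch of the new per-image configuration shape for kube_node_image_cache; the field names inside each entry are assumptions for illustration and may differ from the module's actual schema:

```hcl
module "image_cache" {
  source = "${var.pf_module_source}kube_node_image_cache${var.pf_module_ref}"

  # Each entry is now a configuration object rather than a bare image string
  # (field names below are illustrative assumptions)
  images = [
    {
      registry   = "public.ecr.aws"
      repository = "example/app"
      tag        = "1.2.3"
    }
  ]
}
```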
Changed
- Added support for the NATS Jetstream message broker via a new submodule, kube_nats. This release also adds NATS integration with the devShell tooling, including adding the `nats` CLI and updating `pf-db-tunnel` to support connecting with NATS clusters.
- aws_eks now launches with `arm64` nodes when `bootstrap_mode_enabled` is `true`, as we have resolved the remaining issues that prevented `arm64` from being used during bootstrapping.
- aws_eks now has EKS access entries enabled.
- aws_eks now has ARC Zonal Shift enabled if running nodes in multiple subnets.
- kube_ingress_nginx now has ARC Zonal Shift enabled.
- kube_vault now schedules pods exclusively on `arm64` nodes in order to support the integration of external secret plugins.
Added
- The kube_node_image_cache_controller has been updated with a “prepull” component that automatically pulls cached images in parallel as soon as a node launches. Previously, images were pulled serially, which resulted in significant delays when many large images were cached.
- The kube_descheduler will now automatically recreate pods that were not run through the Kyverno policy engine. This provides protection in case the Kyverno admission controller is ever offline.
- Images provided to and/or used by Panfactum submodules (e.g., kube_deployment, kube_pg_cluster, etc.) are now cached by default.
- Additional annotations and labels can now be added to the controllers created via kube_deployment, kube_stateful_set, kube_daemon_set, and kube_cron_job.
- The `kyverno` CLI has been added to the devShell.
- Adds support for dynamically generated labels in wf_spec via `labels_from_parameters` and `labels_from`.
- kube_argo_event_source now creates a ServiceAccount and outputs its name. This can be used to assign AWS (or other) permissions to the EventSource pods.
- Adds the ability to configure temporary storage space size in wf_tf_deploy.
Fixed
- The kube_node_image_cache_controller will now deduplicate images that are added to the cache by kube_node_image_cache.
- We have adjusted the Kyverno settings to improve overall stability of the mutation engine.
- Resolved slow Vault startup times for Vault databases larger than 100MB in kube_vault.
- BuildKit cache PVCs are now excluded from Velero backups as they consume a lot of storage and are safe to delete.
- Fixed root user access provisioning in kube_rbac.
- Addressed an issue where the Descheduler was not replacing pods that were older than the max lifetime.
- Addressed an issue where resetting one's own password via Authentik caused an unauthorized error.
- Fixed mount permissions in wf_spec.
edge.24-11-13
This release introduces Kyverno. Unfortunately, we discovered several issues with our initial architecture that could cause degenerate cluster behavior eventually resulting in a full cluster shutdown.
Generally, this takes days to occur, so it is safe to upgrade to this release so long as you immediately continue to upgrade to subsequent releases where the issues are resolved.
All issues were fully resolved in the edge.25-01-04 release.
Breaking Changes
- We have added the Kyverno policy engine as a core part of the Panfactum Stack. Kyverno allows us to install rules onto the cluster to automatically generate, mutate, or validate resources based on a powerful, Kubernetes-native expression language. This provides several benefits:
  - Provides a unified control plane for adding functionality that previously required managing additional controllers or custom scripts.
  - Allows us to simplify several parts of our IaC modules by offloading resource management to global Kyverno policies.
  - Allows us to add Panfactum-compatible, sensible defaults to Kubernetes resources that are not created by Panfactum modules.
  - Allows users to add management logic to their clusters that was previously only possible by building and deploying custom controllers. See the example policies.
  You must install Kyverno by following this new bootstrapping guide section. Many modules now depend on Kyverno and will not function without it.
- `kube_fledged` has been removed in favor of a new node-local image caching mechanism built by Panfactum on top of Kyverno. The new mechanism has the following benefits over `kube_fledged`:
  - The node's image cache will be created immediately when a node launches, concurrently with other node setup steps.
  - Cached images will never be removed from the node's image store.
  - Overall controller performance is significantly improved, reducing the overall resource requirements for caching.
  - The caching mechanism no longer generates pods that prevent Karpenter from disrupting underutilized nodes.
  To install the new mechanism, please follow this guide. To start caching images, you may use the new kube_node_image_cache module. Additionally, we provide a new input to our submodules (such as kube_deployment) called `node_image_cached_enabled` that, when enabled, will automatically add the submodule's images to the node-local image cache.
  `kube_fledged` must be removed from your clusters before upgrading to the next version as it will no longer be available in the next release. It should not be removed until Kyverno is installed.
- `kube_reflector` has been removed in favor of a new syncing mechanism built by Panfactum on top of Kyverno.
  - To sync ConfigMaps, use kube_sync_config_map.
  - To sync Secrets, use kube_sync_secret.
  `kube_reflector` must be removed from your clusters before upgrading to the next version as it will no longer be available in the next release. It should not be removed until Kyverno is installed.
- Vertical pod autoscaling now works for both the PostgreSQL clusters and PgBouncer deployments created by the kube_pg_cluster submodule. The following variables have been removed:
  - `pg_memory_mb`
  - `pg_cpu_millicores`
  and the following variables have been added (see the sketch after this list):
  - `pg_minimum_memory_mb`
  - `pg_maximum_memory_mb`
  - `pg_minimum_cpu_millicores`
  - `pg_maximum_cpu_millicores`
  - `pgbouncer_minimum_memory_mb`
  - `pgbouncer_maximum_memory_mb`
  - `pgbouncer_minimum_cpu_millicores`
  - `pgbouncer_maximum_cpu_millicores`
  This change also resolves issues where some values for `pg_cpu_millicores` caused a permanent reconciliation conflict.
- All pods in Panfactum clusters will now automatically tolerate the `arm64` and `spot` node taints, regardless of whether they were created by Panfactum modules (this was already the default for Panfactum modules). To disable these tolerations for a specific pod, you must add the `panfactum.com/arm64-enabled = "false"` or `panfactum.com/spot-enabled = "false"` labels, respectively.
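A minimal sketch of the new kube_pg_cluster resource bounds referenced above; the values are illustrative and other required inputs are omitted:

```hcl
module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  # Bounds used by the vertical pod autoscaler (illustrative values)
  pg_minimum_memory_mb      = 500
  pg_maximum_memory_mb      = 4000
  pg_minimum_cpu_millicores = 250
  pg_maximum_cpu_millicores = 2000

  pgbouncer_minimum_memory_mb      = 25
  pgbouncer_maximum_memory_mb      = 250
  pgbouncer_minimum_cpu_millicores = 15
  pgbouncer_maximum_cpu_millicores = 250

  # ... other required inputs ...
}
```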
Changed
- We have upgraded the CNPG operator in kube_cloudnative_pg to 1.24 (up from 1.23). This adds additional stability improvements during failover events.
  After performing this upgrade, you MUST use the new kube_pg_cluster submodule as well. Old versions are no longer compatible.
- We have upgraded the default PostgreSQL version in kube_pg_cluster to 16.4 (up from 16.2). This upgrade should not require any action on your part, but be sure to pin your PostgreSQL version if you do not want to be automatically upgraded.
Added
- Adds a new submodule, kube_daemon_set, for creating Kubernetes DaemonSets.
Fixed
- Added a Kyverno rule that forces linkerd sidecars to terminate prior to the pod's `terminationGracePeriodSeconds` to ensure that pods are not marked as “failed” by controllers such as Argo if the main container has a TCP connection leak.
- Resolved unnecessary log noise that was introduced in the last release when running Terragrunt commands.
- Adjusted the Cilium deployment to address edge cases where Cilium would not successfully launch new nodes after a complete zonal or cluster outage.
edge.24-10-25
A bug has been discovered in this release that can cause a complete cluster crash due to the introduction of the new Kyverno policy engine. Please skip this release and use edge.24-11-13 instead.
edge.24-10-23
Breaking Changes
- The required Nix version to use the Panfactum Stack has been updated to `>= 2.23` (up from `>= 2.18`). The latest Nix versions include performance improvements required to make local development ergonomic on all operating systems. Additionally, we have added a check to the loading script (`.envrc`) to ensure that users have a compatible Nix version installed.
  If you installed Nix using the Determinate Systems installer, see these upgrade docs.
Changed
- Panfactum modules are now downloaded as gzipped tarballs from an HTTPS server rather than requiring a full git clone of the Panfactum Stack repository. This should dramatically improve initialization speed of modules and reduces network bandwidth by over 90%. This is an internal refactor that should not have any impact on how you use Panfactum modules.
Added
- Added a new module, aws_s3_public_website, to enable users to serve files directly from an S3 bucket via CloudFront.
- aws_cdn can now handle CORS headers on behalf of the origin servers.
- aws_cdn now uses 10x more efficient CloudFront functions for request / response mutations.
Fixed
- Deploying modules that use Helm charts hosted in ECR (e.g., kube_karpenter) will now use the appropriate credentials.
- Upgraded Argo Workflows to fix some issues related to workflow timeouts being ignored.
edge.24-10-21
Breaking Changes
- In all Panfactum submodules, `instance_type_spread_required` has been renamed to `instance_type_anti_affinity_required`, as we have had to replace TopologySpreadConstraints with AntiAffinity rules to work around this issue with Karpenter.
  This change will ensure that Karpenter will not randomly create massive nodes.
- To add further protection against Karpenter provisioning extremely large nodes, we have added two variables to kube_karpenter_node_pools, `max_node_memory_mb` and `max_node_cpu`, that limit the maximum size of nodes that can be provisioned.
  The default limits are 64GB of memory and 32 CPUs. If you require nodes larger than these limits, you will need to adjust these new inputs.
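A minimal sketch of raising these limits on kube_karpenter_node_pools; the values are illustrative and other required inputs are omitted:

```hcl
module "node_pools" {
  source = "${var.pf_module_source}kube_karpenter_node_pools${var.pf_module_ref}"

  # Allow larger nodes than the defaults of 64GB memory and 32 CPUs
  max_node_memory_mb = 131072
  max_node_cpu       = 64

  # ... other required inputs ...
}
```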
Fixed
- Prevents Karpenter from scheduling instances on bare metal instance types, which we have observed issues with.
- Removes memory limits on the Cilium node agent in kube_cilium, as these limits can cause Cilium to fail to launch on larger node sizes. This is due to the fact that Cilium's memory requirements increase proportionally to the size of the node, but the VPA does not take this into account when assigning limits.
- Upgrades kube_ingress_nginx so that it can run on nodes with a large number of CPU cores.
- EBS-backed PVs with many large files took a long time to mount due to this issue with Bottlerocket OS (our underlying node OS). We have added the recommended remediation, and PVs should now mount nearly instantly. Note that this fix will not apply to existing PVs, only new ones.
  To apply the fix to existing PVs, you will need to manually add the following mount option to their manifests:

  ```yaml
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: XXXX
  spec:
    mountOptions:
      - context="system_u:object_r:local_t:s0"
  ```
edge.24-10-18
Breaking Changes
- We have removed devenv from the Panfactum Stack and now use plain Nix flakes to manage the local development shell (aka the “devShell”). We did not use the vast majority of the features in devenv, and its removal comes with a couple of key improvements:
  - Greatly increased performance on macOS. Initial installation should now take ~5 minutes (down from 45+). Additionally, opening the devShell after initial installation should now be instant.
  - More control and flexibility over the Panfactum setup, which will allow us to better implement future Panfactum features.
  However, this does come with a few key changes that you must perform manually:
  - The syntax for your `flake.nix` has changed.

    Before:

    ```nix
    {
      inputs = {
        # Change 'nixos-23.11' to whichever cut of the nixpkgs repository
        # you want to use in your project. This will NOT impact the Panfactum stack at all.
        # For available versions, see https://github.com/NixOS/nixpkgs
        # We recommend using the version that is supported here:
        # https://search.nixos.org/packages (updated every 6 mo)
        pkgs.url = "github:NixOS/nixpkgs/nixos-23.11";

        # Change 'main' to be the release version that you desire
        # Ensure that this matches the version you use for your infrastructure modules
        panfactum.url = "github:panfactum/stack/edge.25-02-18";
      };

      outputs = { self, panfactum, pkgs, ... } @ inputs: {
        devShells = panfactum.lib.mkDevShells {
          inherit pkgs;
          modules = [ (import ./devenv.nix) ];
        };
      };
    }
    ```

    After:

    ```nix
    {
      inputs = {
        flake-utils.url = "github:numtide/flake-utils"; # Utility for generating flakes that are compatible with all operating systems
        panfactum.url = "github:panfactum/stack/edge.25-02-18"; # Make sure this matches your version of the Panfactum Stack
      };

      outputs = { panfactum, flake-utils, ... }@inputs:
        flake-utils.lib.eachDefaultSystem
          (system:
            {
              devShell = panfactum.lib.${system}.mkDevShell { };
            }
          );
    }
    ```
  - We no longer support `devenv` syntax, so your `devenv.nix` file and the `.devenv` directory can be removed.
For alternatives to all the functionality included in devenv using our new devShell paradigm, please see our documentation.
- `pf-get-version-hash` has been renamed to `pf-get-commit-hash` to better reflect what it does (get a commit hash given an arbitrary repo and git ref). In addition, it has been updated to take named rather than positional arguments in order to align with other Panfactum scripts. Finally, we have fixed several bugs in the script to make it more resilient to various inputs.
- Removes `pgadmin4` from the devShell as it significantly increased build times and was not useful to all users. Users should have the option to pick their favorite DB clients rather than us being prescriptive.
Changed
- Upgrades kube_cilium to v1.16.3. This change brings new Cilium features, reduces per-node memory usage by 75MB, and reduces the number of errors users can encounter during the bootstrapping guide.
- Upgrades kube_aws_ebs_csi to v1.36 in order to support Karpenter v1 disruption taints and improve node shutdown performance.
- Updates wf_dockerfile_build to support 10 concurrent image builds per module rather than just one.
Added
- Adds a `cdn_mode_enabled` boolean to the kube_vault and kube_authentik modules to enable CDN support.
- Adds an `image_tag_prefix` string input to wf_dockerfile_build.
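A minimal sketch of the two new inputs; the surrounding configuration is illustrative and other required inputs are omitted:

```hcl
module "vault" {
  source = "${var.pf_module_source}kube_vault${var.pf_module_ref}"

  # Serve Vault through a CDN
  cdn_mode_enabled = true

  # ... other required inputs ...
}

module "image_builder" {
  source = "${var.pf_module_source}wf_dockerfile_build${var.pf_module_ref}"

  # Prefix prepended to generated image tags (illustrative value)
  image_tag_prefix = "dev-"

  # ... other required inputs ...
}
```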
Fixed
- Fixed a handful of scheduling constraint bugs that resulted in less-than-optimal resource utilization. These improvements should result in a significant improvement to resource utilization in tiny clusters and a minor improvement in larger clusters.
- Fixed an issue where `pf_stack_version` could not be a commit hash. It can now be any valid git ref.
- Fixed an issue where `pf-wf-git-checkout` would fail when given a branch name as a git ref. This impacted both wf_tf_deploy and wf_dockerfile_build.
edge.24-10-15
Breaking Changes
- This release integrates the new Panfactum provider and removes the need to pass many different variables through the module tree.
  Additionally, we have upgraded OpenTofu to v1.8, which now supports variables in module `source` fields. To take advantage of this, we now pass two new inputs to every module by default: `pf_module_source` and `pf_module_ref`.
  This greatly simplifies the developer experience for first-party modules by removing boilerplate with no loss of functionality.

  Original:

  ```hcl
  module "namespace" {
    source = "github.com/Panfactum/stack.git//packages/infrastructure/kube_namespace?ref=c817073e165fd67a5f9af5ac2d997962b7c20367" #pf-update

    namespace = "example"

    # pf-generate: pass_vars
    pf_stack_version = var.pf_stack_version
    pf_stack_commit  = var.pf_stack_commit
    environment      = var.environment
    region           = var.region
    pf_root_module   = var.pf_root_module
    is_local         = var.is_local
    extra_tags       = var.extra_tags
    # end-generate
  }
  ```

  Simplified:

  ```hcl
  module "namespace" {
    source = "${var.pf_module_source}kube_namespace${var.pf_module_ref}"

    namespace = "example"
  }
  ```

  For more information, see the updated first-party IaC development documentation.
  This does come with a couple of breaking changes:
  - Terragrunt no longer passes the following inputs to modules by default, as they can be accessed via the Panfactum provider:
    - `pf_stack_version`
    - `pf_stack_commit`
    - `environment`
    - `region`
    - `pf_root_module`
    - `is_local`
  - The templating system and `pf-update-iac` have been removed as they are no longer necessary.
- kube_ingress no longer allows `rewrite_rules` to be specified on `ingress_configs`. Instead, there is now a top-level `redirect_rules` variable that has enhanced capabilities:
  - It can pattern match against the entire URL (`https://google.com/some/path`) instead of just the path component (`/some/path`).
  - It can specify whether a permanent or temporary HTTP redirect should be used.
- kube_ingress no longer allows `domains` to be specified on individual `ingress_configs`. Instead, `domains` is now a top-level variable. This provides better compatibility with the new CDN option and prevents confusing behavior in several edge cases. This also better matches the intent of the module: to provide routing rules for a single set of domains, not to provide routing rules for all domains in your system. (See the sketch below.)
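A minimal sketch of the new top-level `domains` input on kube_ingress; the domain is illustrative, and `ingress_configs` plus other required inputs are omitted:

```hcl
module "ingress" {
  source = "${var.pf_module_source}kube_ingress${var.pf_module_ref}"

  # Domains are now declared once at the top level rather than per ingress config
  domains = ["app.example.com"]

  # ... ingress_configs and other required inputs ...
}
```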
Added
- A new module, kube_aws_cdn, has been created that enables setting up a CloudFront distribution (CDN) in front of Ingress resources for improved performance and security as well as reduced server costs. kube_ingress has been updated to support CDN settings.
  Additionally, a non-Kubernetes CDN module, aws_cdn, has also been created.
- A new module, aws_dns_zones, has been created that allows you to create Route53 zones that have a non-AWS registrar.
- Adds the `acl_aws_logs_delivery_enabled` input to aws_s3_private_bucket, which makes it easier to use the bucket for AWS log delivery purposes.
- Adds support for Cloudflare in kube_external_dns and kube_cert_issuers.
Changed
- `tls_1_2_enabled` now defaults to `true` in kube_ingress_nginx in order to support CDNs like CloudFront, which do not yet support TLSv1.3.
Fixed
- The internal logic of aws_dns_records has been updated so that each record is managed independently of the others. This fixes an issue where adding or removing records would cause all records to be recreated. However, this update will cause all records to be recreated one last time.
- `pf-wf-git-checkout` no longer automatically appends `.git` to the end of given repo URLs, as this is incompatible with some git hosting providers (e.g., Azure DevOps). This does mean that the `repo` variable input to wf_tf_deploy and wf_dockerfile_build should be updated to include the `.git` suffix if required for cloning over HTTP.
- Pinned the helm provider version for the `kube_redis_sentinel` submodule.
edge.24-10-09
Added
- Adds a new terragrunt variable, `pf_stack_local_path`, that can be used to deploy local copies of the Panfactum Stack modules. This can be used by developers when testing changes to Panfactum modules on personal infrastructure before submitting pull requests to the Stack repository.
Changed
- Loosened the requirements for the repo variable `repo_url` so that we can now support users on arbitrary git hosting providers (not just GitHub).
- `pf-env-bootstrap` is now idempotent, allowing it to be re-run if it fails in the middle of its initial execution.
Fixed
- Fixes the AMI instance type mismatch when `bootstrap_mode_enabled` is enabled in the aws_eks module.
- Fixes issues that prevented bootstrapping scripts from running with the new `pf-tf-init` logic.
- Adjusts the defaults for kube_reflector so that installation does not fail in the bootstrapping guide.
edge.24-09-30
Added
- Adds a new addon for self-hosted GitHub Action runners.
- Adds the `pf-eks-suspend` and `pf-eks-resume` commands to suspend and resume the EKS cluster.
Fixed
- Fixes an issue where voluntary disruption windows created by the kube_disruption_window_controller would only work for the `argo` namespace. They will now work in all namespaces.
edge.24-09-12
Breaking Changes
- The kube_secrets_csi module has been deprecated and should be removed from your clusters. It was primarily used for managing dynamically generated Vault secrets such as database credentials, but we have switched to a new paradigm that uses the Vault Secrets Operator.
  This saves approximately 150MB of memory per node in the cluster, improves security by removing pods that needed elevated host-level permissions, and provides better ergonomics for managing dynamically generated secrets in our modules.
- kube_pg_cluster's and kube_redis_sentinel's `superuser_username` and `superuser_password` outputs have been renamed to `root_username` and `root_password`, respectively. We made this change because “superuser” implies Vault-generated credentials, which these are not.
- `pf-providers-enable` has been renamed to `pf-tf-init` as it now has expanded functionality:
  - It now influences every module in the directory tree where it is run rather than just the module in the CWD.
  - It now runs `init -upgrade` on every module to update provider versions and download internal submodules when performing Panfactum version upgrades.
  - The runtime speed has been improved in order to accommodate running against many modules at once.
  We have updated the upgrade guide to reflect that `pf-tf-init` should be run every time you upgrade the Panfactum version in an environment.
- You no longer need to manually enable providers via the `providers` array in each `module.yaml`. Our Terragrunt configuration now automatically detects which providers need to be included at runtime.
  No changes are required to take advantage of this new functionality. However, the `providers` Terragrunt input no longer has any functionality, and the `providers` array can be removed from all `module.yaml` files. If this leaves a `module.yaml` empty, the entire `module.yaml` file can be deleted.
Added
- Adds `common_env_from_config_maps` and `common_env_from_secrets` inputs to all standard workload submodules to provide the capability to source environment variables from existing ConfigMaps and Secrets, respectively.
- kube_pg_cluster and kube_redis_sentinel now support using Vault-generated credentials to authenticate from other workloads. See the module documentation for more information.
Fixed
- Adds a controller node preference to pods with `controller_nodes_enabled` set to `true`. This optimizes resource efficiency in the cluster, as we should prefer to fill controller (EKS) nodes before Karpenter nodes since controller nodes are not automatically scaled.
edge.24-09-10
Breaking Changes
- Karpenter has updated its CRD specification, which unfortunately requires manual intervention during the upgrade process. After updating the `pf_stack_version` for any deployments of the `kube_karpenter_node_pools` module, run the following commands in the `kube_karpenter_node_pools` folder:

  ```bash
  pf-providers-enable
  terragrunt state rm kubernetes_manifest.default_node_class \
    kubernetes_manifest.spot_node_class \
    kubernetes_manifest.burstable_node_class \
    kubernetes_manifest.burstable_node_pool \
    kubernetes_manifest.burstable_arm_node_pool \
    kubernetes_manifest.spot_node_pool \
    kubernetes_manifest.spot_arm_node_pool \
    kubernetes_manifest.on_demand_arm_node_pool \
    kubernetes_manifest.on_demand_node_pool
  terragrunt apply --auto-approve
  kubectl delete nodepools burstable burstable-arm on-demand on-demand-arm spot spot-arm
  kubectl delete ec2nc spot burstable on-demand
  ```

  The `kubectl delete` commands may take a few minutes to complete, as this will force all pods to be rescheduled from nodes created using the old CRDs to nodes created using the new CRDs.
- The `ports` input on kube_deployment and kube_stateful_set has been moved to a container-level field rather than a top-level field to better align with the Kubernetes API.
Added
- Adds a new submodule, kube_service, for defining Kubernetes Services that are optimized for the Panfactum Stack. Additionally, integrates `kube_service` into kube_deployment and kube_stateful_set for automatic Service creation.
- Adds the `extra_storage_classes` input to the kube_aws_ebs_csi module.
Fixed
- Addressed an issue in kube_pg_cluster where non-superuser credentials created by Vault would not have access to database schemas other than `public`.
- Addressed an issue where our Terragrunt configuration would cause the version pinning for the `goauthentik/authentik` and `alekc/kubectl` infrastructure providers to be removed. This would cause issues when users ran `terragrunt init -upgrade` to update their lockfiles.
edge.24-09-04
Breaking Changes
- Before applying this release, the `buildkit-amd64` and `buildkit-arm64` StatefulSets in the `buildkit` namespace must be removed (if kube_buildkit is deployed).
- In preparation for our upcoming release, we cleaned up a handful of naming conventions, which impacts the inputs and outputs of several modules:
  - In kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, and kube_workload_utility:
    - `ready_check_`-prefixed fields have been changed to `readiness_probe_` to better align with the actual Kubernetes API.
    - `liveness_check_`-prefixed fields have been changed to `liveness_probe_` to better align with the actual Kubernetes API.
    - `image` and `image_version` have been replaced with `image_registry`, `image_repository`, and `image_tag` to provide a clearer description of each constituent part and better align with ecosystem conventions.
    - `secrets` has been renamed to `common_secrets` to better align with its counterpart, `common_env`.
    - `pod_annotations` has been renamed to `extra_pod_annotations` to better align with its counterpart, `extra_pod_labels`.
    - `readonly` has been renamed to `read_only` to better align with our casing conventions.
    - `read_only_root_fs` has been renamed to `read_only` for better consistency across modules.
    - `instance_type_anti_affinity_required` has been renamed to `instance_type_spread_required` to better reflect that the underlying mechanism is a pod topology spread constraint.
    - `topology_spread_enabled` has been renamed to `az_spread_preferred` to better reflect actual behavior.
    - `topology_spread_required` has been renamed to `az_spread_required` to better reflect actual behavior.
    - `zone_anti_affinity_required` has been renamed to `az_anti_affinity_required` to better align naming conventions with other settings that control scheduling based on availability zone.
    - Renamed Panfactum-provided priority classes to improve semantics (see docs).
  - In kube_pg_cluster and kube_redis_sentinel:
    - `spot_instances_enabled`, `arm_instances_enabled`, and `burstable_instances_enabled` have been changed to `spot_nodes_enabled`, `arm_nodes_enabled`, and `burstable_nodes_enabled` to better align with the inputs of other modules.
  - In kube_constants, a few outputs have been updated:
    - `panfactum_image` has been renamed to `panfactum_image_repository` to better align with naming conventions in other Panfactum modules.
    - `panfactum_image_version` has been renamed to `panfactum_image_tag` to better align with naming conventions in other Panfactum modules.
- We have removed a handful of options from kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, and kube_workload_utility that we would never recommend using:
  - `prefer_spot_nodes_enabled`, `prefer_burstable_nodes_enabled`, `prefer_arm_nodes_enabled`: These scheduling preferences are unnecessary as Karpenter will already prefer the cheapest nodes.
  - `az_anti_affinity_preferred`: `az_spread_preferred` should be used instead.
- When we introduced the concept of the `enhanced_ha_enabled` input, it was designed as a cost-saving switch for direct modules where users do not need to have a deep understanding of the internals. However, it has also found its way into some submodules where it has created ambiguity about module behavior, especially since its impact differs module-to-module. As a result, we have replaced the `enhanced_ha_enabled` input in all submodules with more granular tuning knobs that have clearer behavior. This impacts the following submodules: kube_pg_cluster, kube_redis_sentinel, kube_vault_proxy, kube_argo_event_bus, and kube_argo_event_source.
- Nodes managed by EKS Node Groups (vs Karpenter) are now tainted with `controller=true:NoSchedule`. We have added this taint as pods scheduled on these nodes might be disrupted regardless of their PDBs during EKS upgrades. For some workloads this could cause a disruption. Most workload submodules have a new input, `controller_nodes_enabled`, that can be used to allow your workloads to tolerate this taint if desired.
- Previously, we were conservative about enabling certain features by default in some of our submodules in order to ensure our modules would be compatible with non-Panfactum Kubernetes clusters. However, this is a very niche use case, and we have observed that it results in extra mental overhead for our normal users to avoid missing out on the core features provided by the Panfactum Stack. As a result:
  - The following flags are now enabled by default in kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, kube_pg_cluster, kube_redis_sentinel, and kube_workload_utility:
    - `spot_nodes_enabled`
    - `arm_nodes_enabled`
    - `vpa_enabled`
    - `panfactum_scheduler_enabled`
  - The following flags are now enabled by default in kube_deployment:
    - `az_spread_preferred`
  - The following flags are now enabled by default in kube_stateful_set:
    - `az_spread_required`
    - `instance_type_spread_required`
  - The following inputs are now enabled by default in all modules:
    - `pull_through_cache_enabled`
  - The following inputs are now enabled by default in all direct modules deployed after the autoscaling section in the bootstrapping guide:
    - `vpa_enabled`
    - `panfactum_scheduler_enabled`
Added
- Adds built-in default downward-api integrations in all our workload submodules.
- All mounted ConfigMaps and Secrets in our workload submodules are now mounted as executable to make it easier to mount scripts.
Fixed
- Updates Karpenter and the EBS CSI Controller to prevent any remaining edge cases where nodes were terminated prior to EBS volumes being detached, which would result in six-minute delays for rescheduling stateful pods.
- Removes the `RemoveDuplicates` strategy in kube_descheduler, as users expect to be able to schedule multiple pods of the same controller on the same node when they set `host_anti_affinity_required` to `false`.
edge.24-08-27
Breaking Changes
- We removed the ability to disable S3 backups in kube_pg_cluster. The backups have an extremely low cost impact and significantly improve the durability of data. Moreover, the continuous WAL archiving provided by the backups improves our system's ability to automatically recover in the case of failover events.
  Ultimately, we found that the risk of misuse (resulting in unexpected data loss or downtime) significantly outweighed any potential benefits gained by providing this functionality.
Added
- Added native support for restoring from database backups to the kube_pg_cluster submodule.
- Added automatic creation of an immediate base backup to kube_pg_cluster to ensure that new databases can be recovered all the way up to their point of creation.
Fixed
- Mitigated a rare scenario where disruption in the middle of a database failover would result in the PostgreSQL databases being unable to restart without manual intervention in the kube_pg_cluster submodule.
- Fixed an issue where `pf-get-repo-variables` would provide the wrong directory for the root of the repository when run inside a downloaded `.terragrunt-cache` directory.
edge.24-08-24
Fixed
- Addressed a couple of issues with the kube_authentik module:
- authentik_core_resources will no longer fail to apply and end up in an invalid state when first created.
- Authentik should no longer experience any downtime during database failover events.
edge.24-08-23
Fixed
- Correctly sets PgBouncer permissions on new PostgreSQL cluster creation in kube_pg_cluster.
edge.24-08-22
Breaking Changes
- The default behavior of kube_redis_sentinel was to use both Redis AOF and RDB for persistence. Unfortunately, using AOF concurrently with RDB negates Redis' ability to do partial resynchronizations after restarts and failovers. Instead, a full copy of the entire database must be transferred from the current master to replicas on every restart. This greatly increases the time-to-recover as well as incurs a high network cost.
  In fact, there is arguably no benefit to AOF-based persistence with our replicated architecture, as new Redis nodes will always pull their data from the running master, not from their local AOF. The only benefit would be if all Redis nodes simultaneously failed with a non-graceful shutdown (an incredibly unlikely scenario).
  As a result, we have switched the module to use only RDB for persistence, and the `redis_appendfsync` input has been removed. The module still provides the ability to provide custom Redis configuration, so you can re-enable AOF if you want (though we would not advise it).
- `token_lifetime_seconds` has been changed to `token_lifetime_hours` in vault_auth_oidc to avoid a perpetual diff issue present in the Vault provider.
- Removed the daily backups from kube_velero as they were undocumented and had no realistic use case.
Added
- Adds a new submodule, kube_disruption_window_controller, which can be used to specify time-based disruption windows for disruption-sensitive workloads (e.g., databases). Disruption window capabilities have also been added to kube_pg_cluster and kube_redis_sentinel.
- Adds synchronous replication support to kube_pg_cluster via `pg_sync_replication_enabled`.
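A minimal sketch of enabling synchronous replication on a kube_pg_cluster deployment; other required inputs are omitted:

```hcl
module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  # Require replicas to acknowledge writes before they are committed
  pg_sync_replication_enabled = true

  # ... other required inputs ...
}
```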
Fixed
- Addressed an issue where `pg_smart_shutdown_timeout` could not be set to 0 in kube_pg_cluster without having CNPG reset it to 180.
- Fixed an issue in kube_velero where stale EBS snapshots were not being deleted.
- Added stricter disruption prevention to the Velero server in kube_velero, as disrupting the server in the middle of a backup operation would cause it to fail and not be resumed.
edge.24-08-15
Breaking Changes
- `pg_shutdown_timeout` has been renamed to `pg_smart_shutdown_timeout` to better indicate its purpose in kube_pg_cluster. Additionally, the shutdown and failover logic has been overhauled. The new default will immediately terminate running queries when a database pod is killed, but this serves to reduce the downtime from 60-120 seconds to < 5 seconds in the failover scenario. Please see the module documentation for more information.
Added
- Adds the concept of passthrough parameters to wf_spec.
- Makes `tf_apply_dir` a Workflow parameter in wf_tf_deploy so that you only need a single instance of this module per cluster.
- Adds the ability to use `templateRef` to compose Workflows in wf_spec.
Fixed
- Fixed the working directory in wf_tf_deploy and wf_dockerfile_build to be inside the cloned repository.
- Addressed OOM errors when using resource templates with wf_spec.
edge.24-08-13
Breaking Changes
- `pg_storage_increase_percent` has been changed to `pg_storage_increase_gb` in kube_pg_cluster. This allows for more predictable storage autoscaling and optimal resource provisioning regardless of the current storage scale.
- `pg_storage_gb` has been changed to `pg_initial_storage_gb` in kube_pg_cluster. This better indicates that this value is only used during the initial database provisioning and has no effect thereafter. (See the sketch after this list.)
- `node_vpc_id`, `node_subnets`, and `node_security_group_id` have been moved from kube_karpenter to kube_karpenter_node_pools in order to simplify the logic of assigning nodes to subnets, VPCs, and security groups. Additionally, we have removed the Karpenter auto-discovery tags as they are no longer necessary.
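A minimal sketch of the renamed storage inputs referenced above; the values are illustrative and other required inputs are omitted:

```hcl
module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  # Only used when the database is first provisioned
  pg_initial_storage_gb = 20

  # Grow the disk by a fixed number of GB on each autoscaling event
  pg_storage_increase_gb = 10

  # ... other required inputs ...
}
```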
Added
- Adds new enhancements to the kube_pg_cluster module:
  - Better defaults and options for memory tuning
  - Provides the ability to set arbitrary PostgreSQL parameters
  - Provides the ability to set a custom backup schedule
  - Adds support for additional schemas via the `extra_schemas` input (see the sketch after this list)
- Adds another local retry for Terragrunt when providers produce an inconsistent final plan.
- Adds a check for an updated `direnv` version to prevent issues when setting up the local devenv.
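A minimal sketch of the new `extra_schemas` input referenced above, assuming it accepts a simple list of schema names; the names are illustrative and other required inputs are omitted:

```hcl
module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  # Additional schemas to create alongside the default "public" schema (illustrative names)
  extra_schemas = ["analytics", "reporting"]

  # ... other required inputs ...
}
```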
Fixed
- Added deterministic ordering to additional resources in authentik_core_resources.
- Fixed the following bugs in `pf-env-bootstrap`:
  - It would use a non-existent AWS profile for the `.sops.yaml` file.
  - It would not install all the platform checksums in the `.terraform.lock.hcl` files.
- `amd64` nodes are now used when `bootstrapping_enabled` is `true` in aws_eks in order to allow certain bootstrapping tests (e.g., Cilium) to run successfully.
- Restores the `pf-db-tunnel` command to the devenv.
- `pf-get-version-hash local` now properly returns `local` without an error code.
- Updates the Panfactum image version in kube_constants to a version that is compatible with the latest pre-built workflows.
edge.24-08-12
Breaking Changes
- Repository variables must now be defined in a `panfactum.yaml` file located at the root of your repository instead of in your `devenv.nix`. Additionally, the variable names are no longer prefixed with `PF_` and are lowercase.
  For example, `env.PF_REPO_NAME` in `devenv.nix` should now be defined as `repo_name` in `panfactum.yaml`.
  This change was made to make it easier to reference these values outside of local development contexts, such as within CI pipelines where `devenv.nix` isn't loaded.
Added
- We have provided two new addons, a Workflow Engine (Argo Workflows) and an Event Bus (Argo Events).
- We have created a guide and best practices for setting up CI / CD in the Panfactum Stack.
- To support the new addons, we are upgrading the following infrastructure modules to Beta status:
  - kube_argo: For deploying the Argo controllers
  - kube_argo_event_bus: For deploying an Argo EventBus
  - kube_argo_event_source: For deploying an Argo EventSource
  - kube_argo_sensor: For deploying an Argo Sensor
  - wf_spec: For creating an Argo Workflow specification
  - wf_tf_deploy: For creating an Argo WorkflowTemplate that deploys IaC modules
  - wf_dockerfile_build: For creating an Argo WorkflowTemplate that builds container images from Dockerfiles
- Adds `pf-get-repo-variables`, which prints a JSON payload of all repository configuration variables with the appropriate defaults set.
edge.24-07-08
Breaking Changes
- We have made a small, breaking refactor of aws_eks to reduce unnecessary options that made onboarding and maintenance more difficult:
  - Most importantly, users are no longer able to set the instance type and count for nodes in EKS node groups. This flexibility is unnecessary since node provisioning is handled by Karpenter and not EKS. Moving forward, there are just two static configurations that are guaranteed to work in all use cases: one for before autoscaling is installed and one for after. This is controlled via the new input, `bootstrap_mode_enabled` (default: `false`).
  - `control_plane_version` and `controller_node_kube_version` have been unified into a single variable called `kube_version` that applies to all subsystems.
  - `controller_node_subnets` has been renamed to `node_subnets` to indicate these subnets are used for all cluster nodes, not just the EKS node groups.
  - `all_nodes_allowed_security_groups` has been renamed to `node_security_groups` to align naming conventions.
- By default, PVCs created by controllers such as StatefulSets cannot be updated through their controller, as their template (`volumeClaimTemplates`) is immutable (a Kubernetes limitation). This poses a challenge when needing to update PVC metadata such as annotations and labels. We have built a workaround to this (kube_pvc_annotator) and incorporated it in various Panfactum modules. Unfortunately, incorporating this enhancement requires redeploying StatefulSets.
  To complete this upgrade, perform the following steps:
  1. Create a Velero backup of the cluster by running `velero create backup -w <backup_name>` to recover in case of mistakes.
  2. The following StatefulSets need to be deleted in this order AND with `kubectl delete --cascade=orphan` AND immediately restored with a subsequent `terragrunt apply` to their defining module:
     - The Vault StatefulSet created by `kube_vault`
     - The Redis cluster StatefulSet for Authentik created by `kube_authentik`
     - The BuildKit StatefulSets created by `kube_buildkit`
     - Any StatefulSets you have provisioned with kube_stateful_set
     - Any Redis cluster StatefulSets you have provisioned with kube_redis_sentinel
     As long as you use `--cascade=orphan` and take care to minimize the time between the `kubectl delete` and `terragrunt apply`, there will not be any downtime during this operation.
  3. After completing this operation, you need to delete the backing PVCs from each module one at a time by deleting the PVC and then deleting its bound pod. The controller will then automatically provision a new PVC with the correct labels and annotations to take advantage of the new functionality.
     After deleting each pod, ensure that a new pod is automatically provisioned and becomes healthy before proceeding to the next. As long as you proceed one at a time, this will not cause any downtime or data loss.
  4. Delete the Velero backup you created in step 1 by running `velero delete backup <backup_name>`.
Added
- Adds kube_fledged to the core stack. The kube-fledged controller adds the ability to pre-pull images to every node to improve pod startup times for critical or frequently used containers such as the Linkerd proxy or database images. We provide instructions for installing this module here.
- Adds the kube_pvc_annotator submodule that will provision a CronJob to run `pf-set-pvc-metadata` against PVCs created by immutable templates. See the module documentation for potential use cases.
- Adds `persistence_backups_enabled` (default: `true`) to kube_redis_sentinel to support disabling EBS snapshot backups.
- Adds a new common variable, `node_image_cache_enabled`, to Panfactum modules that can be used to enable pre-pulling images to nodes via the `kube_fledged` operator.
- Adds the `pf-buildkit-clear-cache` command for removing any BuildKit caches not being used by an active image build job.
- Adds the `pf-set-pvc-metadata` utility command for syncing labels and annotations across groups of PVCs.
Fixed
- Fixes handling of public ECR registries in `docker-credential-panfactum`.
- Fixes handling of ECR token caching in `docker-credential-panfactum`.
- Fixes `pf-get-open-port` to be platform-agnostic.
- Fixes `pf-get-version-hash` to work with commit hash inputs.
- Fixes image paths in the Authentik dashboard for applications provisioned by Panfactum modules.
edge.24-07-01
Breaking Changes
- The input format to aws_ecr_repos has been reformatted to support better per-repository configuration. This should not require replacing any resources, but it will require updating your Terragrunt inputs.
- The following resources will no longer be tagged with the Panfactum version and commit hash, as these tags cause unnecessary delays and disruptions during updates for little added value:
  - EC2 instances in EKS node groups generated by aws_eks
  - EC2 instances serving as NAT hosts in aws_vpc
  - KMS replica keys in aws_kms_encrypt_key
  - Pods created in kube_bastion
Added
- kube_buildkit has graduated to beta and is now ready for general consumption. This is the first stack addon that can be used to extend the behavior of the core stack. Installation and usage instructions can be found here.
- aws_ecr_repos now supports custom image expiration rules and both pull and push permissions.
- aws_ecr_public_repos has been added to support creating public ECR repositories.
- Adds ARM support in kube_bastion and kube_pvc_autoresizer. All core cluster components can now be run on both amd64 and arm64 nodes, allowing for optimal cost savings.
- Changes the default `securityContext.fsGroupChangePolicy` to `OnRootMismatch` for Pods created by Panfactum submodules in order to improve PVC mounting performance.
- `pf-providers-enable` now ensures that `.terraform.lock.hcl` files have all common platform checksums.
- Adds `pf-get-terragrunt-variables`, which can be used to derive the Terragrunt variables that would be used if Terragrunt were run in the given directory.
- Adds `pf-tf-delete-locks`, which can be used to bulk-release Tofu state locks.
- Adds `pf-sops-set-profile`, which will update all sops-encrypted files in the given directory to use the indicated AWS profile for KMS operations. This can be used in CI pipelines to allow the CI user to access sops-encrypted files.
- (Alpha) Adds kube_argo_sensor and kube_argo_event_source submodules for deploying these core components of the Argo Events system.
- (Alpha) Adds the kube_workflow_spec submodule to help in defining production-ready Argo Workflows.
Fixed
kube_aws_ebs_csi has been adjusted to ensure that PVCs are detached from nodes during node shutdown, preventing unnecessary delays in moving PVCs between nodes.
kube_core_dns no longer accidentally includes the Vault provider.
kube_ingress_nginx will no longer unnecessarily set browser security headers on `3xx` responses or responses that do not have `Content-Type` headers.
edge.24-06-20
Breaking Changes
- kube_karpenter has upgraded the Karpenter version to `v0.37`. During this release cycle, the Karpenter team moved the CRDs required by Karpenter to a dedicated Helm chart to improve the upgrade ergonomics. Unfortunately, this introduces a few one-time manual steps that you must perform to enable the migration. Specifically, the following commands must be run against your cluster before applying the latest version of `kube_karpenter`:

  ```
  kubectl label crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh app.kubernetes.io/managed-by=Helm --overwrite
  kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-name=karpenter-crd --overwrite
  kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-namespace=karpenter --overwrite
  ```
- kube_karpenter_node_pools has a new input, `node_labels`, which defines what labels will be applied to generated nodes. The standard Panfactum labeling system will no longer apply to Karpenter nodes due to this upstream issue.
- The `persistence_enabled` option was removed from kube_redis_sentinel. Redis is now always deployed with persistence enabled. This decision was made because the cross-AZ network costs of re-instantiating Redis nodes without PVC storage dwarf the costs of the PVC storage (by a factor of 100x). As a result, there is no benefit to not periodically saving the Redis database to a persistent disk.

  To compensate for potential performance impacts, we have exposed another input, `redis_appendfsync`. Setting this to `"no"` will achieve the same performance as having persistence disabled; however, the default setting of `"everysec"` is likely sufficient for the vast majority of use cases and reduces the risk of data loss (a sketch follows this section's list).

  Unfortunately, if you were previously running with `persistence_enabled` set to `false`, you will need to delete the Redis StatefulSets in order to apply the new module. In particular, this impacts the `kube_authentik` module. Before deleting the Redis StatefulSet for Authentik, ensure your Vault token is not expired, as you will not be able to re-authenticate with Authentik while the Redis StatefulSet is removed.

  Since `persistence_enabled` should only have been used in scenarios where data retention was not important, this should be considered a safe operation. However, it will introduce a minor service disruption during the replacement period.
- aws_ecr_pull_through_cache_addresses has been refactored to improve the ergonomics of using the module. It now requires an input, `pull_through_cache_enabled`, and will output the correct registry names regardless of whether a pull-through cache is used.
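
As referenced above, here is an illustrative sketch of the new `redis_appendfsync` input in a kube_redis_sentinel Terragrunt configuration (only the `inputs` block is shown; your existing `include`/`terraform` blocks are unchanged):

```hcl
# kube_redis_sentinel/terragrunt.hcl (illustrative)
inputs = {
  # "no" matches the performance profile of the removed persistence_enabled = false mode;
  # the default "everysec" is the safer choice for most workloads
  redis_appendfsync = "no"
}
```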
Added
kube_deployment, kube_stateful_set, kube_cron_job, and kube_pod have graduated to Beta status. They are now safe to use.
Adds the `pf-providers-enable` command that will automatically inspect the source infrastructure module and enable the required providers in a module's `module.yaml`.
Adds the `pf-update-iac` command that will update first-party infrastructure modules in the following ways:
- Executes the templating directives.
- Updates the `ref` in sourced Panfactum submodules to the commit hash of the devenv if the `# pf-update` annotation is provided. See the documentation for more details.
Adds phone number validation in aws_account.
Adds a `cors_enabled` (default: `false`) input variable to kube_vault that can enable CORS handling (see the sketch below). This can be useful when building web applications that interact with Vault in client-side JavaScript. By default, this will allow CORS requests from all sibling and child domains.
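
A minimal, illustrative sketch of enabling the new input in a kube_vault Terragrunt configuration (only the `inputs` block is shown):

```hcl
# kube_vault/terragrunt.hcl (illustrative)
inputs = {
  # Allow CORS requests from sibling and child domains (default: false)
  cors_enabled = true
}
```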
Fixed
Addresses an issue in kube_authentik that prevented the SSO login pop-up from working.
Implements custom CORS handling logic in kube_ingress that resolves issues in the default behavior provided by the NGINX ingress controller.
Removes invalid failure cases when using `pf-get-vault-token` in Terragrunt and improves failure messaging.
Fixes an issue that occurs when the `kubernetes` provider is enabled but the sourced module does not use the `kubectl` provider.
Fixes failure cases in `pf-env-scaffold` and adds more debug logging.
edge.24-06-14
Added
Adds kube_scheduler, an alternative Kubernetes scheduler that can be used to improve bin-packing of pods on nodes in the Kubernetes cluster. This allows for better, smaller node selection, and our tests show an estimated 25-33% reduction in node costs when used. We provide instructions for installing it here.
Adds a `panfactum_scheduler_enabled` (default: `false`) input to most infrastructure modules. When enabled, this will use the scheduler provided by kube_scheduler instead of the less efficient EKS scheduler (see the sketch below).
If `panfactum_scheduler_enabled` is `true`, the kube_descheduler will automatically remove pods from low-utilization nodes to allow the kube_scheduler to bin-pack them on other nodes.
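
A hedged sketch of opting a workload module into the new scheduler via its Terragrunt inputs (the module shown is an arbitrary example; only the `inputs` block is included):

```hcl
# kube_pg_cluster/terragrunt.hcl (illustrative example module)
inputs = {
  # Use the bin-packing scheduler from kube_scheduler instead of the EKS default (default: false)
  panfactum_scheduler_enabled = true
}
```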
Fixed
Addresses a bug in the previous release that made kube_karpenter undeployable.
Addresses an issue where nodes were limited to a hard cap of 29 pods.
Configures Kubernetes nodes to use a fixed amount of system overhead rather than one that scales unnecessarily with node size.
edge.24-06-13
Added
Updates kube_pg_cluster with many new variables for configuring PgBouncer. New variables are prefixed with `pgbouncer_`.
Adds support for `path_prefix` to kube_vault_proxy. (@mschnee)
Adds a new `enhanced_ha_enabled` input to many core modules (default `true`). Setting this to `false` will allow for additional cost savings (approximately $50 / month) in exchange for introducing a small possibility of temporary outages. We estimate that setting this to `false` reduces availability from 99.995% to 99.9%. This can be used to decrease costs in less critical clusters (e.g., `development`); see the sketch at the end of this section.
Adds a Spot Data Feed to the aws_account module.
Adds the kube_open_cost module for calculating the cost of workloads running on Kubernetes.
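
For example, a development cluster could trade a small amount of availability for cost savings by setting the new flag in a core module's Terragrunt inputs (illustrative only; your `include`/`terraform` blocks are omitted):

```hcl
# terragrunt.hcl for a core module in a development cluster (illustrative)
inputs = {
  # Roughly 99.995% -> 99.9% availability in exchange for ~$50 / month in savings
  enhanced_ha_enabled = false
}
```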
Fixed
Addressed an issue in aws_vpc where NAT nodes wouldn't restart if NAT setup failed with an exit code other than `1`.
Increased the memory floor of the Authentik server in kube_authentik to avoid OOM issues.
Updates kube_authentik to allow showing Gravatar profile images.
Updates kube_authentik to provide the necessary Permissions-Policy headers to allow use of WebAuthn devices.
Correctly applies pod labels in kube_aws_lb_controller.
Removes node preference defaults from kube_workload_utility that were preventing efficient node deprovisioning.
Adjusts the VPA recommendation overhead from 30% to 15% to improve resource utilization.
Fixes incorrect SCIM property mapping in authentik_aws_sso.
Aligns pod labels, affinities, topologySpreadConstraints, and tolerations in kube_linkerd to the conventions used in all other modules.
edge.24-06-08
Added
Updates aws_vpc to support the new `pf-vpc-network-test` command that will verify network connectivity properties of the instantiated VPC. This allows us to simplify an otherwise complex validation step in the bootstrapping guide.
Adds the `pf-env-bootstrap` command that automatically bootstraps the necessary resources to begin working with IaC in an environment. This replaces the manual steps that used to be a part of the bootstrapping guide.
Adds a new `extra_inputs` Terragrunt variable that allows you to pass inputs to all modules in the current scope.
Adds arm64 NodePools and arm64 support for the core components. This reduces the cost of running the base stack by $25 - 50 / month due to significantly better price / performance ratios for arm64 instances in AWS.
Sets `unhealthyPodEvictionPolicy` to `AlwaysAllow` for all module PDBs. This will allow the system to scale up more quickly when it is running up against resource pressure and pods become stuck in a temporary crash loop.
Sets the maximum node lifetime to 24h to force Karpenter to try to consolidate instances at least once per day.
Fixed
Addressed an issue where the `aws-ebs-csi-driver` DaemonSet pods would not be properly terminated by Karpenter during node shutdown. This resulted in EBS volumes not being detached and introduced an unnecessary 6-minute delay when moving EBS volumes between nodes.
Replaces most usages of `kubernetes_manifest` with `kubectl_manifest` to avoid manifest type-parsing issues that prevent dynamic values in manifests.
edge.24-06-06
Breaking Changes
- kube_trust_manager has been deprecated as its functionality was redundant with kube_reflector. We are keeping the module in the repo to support backwards compatibility, but it will be removed in the future. You should perform the following steps to remove it:
  - Apply this release.
  - Remove any dependency blocks to it in your `terragrunt.hcl` files.
  - Run `terragrunt destroy` on the module to remove it.
  - Delete the `bundles` CRD.
Added
aws_registered_domains can now set the contact type for each contact.
Allows users to reference availability zones by a single character (e.g., `a`) in addition to the full name (e.g., `us-east-2a`) in the aws_vpc module.
The manual steps needed to reset new EKS clusters to a clean slate during the bootstrapping guide have been consolidated into a single new command, `pf-eks-reset`.
Fixed
Addressed an issue in aws_vpc that caused a temporary, harmless error to crash the `terragrunt apply` on initial bootstrapping.
Fixed an issue where Cilium test suites would fail during bootstrapping due to a NetworkPolicy blocking the kube_core_dns module.
edge.24-06-04
Breaking Changes
The reloader deployment must be deleted before the next apply of kube_reloader. No inputs have changed.
The alpha module `kube_labels` has been removed in favor of the labels provided by kube_workload_utility.
VPC flow logs in aws_vpc are now disabled by default as they can be fairly expensive and should only be used if you have a specific use case in mind. They can be enabled by setting `vpc_flow_logs_enabled` to `true` (see the sketch below).
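
If you do have a specific need for flow logs, a minimal sketch of re-enabling them (illustrative; only the `inputs` block is shown):

```hcl
# aws_vpc/terragrunt.hcl (illustrative)
inputs = {
  # Flow logs are now disabled by default; re-enable only if you need them
  vpc_flow_logs_enabled = true
}
```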
Added
Added a new `pf-env-scaffold` script that takes care of setting up the `PF_ENVIRONMENTS_DIR` in the bootstrapping guide section for setting up Terragrunt.
Added kube_workload_utility to make it easier to create uniform, production-hardened Pod specs that take advantage of all capabilities included in the Panfactum stack.
A new standard label, `panfactum.com/workload`, can be used to group replicated pods for the purpose of aggregating metrics. This is now applied in all core infrastructure modules.
Added kube_constants, which exports static configuration values that can be useful when creating resources that run on clusters in the Panfactum stack.
kube_cert_manager will now automatically delete Certificate secrets if the Certificate is deleted.
aws_ses_domain now takes an optional input, `smtp_allowed_cidrs`, that restricts which IPs can use the generated SMTP credentials. This allows users to mitigate credential exfiltration attacks. We provide an example of how to use this here; a minimal sketch also follows at the end of this section.
The Vault login UI will now have the OIDC login as the default method.
Terragrunt will now automatically retry on some errors up to three times before exiting the process with a failure. This should address intermittent issues such as network disruptions or race conditions.
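
A minimal sketch of restricting SMTP credential use, assuming a hypothetical CIDR range for your egress IPs (only the `inputs` block is shown):

```hcl
# aws_ses_domain/terragrunt.hcl (illustrative)
inputs = {
  # Hypothetical CIDR; replace with the ranges your workloads actually send email from
  smtp_allowed_cidrs = ["10.0.0.0/16"]
}
```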
Fixed
`.env` files are now properly loaded into the shell environment, and changes will trigger fast reloads instead of full devenv re-evaluations.
Temporarily adds `GIT_CLONE_PROTECTION_ACTIVE=false` to the shell environment in order to address this issue. Note that this only disables new bleeding-edge security features which were accidentally shipped in a broken state.
Adjusts the base resource requests of core infrastructure modules to prevent temporary OOM errors when bootstrapping before the VPA takes effect.
kube_authentik now respects the `log_level` input.
Sets `max_history` to `5` for all Helm charts to prevent overloading the Kubernetes API server with an ever-growing amount of historical Helm deployments.
edge.24-06-02
Breaking Changes
- Upgraded to devenv 1.0. As a part of this upgrade, `.env` file values can no longer be referenced directly inside `.nix` files.
Added
Updated kube_redis_sentinel to automatically limit client buffer size to prevent OOM issues when processing very bursty traffic.
Added the `pf-update` command that runs all the repository scaffolding commands at once.
Fixed
- Addressed an issue that caused updates to the local devenv to take at least 10 minutes to rebuild on macOS. Rebuilds should now be 10-15x faster, but they will still take about 45 seconds at minimum. Note that this only impacts rebuilds and not normal direnv load times, which should still be instant. This is a known limitation of upstream nix's derivation evaluation caching when using flakes. We expect this to be addressed when flakes reach stability.
- Added missing defaults for `PF_ENVIRONMENTS_DIR` and `PF_IAC_DIR`.
- Resolves an issue where devenv warnings could not be resolved during the initial bootstrapping guide.
- Added extra validation for the Terragrunt variable `extra_tags`. Invalid characters will now be replaced with `.` for both keys and values, for both Kubernetes labels and AWS tags.
- Fixed some core components that were using all Kubernetes labels for `labelSelector` matching rules, which prevented Karpenter from autoscaling when `extra_tags` was provided. This previously manifested as the error `spec.requirements: Too many: #: must have at most 30 items`.
- Added extra constraints to kube_external_dns to prevent it from attempting to query zones that it isn't managing.
- Prevented kube_external_dns from excluding parent domains of included domains.
edge.24-05-30
Breaking Changes
- The default for `vault_storage_size_gb` in kube_vault has been changed from `20` to `2` in order to improve resource utilization. If you created Vault with the old default, you will need to manually set `vault_storage_size_gb` to `20`, as volume sizes cannot be reduced after creation (see the sketch below).
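
If you created Vault with the old default, a sketch of pinning the original size explicitly (illustrative; only the `inputs` block is shown):

```hcl
# kube_vault/terragrunt.hcl (illustrative)
inputs = {
  # Keep the pre-existing volume size; EBS volumes cannot be shrunk after creation
  vault_storage_size_gb = 20
}
```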
Added
(Alpha) Added the Loki logging backend via kube_logging and the Alloy log collector via kube_alloy.
The PVC Autoresizer has been added via the kube_pvc_autoresizer module in order to automatically expand EBS volumes as they fill up. We provide the guide for deploying it here.
Added validation for phone number format in aws_registered_domains. (@wesbragagt)
Fixed
- Resolved an issue where scheduling constraints could not be satisfied for components deployed before Karpenter (#41)
edge.24-05-23
Breaking Changes
- We have removed the EKS CoreDNS addon and replaced it with the kube_core_dns module in order to provide better guarantees about the behavior of DNS in the Panfactum stack. In order to migrate:
  1. Add the `dns_service_ip` input to aws_eks deployments by following this guide. Double-check that the `dns_service_ip` is the same IP as defined by `kube-system/kube-dns`. Additionally, set `core_dns_addon_enabled` to `true`.
  2. Apply the updated `aws_eks` module.
  3. Add the `cluster_dns_service_ip` input to your kube_karpenter_node_pools module like this, and re-apply the module. Ensure that all of your nodes have been replaced with the new configuration.
  4. Deploy `kube_core_dns` by following this guide. Note that this deployment will fail as the original addon service is still running and the IP is already taken.
  5. Delete `kube-system/kube-dns` and re-apply `kube_core_dns`. Note that while the service is deleted, DNS will be temporarily unavailable in your cluster.
  6. Once you've validated that DNS is working in the cluster, remove the `core_dns_addon_enabled` input from the `aws_eks` module and re-apply.
- We have stabilized the label selectors in kube_pod, but this requires one final label update for already-deployed Deployments. This will cause re-applies of kube_bastion to fail (and any first-party modules that rely on kube_deployment). To resolve, you must first manually delete the `bastion/bastion` deployment (and all other deployments created by kube_deployment).
- kube_pg_cluster has two new flags, `pgbouncer_read_only_enabled` (default `false`) and `pgbouncer_read_write_enabled` (default `true`), which will enable the `r` and `rw` poolers, respectively. This will enable users to better control what is deployed so as not to have idle resources. This is a breaking change as `pgbouncer_read_only_enabled` is set to `false` by default (see the sketch below).
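
A hedged sketch of the new pooler flags in a kube_pg_cluster Terragrunt configuration (only the `inputs` block is shown; these values simply restore the previous behavior):

```hcl
# kube_pg_cluster/terragrunt.hcl (illustrative)
inputs = {
  # Re-enable the read-only ("r") pooler, which is now off by default
  pgbouncer_read_only_enabled  = true
  # The read-write ("rw") pooler remains enabled by default
  pgbouncer_read_write_enabled = true
}
```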
Added
- (Alpha) We've added a monitoring stack, kube_monitoring, which includes HA Prometheus, the Prometheus Operator, Thanos metrics storage on S3 (with deduplication, caching, and down-sampling), the Node Exporter, kube-state-metrics, Alertmanager, and Grafana (with SSO enabled and 20+ custom dashboards). Additionally, most modules now have an additional `monitoring_enabled` (default `false`) flag that can be turned on to begin shipping data to Prometheus for viewing and querying via Grafana (see the sketch at the end of this section).
- (Alpha) kube_cilium now has a new debugging mode, `hubble_enabled` (default `false`), that will capture extensive TCP-level metrics about the cluster as well as expose a debugging UI via HTTPS.
- (Alpha) kube_linkerd now deploys Linkerd Viz when `monitoring_enabled = true`. This provides a service mesh dashboard and the ability to capture and introspect raw HTTP requests sent in realtime.
- (Alpha) We've added the Argo Workflow engine to the stack via the kube_argo module. This will serve as the basis for the future, integrated CI / CD systems and can also be used to process arbitrary events from event queues such as AWS SNS/SQS and Kafka. (@jlevydev)
- A new module, kube_vault_proxy, can be used to add SSO to web assets that do not have integrated SSO. The module's SSO is configured out-of-the-box to work with the cluster's Vault instance.
- We've included a new Kubernetes provider, kubectl, to augment the original kubernetes provider. The `kubectl` provider allows more flexibility in deploying raw Kubernetes manifests, which is required by our templating system. This provider will automatically be enabled when the `kubernetes` provider is enabled, so no additional changes are required from end users.
- kube_redis_sentinel has a new flag, `lfu_cache_enabled`, that will configure the Redis cluster to automatically evict records under memory pressure based on an approximated Least Frequently Used algorithm.
- kube_ingress now takes an `extra_configuration_snippet` variable which allows for additional commands in the NGINX configuration snippet.
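
As an illustrative sketch, turning on metrics shipping for a module that exposes the new flag (only the `inputs` block is shown; the module path is an assumption):

```hcl
# terragrunt.hcl for any module exposing the flag (illustrative)
inputs = {
  # Ship metrics to the (alpha) kube_monitoring stack (default: false)
  monitoring_enabled = true
}
```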
Changed
Added the standard Restricted Reader role to Vault instances (`rbac-restricted-reader`) and updated vault_auth_oidc to take `restricted_reader_groups`. Since cluster resources authenticate with SSO via Vault, this allows restricted readers to access additional cluster resources such as Grafana and Argo Workflows (albeit in a locked-down, read-only mode).
Disabled evictions of database pods based on max lifetimes. This improves the stability of databases deployed by Panfactum modules.
After completing the bootstrapping guide, we now recommend that users update their `aws_eks` cluster modules to have `controller_node_count` set to `1` and `controller_node_instance_types` set to `["t3a.medium"]` (see the sketch at the end of this section). This will decrease the costs of the base cluster by about 40% without impacting cluster availability or resiliency. The single remaining node is used primarily as a place for Karpenter to run (Karpenter cannot run on instances that it itself provisions).
kube_karpenter now only deploys a single instance of Karpenter and enforces that it runs on a controller node. This reduces the overall resource utilization of this fairly heavyweight controller.
Kubernetes labels applied via the `extra_tags` Terragrunt input are now sanitized for valid characters automatically (invalid characters are replaced with `.`). (@mschnee)
Added scheduling constraints to prevent critical workloads from scheduling all pods on the same instance type in order to minimize the possibility of disruption from events that only affect one instance type (e.g., spot node preemption).
Changes many other non-critical core controllers to only have a single replica when 100% uptime is not necessary in order to reduce resource utilization in the Stack.
Updates many controller deployments to use the Recreate deployment strategy to improve the timing and efficiency of applying Panfactum upgrades.
kube_vpa has a new `history_length_hours` input (default `24`) that controls how far back it will analyze metrics when computing its recommendations.
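
A sketch of the recommended post-bootstrap settings in the aws_eks Terragrunt configuration (illustrative; only the `inputs` block is shown):

```hcl
# aws_eks/terragrunt.hcl (illustrative)
inputs = {
  # Shrink the static controller node pool once Karpenter is managing cluster capacity
  controller_node_count          = 1
  controller_node_instance_types = ["t3a.medium"]
}
```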
Fixed
- PVCs for postgres instances were inadvertently created with duplicated entries for accessModes. This has no functional impact, but confused monitoring systems. This has been fixed, but the fix will not retroactively adjust existing PVCs as they are immutable.
edge.24-05-15
Breaking Changes
- kube_vault now takes `vault_domain` as an input instead of `environment_domains`. This change was made as having multiple domains for Vault is incompatible with using Vault as an intermediary IdP (see the sketch below).
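
A minimal sketch of the new input, using a hypothetical domain (only the `inputs` block is shown):

```hcl
# kube_vault/terragrunt.hcl (illustrative)
inputs = {
  # Hypothetical domain; replaces the old environment_domains list
  vault_domain = "vault.example.com"
}
```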
Added
New kube_reflector module for deploying the Reflector in order to synchronize ConfigMaps and Secrets across namespaces. Created a new guide section for deploying the module as a part of the foundational Stack.
Added a `pg_shutdown_timeout` variable to kube_pg_cluster to control how long the postgres instances will wait for active connections to close before shutting down.
Fixed
- Fixed an issue where simultaneous, graceful shutdown of all postgres nodes in a kube_pg_cluster would cause unnecessary downtime when the primary was running on a spot instance.
edge.24-05-12
The initial edge release of the Panfactum stack!