Edge Releases

Edge releases do not receive patches or make any backwards-compatibility guarantees. Learn more here.

To use stable Panfactum releases, please see our available licenses.

To upgrade your Panfactum stack version, please follow the instructions in the upgrade guide.

edge.24-09-12

Breaking Changes

  • The kube_secrets_csi module has been deprecated and should be removed from your clusters. It was primarily used for managing dynamically generated Vault secrets such as database credentials, but we have switched to a new paradigm that uses the Vault Secrets Operator.

    This saves approximately 150MB of memory per node in the cluster, improves security by removing pods that needed elevated host-level permissions, and provides better ergonomics for managing dynamically generated secrets in our modules.

  • kube_pg_cluster's and kube_redis_sentinel's superuser_username and superuser_password outputs have been renamed to root_username and root_password, respectively. We made this change because "superuser" implies Vault-generated credentials, which these are not.

  • pf-providers-enable has been renamed to pf-tf-init as it now has expanded functionality:

    • Now influences every module in the directory tree where it is run rather than just the module in the CWD.
    • Now runs init -upgrade on every module to update provider versions and download internal submodules when performing Panfactum version upgrades.
    • The runtime speed has been improved in order to accommodate running against many modules at once.

    We have updated the upgrade guide to reflect that pf-tf-init should be run every time you upgrade the Panfactum version in an environment.

  • You no longer need to manually enable providers via the providers array in each module.yaml. Our Terragrunt configuration now automatically detects which providers need to be included at runtime.

    No changes are required to take advantage of this new functionality. However, the providers Terragrunt input no longer has any functionality and the providers array can be removed from all module.yaml files. If this leaves a module.yaml empty, the entire module.yaml file can be deleted.
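
As a cleanup aid, the following is a minimal sketch for locating module.yaml files that still declare a providers array. It assumes the array is written as a top-level `providers:` key; the function name is hypothetical and not part of the Panfactum tooling.

```shell
# Hypothetical helper: list module.yaml files under a directory that still
# contain a top-level `providers:` key (assumed layout of the providers array).
find_module_yaml_with_providers() {
  local root="${1:-.}"
  # grep -l prints only the names of the files that match.
  find "$root" -type f -name module.yaml -exec grep -l '^providers:' {} +
}
```

Running `find_module_yaml_with_providers .` from the root of an environment directory prints the files to review. Whether a file can be deleted outright still depends on whether anything else remains in it.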

Added

  • Adds common_env_from_config_maps and common_env_from_secrets inputs to all standard workload submodules to provide the capability to source environment variables from existing ConfigMaps and Secrets, respectively.

  • kube_pg_cluster and kube_redis_sentinel now support using Vault-generated credentials to authenticate from other workloads. See the module documentation for more information.

Fixed

  • Adds a controller node preference to pods with controller_nodes_enabled set to true. This optimizes resource efficiency in the cluster as we should prefer to fill controller (EKS) nodes before Karpenter nodes as controller nodes are not automatically scaled.

edge.24-09-10

Breaking Changes

  • Karpenter has updated its CRD specification which unfortunately requires manual intervention during the upgrade process. After updating the pf_stack_version for any deployments of the kube_karpenter_node_pools module, run the following commands in the kube_karpenter_node_pools folder:

    pf-providers-enable
    terragrunt state rm kubernetes_manifest.default_node_class \
      kubernetes_manifest.spot_node_class \
      kubernetes_manifest.burstable_node_class \
      kubernetes_manifest.burstable_node_pool \
      kubernetes_manifest.burstable_arm_node_pool \
      kubernetes_manifest.spot_node_pool \
      kubernetes_manifest.spot_arm_node_pool \
      kubernetes_manifest.on_demand_arm_node_pool \
      kubernetes_manifest.on_demand_node_pool
    terragrunt apply --auto-approve
    kubectl delete nodepools burstable burstable-arm on-demand on-demand-arm spot spot-arm
    kubectl delete ec2nc spot burstable on-demand
    

    The kubectl delete commands may take a few minutes to complete as this will force all pods to be rescheduled from nodes created using the old CRDs to nodes created using the new CRDs.

  • The ports input on kube_deployment and kube_stateful_set has been moved to a container-level field rather than a top-level field to better align with the Kubernetes API.

Added

  • Adds a new submodule, kube_service, for defining Kubernetes Services that are optimized for the Panfactum Stack. Additionally, integrates kube_service into kube_deployment and kube_stateful_set for automatic Service creation.

  • Adds extra_storage_classes input to the kube_aws_ebs_csi module.

Fixed

  • Addressed issue in kube_pg_cluster where non-superuser credentials created by Vault would not have access to database schemas other than public.

  • Addressed issue where our Terragrunt configuration would cause the version pinning for the goauthentik/authentik and alekc/kubectl infrastructure providers to be removed. This would cause issues when users ran terragrunt init -upgrade to update their lockfiles.

edge.24-09-04

Breaking Changes

  • Before applying this release, the buildkit-amd64 and buildkit-arm64 StatefulSets in the buildkit namespace must be removed (if kube_buildkit is deployed).

  • In preparation for our upcoming release, we cleaned up a handful of naming conventions which impact the inputs and outputs of several modules:

    • In kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, and kube_workload_utility:
      • ready_check_ prefixed fields have been changed to readiness_probe_ to better align with the actual Kubernetes API.
      • liveness_check_ prefixed fields have been changed to liveness_probe_ to better align with the actual Kubernetes API.
      • image and image_version have been replaced with image_registry, image_repository, and image_tag to provide a clearer description of each constituent part and better align with ecosystem conventions.
      • secrets has been renamed to common_secrets to better align with its counterpart, common_env.
      • pod_annotations has been renamed to extra_pod_annotations to better align with its counterpart, extra_pod_labels.
      • readonly has been renamed to read_only to better align with our casing conventions.
      • read_only_root_fs has been renamed to read_only for better consistency across modules.
      • instance_type_anti_affinity_required has been renamed to instance_type_spread_required to better reflect that the underlying mechanism is a pod topology spread constraint.
      • topology_spread_enabled has been renamed to az_spread_preferred to better reflect actual behavior.
      • topology_spread_required has been renamed to az_spread_required to better reflect actual behavior.
      • zone_anti_affinity_required has been renamed to az_anti_affinity_required to better align naming conventions with other settings that control scheduling based on availability zone.
    • Renamed Panfactum-provided priority classes to improve semantics (see docs).
    • In kube_pg_cluster and kube_redis_sentinel:
      • spot_instances_enabled, arm_instances_enabled, and burstable_instances_enabled have been changed to spot_nodes_enabled, arm_nodes_enabled, and burstable_nodes_enabled to better align with the inputs of other modules.
    • In kube_constants, a few outputs have been updated:
      • panfactum_image has been renamed to panfactum_image_repository to better align with naming conventions in other Panfactum modules.
      • panfactum_image_version has been renamed to panfactum_image_tag to better align with naming conventions in other Panfactum modules.
  • We have removed a handful of options from kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, and kube_workload_utility that we would never recommend using:

    • prefer_spot_nodes_enabled, prefer_burstable_nodes_enabled, prefer_arm_nodes_enabled: These scheduling preferences are unnecessary as Karpenter will already prefer the cheapest nodes.
    • az_anti_affinity_preferred: az_spread_preferred should be used instead.
  • When we introduced the concept of the enhanced_ha_enabled input, it was designed as a cost-saving switch for direct modules where users do not need to have a deep understanding of the internals. However, it has also found its way into some submodules where it has created ambiguity about module behavior, especially since its impact differs module-to-module. As a result, we have replaced the enhanced_ha_enabled input in all submodules with more granular tuning knobs that have clearer behavior. This impacts the following submodules: kube_pg_cluster, kube_redis_sentinel, kube_vault_proxy, kube_argo_event_bus, and kube_argo_event_source.

  • Nodes managed by EKS Node Groups (vs Karpenter) are now tainted with controller=true:NoSchedule. We have added this taint as pods scheduled on these nodes might be disrupted regardless of their PDBs during EKS upgrades. For some workloads this could cause a disruption. Most workload submodules have a new input, controller_nodes_enabled, that can be used to allow your workloads to tolerate this taint if desired.

  • Previously we were conservative about enabling certain features by default in some of our submodules in order to ensure our modules would be compatible with non-Panfactum Kubernetes clusters. However, this is a very niche use case, and we have observed that this results in extra mental overhead for our normal users to avoid missing out on the core features provided by the Panfactum Stack. As a result:

Added

  • Adds built-in default downward-api integrations in all our workload submodules.
  • All mounted ConfigMaps and Secrets in our workload submodules are now mounted as executable to make it easier to mount scripts.

Fixed

  • Updates Karpenter and EBS CSI Controller to prevent any remaining edge cases where nodes were terminated prior to EBS volumes being detached which would result in six-minute delays for rescheduling stateful pods.
  • Removes the RemoveDuplicates strategy in kube_descheduler as users expect to be able to schedule multiple pods of the same controller on the same node when they set host_anti_affinity_required to false.
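
For the input and output renames listed under this release's Breaking Changes, a search over your own IaC can help find names that still need updating. The sketch below is only an audit aid under stated assumptions: the function name is hypothetical, it looks only at .hcl and .tf files, and it deliberately skips the very generic names image and secrets because they would produce too many false positives.

```shell
# Hypothetical helper: print file:line:content for every use of an input or
# output name that this release renamed. Names are matched as whole
# identifiers, so extra_pod_annotations does not match pod_annotations.
find_old_input_names() {
  local root="${1:-.}"
  local names='ready_check_[A-Za-z0-9_]*|liveness_check_[A-Za-z0-9_]*'
  names="$names|image_version|pod_annotations|readonly|read_only_root_fs"
  names="$names|instance_type_anti_affinity_required|topology_spread_enabled"
  names="$names|topology_spread_required|zone_anti_affinity_required"
  names="$names|spot_instances_enabled|arm_instances_enabled|burstable_instances_enabled"
  names="$names|panfactum_image|panfactum_image_version"
  # /dev/null is passed as an extra file so grep always prints the file name.
  find "$root" -type f \( -name '*.hcl' -o -name '*.tf' \) \
    ! -path '*/.terragrunt-cache/*' \
    -exec grep -nE "(^|[^A-Za-z0-9_])($names)([^A-Za-z0-9_]|\$)" /dev/null {} +
}
```

Each hit still needs a manual check against the list above, since the same name can be renamed differently depending on the module it belongs to.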

edge.24-08-27

Breaking Changes

  • We removed the ability to disable S3 backups in kube_pg_cluster. The backups have an extremely low cost impact and significantly improve the durability of data. Moreover, the continuous WAL archiving provided by the backups improves our system's ability to automatically recover in the case of failover events.

    Ultimately, we found that the risk of misuse (resulting in unexpected data loss or downtime) significantly outweighed any potential benefits gained by providing this functionality.

Added

  • Added native support for restoring from database backups to the kube_pg_cluster submodule.

  • Added automatic creation of an immediate base backup to the kube_pg_cluster submodule to ensure that new databases can be recovered all the way up to their point of creation.

Fixed

  • Mitigated a rare scenario where disruption in the middle of a database failover would result in the PostgreSQL databases being unable to restart without manual intervention in the kube_pg_cluster submodule.

  • Fixed an issue where pf-get-repo-variables would provide the wrong directory for the root of the repository when run inside a downloaded .terragrunt-cache directory.

edge.24-08-24

Fixed

  • Addressed a couple of issues with the kube_authentik module:
    • authentik_core_resources will no longer fail to apply and end up in an invalid state when first created.
    • Authentik should no longer experience any downtime during database failover events.

edge.24-08-23

Fixed

  • Correctly sets PgBouncer permissions on new PostgreSQL cluster creation in kube_pg_cluster.

edge.24-08-22

Breaking Changes

  • The default behavior of kube_redis_sentinel was to use both Redis AOF and RDB for persistence. Unfortunately, using AOF concurrently with RDB negates Redis's ability to do partial resynchronizations after restarts and failovers. Instead, a full copy of the entire database must be transferred from the current master to replicas on every restart. This greatly increases the time-to-recover and incurs a high network cost.

    In fact, there is arguably no benefit to AOF-based persistence with our replicated architecture as new Redis nodes will always pull their data from the running master, not from their local AOF. The only benefit would be if all Redis nodes simultaneously failed with a non-graceful shutdown (an incredibly unlikely scenario).

    As a result, we have switched the module to use only RDB for persistence, and the redis_appendfsync input has been removed. The module still provides the ability to provide custom Redis configuration, so you can re-enable AOF if you want (though we would not advise it).

  • token_lifetime_seconds has been changed to token_lifetime_hours in vault_auth_oidc to avoid a perpetual diff issue present in the Vault provider.

  • Removed the daily backups from kube_velero as they were undocumented and had no realistic use case.

Added

  • Adds a new submodule, kube_disruption_window_controller, which can be used to specify time-based disruption windows for disruption-sensitive workloads (e.g., databases). Disruption window capabilities have also been added to kube_pg_cluster and kube_redis_sentinel.

  • Adds synchronous replication support to kube_pg_cluster via pg_sync_replication_enabled.

Fixed

  • Addressed issue where pg_smart_shutdown_timeout cannot be set to 0 in kube_pg_cluster without having CNPG reset it to 180.

  • Fixed an issue in kube_velero where stale EBS snapshots were not being deleted.

  • Added stricter disruption prevention to the Velero server in kube_velero as disrupting the server in the middle of a backup operation would cause it to fail and not be resumed.

edge.24-08-15

Breaking Changes

  • pg_shutdown_timeout has been renamed to pg_smart_shutdown_timeout to better indicate its purpose in kube_pg_cluster. Additionally, the shutdown and failover logic has been overhauled. The new default will immediately terminate running queries when a database pod is killed, but this serves to reduce the downtime from 60-120 seconds to < 5 seconds in the failover scenario. Please see the module documentation for more information.

Added

  • Adds the concept of passthrough parameters to wf_spec.

  • Makes tf_apply_dir a Workflow parameter in wf_tf_deploy so that you only need a single instance of this module per cluster.

  • Adds the ability to use templateRef to compose Workflows in wf_spec.

Fixed

  • Fixed the working directory in wf_tf_deploy and wf_dockerfile_build to be inside the cloned repository.

  • Addressed OOM errors when using resource templates with wf_spec.

edge.24-08-13

Breaking Changes

  • pg_storage_increase_percent has been changed to pg_storage_increase_gb in kube_pg_cluster. This allows for more predictable storage autoscaling and optimal resource provisioning regardless of the current storage scale.

  • pg_storage_gb has been changed to pg_initial_storage_gb in kube_pg_cluster. This better indicates that this value is only used during the initial database provisioning and has no effect thereafter.

  • node_vpc_id, node_subnets, and node_security_group_id have been moved from kube_karpenter to kube_karpenter_node_pools in order to simplify the logic of assigning nodes to subnets, VPCs, and security groups. Additionally, we have removed Karpenter auto-discovery tags as they are no longer necessary.
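
To illustrate why the switch from pg_storage_increase_percent to pg_storage_increase_gb makes growth more predictable, the sketch below compares the two step sizes. It assumes each resize adds either a percentage of the current volume size or a fixed number of GB; that model and the numbers in the usage note are illustrative assumptions, not taken from the module documentation.

```shell
# Illustrative arithmetic only: size of one resize step under a
# percent-based policy (grows with the volume) for a given size and percent.
percent_step_gb() {
  local current_gb="$1" percent="$2"
  echo $(( current_gb * percent / 100 ))
}
```

With a hypothetical 20% setting, `percent_step_gb 100 20` yields 20 and `percent_step_gb 1000 20` yields 200, so the same setting provisions ten times more storage per step on the larger volume. A fixed pg_storage_increase_gb adds the same amount at any scale.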

Added

  • Adds new enhancements to the kube_pg_cluster module:

    • Better defaults and options for memory tuning
    • Provides the ability to set arbitrary PostgreSQL parameters
    • Provides the ability to set a custom backup schedule
    • Adds support for additional schemas via the extra_schemas input
  • Adds another local retry for Terragrunt when providers produce an inconsistent final plan.

  • Adds check for an updated direnv version to prevent issues when setting up the local devenv.

Fixed

  • Added deterministic ordering to additional resources in authentik_core_resources.

  • Fixed the following bugs in pf-env-bootstrap:

    • Would use a non-existent AWS profile for the .sops.yaml file.
    • Would not install all the platform checksums in the .terraform.lock.hcl files.
  • amd64 nodes are now used when bootstrapping_enabled is true in aws_eks in order to allow certain bootstrapping tests (e.g., Cilium) to run successfully.

  • Restores the pf-db-tunnel command to the devenv.

  • pf-get-version-hash local now properly returns local without an error code.

  • Updates the Panfactum image version in kube_constants to a version that is compatible with the latest pre-built workflows.

edge.24-08-12

Breaking Changes

  • Repository variables must now be defined in a panfactum.yaml file located at the root of your repository instead of in your devenv.nix. Additionally, the variable names are no longer prefixed with PF_ and are lowercase.

    For example, env.PF_REPO_NAME in devenv.nix should now be defined as repo_name in panfactum.yaml.

    This change was made to make it easier to reference these values outside of local development contexts such as within CI pipelines where devenv.nix isn't loaded.
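
The naming rule (drop the PF_ prefix, then lowercase) can be sketched as a small helper for translating existing variable names. The function name is hypothetical and only demonstrates the mapping described above.

```shell
# Hypothetical helper: convert an old devenv.nix variable name such as
# PF_REPO_NAME into the key expected in panfactum.yaml.
pf_var_to_yaml_key() {
  # ${1#PF_} strips a leading "PF_" if present; tr lowercases the rest.
  printf '%s\n' "${1#PF_}" | tr '[:upper:]' '[:lower:]'
}
```

For example, `pf_var_to_yaml_key PF_REPO_NAME` prints `repo_name`.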

Added

  • We have provided two new addons, a Workflow Engine (Argo Workflows) and an Event Bus (Argo Events).

  • We have created a guide and best practices for setting up CI / CD in the Panfactum Stack.

  • To support the new addons, we are upgrading the following infrastructure modules to Beta status:

  • Adds pf-get-repo-variables which prints a JSON payload of all repository configuration variables with the appropriate defaults set.

edge.24-07-08

Breaking Changes

  • We have made a small, breaking refactor of aws_eks to reduce unnecessary options that made onboarding and maintenance more difficult:

    • Most importantly, users will no longer be able to set the instance type and count for nodes in EKS node groups. This flexibility is unnecessary since node provisioning is handled by Karpenter and not EKS. Moving forward, there are just two static configurations that are guaranteed to work in all use cases: one for before autoscaling is installed and one for after. This is controlled via the new input, bootstrap_mode_enabled (default: false).
    • control_plane_version and controller_node_kube_version have been unified into a single variable called kube_version that applies to all subsystems.
    • controller_node_subnets has been renamed to node_subnets to indicate these subnets are used for all cluster nodes, not just the EKS node groups.
    • all_nodes_allowed_security_groups has been renamed to node_security_groups to align naming conventions.
  • By default, PVCs created by controllers such as StatefulSets cannot be updated through their controller as their template (volumeClaimTemplates) is immutable (a Kubernetes limitation). This poses a challenge when needing to update PVC metadata such as annotations and labels. We have built a workaround to this (kube_pvc_annotator) and incorporated it in various Panfactum modules. Unfortunately, incorporating this enhancement requires redeploying StatefulSets.

    To complete this upgrade, perform the following steps:

    1. Create a Velero backup of the cluster by running velero create backup -w <backup_name> to recover in case of mistakes.

    2. The following StatefulSets need to be deleted in this order AND with kubectl delete --cascade=orphan AND immediately restored with a subsequent terragrunt apply to their defining module:

      • The Vault StatefulSet created by kube_vault
      • The Redis cluster StatefulSet for Authentik created by kube_authentik
      • The BuildKit StatefulSets created by kube_buildkit
      • Any StatefulSets you have provisioned with kube_stateful_set
      • Any Redis cluster StatefulSets you have provisioned with kube_redis_sentinel

      As long as you use --cascade=orphan and take care to minimize the time between the kubectl delete and terragrunt apply, there will not be any downtime during this operation.

    3. After completing this operation, you need to delete the backing PVCs from each module one at a time by deleting the PVC and then deleting its bound pod. The controller will then automatically provision a new PVC with the correct labels and annotations to take advantage of the new functionality.

      After deleting each pod, ensure that a new pod is automatically provisioned and becomes healthy before proceeding to the next. As long as you proceed one at a time, this will not cause any downtime or data loss.

    4. Delete the Velero backup you created in step 1 by running velero delete backup <backup_name>.
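
Steps 2 and 3 above can be sketched as two shell functions. This is a minimal sketch rather than an official procedure: the function names, and the namespace, resource, and directory arguments, are placeholders you would substitute for each StatefulSet.

```shell
# Step 2 (sketch): delete a StatefulSet while leaving its pods running,
# then immediately re-create it from its defining module.
recreate_statefulset() {
  local namespace="$1" name="$2" module_dir="$3"
  kubectl delete statefulset "$name" --namespace "$namespace" --cascade=orphan || return 1
  (cd "$module_dir" && terragrunt apply)
}

# Step 3 (sketch): replace one PVC by deleting it and then its bound pod.
replace_pvc() {
  local namespace="$1" pvc="$2" pod="$3"
  # --wait=false returns immediately; the PVC remains in Terminating
  # until the pod that uses it has been deleted.
  kubectl delete pvc "$pvc" --namespace "$namespace" --wait=false || return 1
  kubectl delete pod "$pod" --namespace "$namespace" || return 1
  echo "Confirm that the replacement for $pod is healthy before continuing."
}
```

The functions intentionally do not automate the health check between pods. As noted in step 3, confirm that each replacement pod is healthy before moving on to the next PVC.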

Added

  • Adds kube_fledged to the core stack. The kube-fledged controller adds the ability to pre-pull images to every node to improve pod startup times for critical or frequently used containers such as the Linkerd proxy or database images. We provide instructions for installing this module here.

  • Adds the kube_pvc_annotator submodule that will provision a CronJob to run pf-set-pvc-metadata against PVCs created by immutable templates. See the module documentation for potential use cases.

  • Adds persistence_backups_enabled (default: true) to kube_redis_sentinel to support disabling EBS snapshot backups.

  • Adds a new common variable, node_image_cache_enabled, to Panfactum modules that can be used to enable pre-pulling images to nodes via the kube_fledged operator.

  • Adds the pf-buildkit-clear-cache command for removing any BuildKit caches not being used by an active image build job.

  • Adds the pf-set-pvc-metadata utility command for syncing labels and annotations across groups of PVCs.

Fixed

  • Fixes handling of public ECR registries in docker-credential-panfactum.

  • Fixes handling of ECR token caching in docker-credential-panfactum.

  • Fixes pf-get-open-port to be platform-agnostic.

  • Fixes pf-get-version-hash to work with commit hash inputs.

  • Fixes image paths in the Authentik dashboard for applications provisioned by Panfactum modules.

edge.24-07-01

Breaking Changes

  • The input format to aws_ecr_repos has been reformatted to support better per-repository configuration. This should not require replacing any resources, but it will require updating your Terragrunt inputs.

  • The following resources will no longer be tagged with the Panfactum version and commit hash as updates cause unnecessary delays and disruptions during updates for little added value:

Added

  • kube_buildkit has graduated to beta and is now ready for general consumption. This is the first stack addon that can be used to extend the behavior of the core stack. Installation and usage instructions can be found here.
  • aws_ecr_repos now supports custom image expiration rules and both pull and push permissions.

  • aws_ecr_public_repos has been added to support creating public ECR repositories.

  • Adds ARM support in kube_bastion and kube_pvc_autoresizer. All core cluster components can now be run on both amd64 and arm64 nodes to allow for optimal cost savings.

  • Changes the default securityContext.fsGroupChangePolicy to OnRootMismatch for Pods created by Panfactum submodules in order to improve PVC mounting performance.

  • pf-providers-enable now ensures that .terraform.lock.hcl files have all common platform checksums.

  • Adds pf-get-terragrunt-variables which can be used to derive the Terragrunt variables that would be used if Terragrunt were run in the given directory.

  • Adds pf-tf-delete-locks which can be used to bulk-release Tofu state locks.

  • Adds pf-sops-set-profile which will update all sops-encrypted files in the given directory to use the indicated AWS profile for KMS operations. This can be used in CI pipelines to allow the CI user to access sops-encrypted files.

Fixed

  • kube_aws_ebs_csi has been adjusted to ensure that PVCs are detached from nodes during node shutdown, preventing unnecessary delays in moving PVCs between nodes.

  • kube_core_dns no longer accidentally includes the Vault provider.

  • kube_ingress_nginx will no longer unnecessarily set browser security headers on 3xx responses or responses that do not have Content-Type headers.

edge.24-06-20

Breaking Changes

  • kube_karpenter has upgraded the Karpenter version to v0.37. During this release cycle, the Karpenter team moved the CRDs required by Karpenter to a dedicated Helm chart to improve the upgrade ergonomics. Unfortunately, this introduces a few one-time manual steps that you must perform to enable the migration. Specifically, the following commands must be run against your cluster before applying the latest version of kube_karpenter:

    kubectl label crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh app.kubernetes.io/managed-by=Helm --overwrite
    kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-name=karpenter-crd --overwrite
    kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-namespace=karpenter --overwrite
    
  • kube_karpenter_node_pools has a new input node_labels which defines what labels will be applied to generated nodes. The standard Panfactum labeling system will no longer apply to Karpenter nodes due to this upstream issue.

  • The persistence_enabled option was removed from kube_redis_sentinel. Redis is now always deployed with persistence enabled. This decision was made because the cross-AZ network costs of re-instantiating Redis nodes without PVC storage dwarf the costs of the PVC storage (by a factor of 100x). As a result, there is no benefit to not periodically saving the Redis database to a persistent disk.

    To compensate for potential performance impacts, we have exposed another input, redis_appendfsync. Setting this to "no" will achieve the same performance as having persistence disabled. However, the default setting of "everysec" is likely sufficient for the vast majority of use cases and reduces the risk of data loss.

    Unfortunately, if you were previously running with persistence_enabled set to false, you will need to delete the Redis StatefulSets in order to apply the new module.

    In particular, this impacts the kube_authentik module. Before deleting the Redis StatefulSet for Authentik, ensure your Vault token is not expired as you will not be able to re-authenticate with Authentik while the Redis StatefulSet is removed.

    Since persistence_enabled should only have been used in scenarios where data retention was not important, this should be considered a safe operation. However, it will introduce a minor service disruption during the replacement period.

  • aws_ecr_pull_through_cache_addresses has been refactored to improve the ergonomics of using the module. It now requires an input, pull_through_cache_enabled, and will output the correct registry names regardless of whether using a pull through cache or not.

Added

  • Adds the pf-providers-enable command that will automatically inspect the source infrastructure module and enable the required providers in a module's module.yaml.

  • Adds the pf-update-iac command that will update first-party infrastructure modules in the following ways:

    • Executes the templating directives.

    • Updates the ref in sourced Panfactum submodules to the commit hash of the devenv if the # pf-update annotation is provided. See the documentation for more details.

  • Adds phone number validation in aws_account.

  • Adds cors_enabled (default: false) input variable to kube_vault that can enable CORS handling.

    This can be useful when building web applications that interact with Vault in client-side JavaScript. By default, this will allow CORS requests from all sibling and child domains.

Fixed

  • Addresses an issue in kube_authentik that prevented the SSO login pop-up from working.

  • Implements custom CORS handling logic in kube_ingress that resolves issues in the default behavior provided by the NGINX ingress controller.

  • Removes invalid failure cases when using pf-get-vault-token in Terragrunt and improves failure messaging.

  • Fixes an issue that occurs when the kubernetes provider is enabled but the sourced module does not use the kubectl provider.

  • Fixes failure cases in pf-env-scaffold and adds more debug logging.

edge.24-06-14

Added

  • Adds kube_scheduler, an alternative Kubernetes scheduler that can be used to improve bin-packing of pods on nodes in the Kubernetes cluster. This allows for better, smaller node selection and our tests show an estimated 25-33% reduction in node costs when used. We provide instructions for installing it here.

  • Adds panfactum_scheduler_enabled (default: false) input to most infrastructure modules. When enabled, will use the scheduler provided by kube_scheduler instead of the less-efficient EKS scheduler.

  • If panfactum_scheduler_enabled is true, the kube_descheduler will automatically remove pods from low utilization nodes to allow the kube_scheduler to bin-pack them on other nodes.

Fixed

  • Addresses a bug in the previous release that left kube_karpenter not deployable.

  • Addresses an issue where nodes were limited to a hard cap of 29 pods.

  • Configures Kubernetes nodes to use a fixed amount of system overhead rather than one that scales unnecessarily with node size.

edge.24-06-13

Added

  • Updates kube_pg_cluster with many new variables for configuring PgBouncer. New variables are prefixed with pgbouncer_.

  • Adds support for path_prefix to kube_vault_proxy (@mschnee)

  • Adds new enhanced_ha_enabled input to many core modules (default true). Setting this to false will allow for additional cost savings (approximately $50 / month) in exchange for introducing a small possibility of temporary outages. We estimate that setting this to false reduces availability from 99.995% to 99.9%. This can be used to decrease costs in less critical clusters (e.g., development).

  • Adds a Spot Data Feed to the aws_account module.

  • Adds the kube_open_cost module for calculating the cost of workloads running on Kubernetes.
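
To put the availability estimates quoted for enhanced_ha_enabled into concrete terms, the sketch below converts an availability percentage into expected downtime per year. The function name is hypothetical, and the calculation simply assumes a 365-day year.

```shell
# Illustrative arithmetic: expected minutes of downtime per 365-day year
# for a given availability percentage.
downtime_minutes_per_year() {
  awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}
```

`downtime_minutes_per_year 99.995` prints 26.3 and `downtime_minutes_per_year 99.9` prints 525.6, so disabling the setting moves the estimate from roughly 26 minutes to about 8.8 hours of downtime per year.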

Fixed

  • Addressed issue in aws_vpc where NAT nodes wouldn't restart if NAT setup failed with an exit code other than 1.

  • Increased the memory floor of the Authentik server in kube_authentik to avoid OOM issues.

  • Updates kube_authentik to allow showing Gravatar profile images.

  • Updates kube_authentik to provide the necessary Permissions-Policy headers to allow use of WebAuthn devices.

  • Correctly applies pod labels in kube_aws_lb_controller.

  • Removes node preferences defaults from kube_workload_utility that were preventing efficient node deprovisioning.

  • Adjusts the VPA recommendation overhead from 30% to 15% to improve resource utilization.

  • Fixes incorrect SCIM property mapping in authentik_aws_sso.

  • Aligns pod labels, affinities, topologySpreadConstraints, and tolerations in kube_linkerd to conventions used in all other modules.

edge.24-06-08

Added

  • Updates aws_vpc to support new command pf-vpc-network-test that will verify network connectivity properties of the instantiated VPC. This allows us to simplify an otherwise complex validation step in the bootstrapping guide.

  • Adds the pf-env-bootstrap command that automatically bootstraps the necessary resources to begin working with IaC in an environment. This replaces the manual steps that used to be a part of the bootstrapping guide.

  • Adds new extra_inputs terragrunt variable that allows you to pass inputs to all modules in the current scope.

  • Adds arm64 NodePools and arm64 support for the core components. This reduces the cost of running the base stack by $25 - 50 / month due to significantly better price / performance ratios for arm64 instances in AWS.

  • Sets unhealthyPodEvictionPolicy to AlwaysAllow for all module PDBs. This will allow the system to scale up quicker when running against resource pressure and pods become stuck in a temporary crash loop.

  • Sets maximum node lifetime to 24h to force Karpenter to try to consolidate instances at least once per day.

Fixed

  • Addressed issue where the aws-ebs-csi-driver DaemonSet pods would not be properly terminated by Karpenter during node shutdown. This resulted in EBS volumes not being detached and introduced an unnecessary 6min delay when moving EBS volumes between nodes.

  • Replaces most usages of kubernetes_manifest with kubectl_manifest to avoid manifest type-parsing issues that prevent dynamic values in manifests.

edge.24-06-06

Breaking Changes

  • kube_trust_manager has been deprecated as its functionality was redundant with kube_reflector. We are keeping the module in the repo to support backwards compatibility, but it will be removed in the future. You should perform the following steps to remove it:
    • Apply this release.
    • Remove any dependency blocks to it in your terragrunt.hcl files.
    • Run terragrunt destroy on the module to remove it.
    • Delete the bundles CRD.

Added

  • aws_registered_domains can now set the contact type for each contact.

  • Allow users to reference availability zones by single character (e.g., a) in addition to the full name (e.g., us-east-2a) in the aws_vpc module.

  • The manual steps needed to reset new EKS clusters to a clean slate during the bootstrapping guide have been consolidated into a single new command, pf-eks-reset.
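The single-character availability zone shorthand described above for aws_vpc can be pictured as a small normalization step. The following is an illustrative sketch only: `expand_az` is a hypothetical helper, and the actual module logic is not shown in these notes.

```python
import re


def expand_az(az: str, region: str) -> str:
    """Expand a single-letter AZ reference (e.g. "a") into a full AZ name.

    Values that are not a single lowercase letter are assumed to already be
    full names and are returned unchanged.
    """
    if re.fullmatch(r"[a-z]", az):
        return f"{region}{az}"
    return az


print(expand_az("a", "us-east-2"))           # us-east-2a
print(expand_az("us-east-2a", "us-east-2"))  # us-east-2a
```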

Fixed

  • Addressed issue in aws_vpc that caused a temporary, harmless error to crash the terragrunt apply on initial bootstrapping.

  • Fixed issue where Cilium test suites would fail during bootstrapping due to a NetworkPolicy blocking the kube_core_dns module.

edge.24-06-04

Breaking Changes

  • The reloader deployment must be deleted before the next apply of kube_reloader. No inputs have changed.

  • The alpha module kube_labels has been removed in favor of the labels provided by kube_workload_utility.

  • VPC flow logs in aws_vpc are now disabled by default as they can be fairly expensive and should only be used if you have a specific use-case in mind. They can be enabled by setting vpc_flow_logs_enabled to true.

Added

  • Added new pf-env-scaffold script that takes care of setting up the PF_ENVIRONMENTS_DIR in the bootstrapping guide section for setting up terragrunt.

  • Added kube_workload_utility to make it easier to create uniform, production-hardened Pod specs that take advantage of all capabilities included in the Panfactum stack.

  • A new standard label panfactum.com/workload can be used to group replicated pods for the purpose of aggregating metrics. This is now applied in all core infrastructure modules.

  • Added kube_constants that export static configuration values that can be useful when creating resources that run on clusters in the Panfactum stack.

  • kube_cert_manager will now automatically delete Certificate secrets if the Certificate is deleted.

  • aws_ses_domain now takes an optional input smtp_allowed_cidrs that restricts what IPs can use the generated SMTP credentials. This allows users to mitigate credential exfiltration attacks. We provide an example of how to use this here.

  • The Vault login UI will now have the OIDC login as the default method.

  • Terragrunt will now automatically retry on some errors up to three times before exiting the process with a failure. This should address intermittent issues such as network disruptions or race conditions.
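The retry behavior described above follows a common pattern: retry only on errors known to be transient, and give up after a fixed number of retries. The sketch below illustrates that pattern in general terms. It is not the Terragrunt configuration itself; `run_with_retries` and the error markers are hypothetical, and it assumes "up to three times" means three retries after the initial attempt.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical markers for errors treated as transient.
RETRYABLE_MARKERS = ("connection reset", "timeout")


def run_with_retries(
    action: Callable[[], T],
    max_retries: int = 3,
    sleep_seconds: float = 0.0,
    retryable: Sequence[str] = RETRYABLE_MARKERS,
) -> T:
    """Run `action`, retrying on transient RuntimeErrors up to `max_retries` times."""
    retries = 0
    while True:
        try:
            return action()
        except RuntimeError as err:
            message = str(err).lower()
            transient = any(marker in message for marker in retryable)
            # Re-raise immediately for non-transient errors or once retries run out.
            if not transient or retries >= max_retries:
                raise
            retries += 1
            time.sleep(sleep_seconds)
```

Non-transient errors fail fast, so genuine configuration mistakes are still reported immediately rather than after several attempts.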

Fixed

  • .env files are now properly loaded into the shell environment and changes will trigger fast reloads instead of full devenv re-evaluations.

  • Temporarily adds GIT_CLONE_PROTECTION_ACTIVE=false to the shell environment in order to address this issue. Note that this only disables new bleeding edge security features which were accidentally shipped in a broken state.

  • Adjusts base resource requests of core infrastructure modules to prevent temporary OOM errors when bootstrapping before the VPA takes effect.

  • kube_authentik now respects log_level input.

  • Sets max_history to 5 for all Helm charts to prevent overloading the Kubernetes API server with an ever-growing amount of historical Helm deployments.

edge.24-06-02

Breaking Changes

  • Upgraded to devenv 1.0. As a part of this upgrade, .env file values can no longer be referenced directly inside .nix files.

Added

  • Updated kube_redis_sentinel to automatically limit client buffer size to prevent OOM issues when processing very bursty traffic.

  • Added pf-update command that runs all the repository scaffolding commands at once.

Fixed

  • Addressed an issue that caused updates to the local devenv to take at least 10 minutes to rebuild on macOS. Rebuilds should now be 10-15x faster, but they will still take about 45 seconds at minimum. Note that this only impacts rebuilds and not normal direnv load times, which should still be instant.

    This is a known limitation of upstream nix's derivation evaluation caching when using flakes. We expect this to be addressed when flakes reach stability.

  • Added missing defaults for PF_ENVIRONMENTS_DIR and PF_IAC_DIR.

  • Resolved an issue where devenv warnings could not be resolved during the initial bootstrapping guide.

  • Added extra validation for the terragrunt variable extra_tags. Invalid characters will now be replaced with . for both keys and values for both Kubernetes labels and AWS tags.

  • Fixed some core components that were using all Kubernetes labels for labelSelector matching rules which prevented Karpenter from autoscaling when extra_tags was provided. This previously manifested as the error spec.requirements: Too many: #: must have at most 30 items.

  • Added extra constraints to kube_external_dns to prevent it from attempting to query zones that it isn't managing.

  • Prevented kube_external_dns from excluding parent domains of included domains.
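The extra_tags validation described above replaces invalid characters with `.` in both keys and values. A minimal sketch of that kind of sanitization follows. The allowed character set here (alphanumerics, `.`, `_`, `-`) is an assumption based on what Kubernetes label values permit; the exact rules Panfactum applies are not stated in these notes, and `sanitize` and `sanitize_tags` are hypothetical names.

```python
import re
from typing import Dict

# Assumption: anything outside alphanumerics, '.', '_' and '-' is invalid.
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9._-]")


def sanitize(text: str) -> str:
    """Replace every invalid character with '.'."""
    return _INVALID_CHARS.sub(".", text)


def sanitize_tags(tags: Dict[str, str]) -> Dict[str, str]:
    """Sanitize both the keys and the values of a tag mapping."""
    return {sanitize(key): sanitize(value) for key, value in tags.items()}


print(sanitize_tags({"cost center": "team:platform/core"}))
# {'cost.center': 'team.platform.core'}
```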

edge.24-05-30

Breaking Changes

  • The default for vault_storage_size_gb in kube_vault has been changed from 20 to 2 in order to improve resource utilization. If you created Vault with the old default, you will need to manually set vault_storage_size_gb to 20 as volume sizes cannot be reduced after creation.

Added

  • (Alpha) Added the Loki logging backend via kube_logging and the Alloy log collector via kube_alloy.

  • The PVC Autoresizer has been added via the kube_pvc_autoresizer module in order to automatically expand EBS volumes as they fill up. We provide the guide for deploying it here.

Fixed

  • Resolved issue where scheduling constraints could not be resolved for components deployed before Karpenter (#41)

edge.24-05-23

Breaking Changes

  • We have removed the EKS CoreDNS addon and replaced it with the kube_core_dns module in order to provide better guarantees about the behavior of DNS in the Panfactum stack. In order to migrate:

    1. Add the dns_service_ip input to aws_eks deployments by following this guide. Double check that the dns_service_ip is the same IP as defined by kube-system/kube-dns.

    2. Additionally, set core_dns_addon_enabled to true.

    3. Apply the updated aws_eks module.

    4. Add the cluster_dns_service_ip input to your kube_karpenter_node_pools module like this, and re-apply the module. Ensure that all of your nodes have been replaced with the new configuration.

    5. Deploy kube_core_dns by following this guide. Note that this deployment will fail as the original addon service is still running and the IP is already taken.

    6. Delete kube-system/kube-dns and re-apply kube_core_dns. Note that while the service is deleted, DNS will be temporarily unavailable in your cluster.

    7. Once you've validated that DNS is working in the cluster, remove the core_dns_addon_enabled input from the aws_eks module and re-apply.

  • We have stabilized the label selectors in kube_pod but this requires one final label update for already-deployed Deployments. This will cause re-applies of kube_bastion to fail (and any first-party modules that rely on kube_deployment). To resolve, you must first manually delete the bastion/bastion deployment (and all other deployments created by kube_deployment).

  • kube_pg_cluster has two new flags, pgbouncer_read_only_enabled (default false) and pgbouncer_read_write_enabled (default true), which will enable the r and rw poolers, respectively. This will enable users to better control what is deployed so as not to have idle resources. This is a breaking change as pgbouncer_read_only_enabled is set to false by default.

Added

  • (Alpha) We've added a monitoring stack kube_monitoring which includes HA Prometheus, the Prometheus Operator, Thanos metrics storage on S3 (with deduplication, caching, and down-sampling), the Node Exporter, kube-state-metrics, Alertmanager, and Grafana (with SSO enabled and 20+ custom dashboards).

    Additionally, most modules now have an additional monitoring_enabled (default false) flag that can be turned on to begin shipping data to Prometheus for viewing and querying via Grafana.

  • (Alpha) kube_cilium now has a new debugging mode, hubble_enabled (default false), that will capture extensive TCP-level metrics about the cluster as well as expose a debugging UI via HTTPS.

  • (Alpha) kube_linkerd now deploys Linkerd Viz when monitoring_enabled = true. This provides a service mesh dashboard and the ability to capture and introspect raw HTTP requests sent in realtime.

  • (Alpha) We've added the Argo Workflow engine to the stack via the kube_argo module. This will serve as the basis for the future, integrated CI / CD systems and can also be used to process arbitrary events from event queues such as AWS SNS/SQS and Kafka. (@jlevydev)

  • Added a new module, kube_vault_proxy, that can be used to add SSO to web assets that do not have integrated SSO. The module's SSO is configured out-of-the-box to work with the cluster's Vault instance.

  • We've included a new Kubernetes provider, kubectl, to augment the original kubernetes provider. The kubectl provider allows more flexibility in deploying raw Kubernetes manifests, which is required by our templating system. This provider will automatically be enabled when the kubernetes provider is enabled, so no additional changes are required from end users.

  • kube_redis_sentinel has a new flag, lfu_cache_enabled, that will configure the Redis cluster to automatically evict records under memory pressure based on an approximated Least Frequently Used algorithm.
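For readers unfamiliar with Least Frequently Used eviction, which lfu_cache_enabled relies on, the sketch below shows the basic idea with an exact LFU cache. This is a deliberate simplification: Redis uses an approximated LFU, whereas this example tracks exact hit counts. `LFUCache` is a hypothetical illustration and not how Redis or the module is implemented.

```python
from collections import Counter
from typing import Any, Dict, Hashable, Optional


class LFUCache:
    """A tiny exact-LFU cache: when full, evict the least frequently used key."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._values: Dict[Hashable, Any] = {}
        self._hits: Counter = Counter()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value (or None) and count the access."""
        if key not in self._values:
            return None
        self._hits[key] += 1
        return self._values[key]

    def set(self, key: Hashable, value: Any) -> None:
        """Store a value, evicting the least frequently used key if full."""
        if key not in self._values and len(self._values) >= self.capacity:
            victim = min(self._values, key=lambda k: self._hits[k])
            del self._values[victim]
            del self._hits[victim]
        self._values[key] = value
        self._hits[key] += 1


cache = LFUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now used more often than "b"
cache.set("c", 3)  # the cache is full, so "b" (least used) is evicted
```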

Changed

  • Added the standard Restricted Reader role to Vault instances (rbac-restricted-reader) and updated vault_auth_oidc to take restricted_reader_groups. Since cluster resources authenticate with SSO via Vault, this allows restricted readers to access additional cluster resources such as Grafana and Argo Workflows (albeit, in a locked-down read-only mode).

  • Disabled evictions of database pods based on max lifetimes. This improves the stability of databases deployed by Panfactum modules.

  • After completing the bootstrapping guide, we now recommend that users update their aws_eks cluster modules to have controller_node_count set to 1 and controller_node_instance_types set to ["t3a.medium"]. This will decrease the costs of the base cluster by about 40% without impacting cluster availability or resiliency. The single remaining node is used primarily as a place for Karpenter to run (Karpenter cannot run on instances that it itself provisions).

  • kube_karpenter now only deploys a single instance of Karpenter and enforces that it is run on a controller node. This reduces the overall resource utilization of this fairly heavyweight controller.

  • Kubernetes labels applied via the extra_tags terragrunt input are now sanitized for valid characters automatically (invalid characters are replaced with .). (@mschnee)

  • Added scheduling constraints to prevent critical workloads from scheduling all pods on the same instance type in order to minimize the possibility of disruption on events that only affect one instance type (e.g., spot node preemption).

  • Changes many other non-critical core controllers to only have a single replica when 100% uptime is not necessary in order to reduce resource utilization in the Stack.

  • Updates many controller deployments to use the Recreate deployment strategy to improve timing and efficiency of applying Panfactum upgrades.

  • kube_vpa has a new input, history_length_hours (default 24), that controls how far back it analyzes metrics when computing its recommendations.

Fixed

  • PVCs for postgres instances were inadvertently created with duplicated entries for accessModes. This has no functional impact, but confused monitoring systems. This has been fixed, but the fix will not retroactively adjust existing PVCs as they are immutable.

edge.24-05-15

Breaking Changes

  • kube_vault now takes vault_domain as an input instead of environment_domains. This change was made as having multiple domains for Vault is incompatible with using Vault as an intermediary IdP.

Added

  • New kube_reflector module for deploying the Reflector in order to synchronize ConfigMaps and Secrets across namespaces. Created a new guide section for deploying the module as a part of the foundational Stack.

  • Added the pg_shutdown_timeout variable to kube_pg_cluster to control how long the postgres instances will wait for active connections to close before shutting down.

Fixed

  • Fixed an issue where simultaneous, graceful shutdown of all postgres nodes in a kube_pg_cluster would cause unnecessary downtime when the primary was running on a spot instance.

edge.24-05-12

The initial edge release of the Panfactum stack!