Edge Releases
Edge releases do not receive patches nor make any backwards compatibility guarantees. Learn more here.
To use stable Panfactum releases, please see our available licenses.
To upgrade your Panfactum stack version, please follow the instructions in the upgrade guide.
edge.24-09-12
Breaking Changes
- The kube_secrets_csi module has been deprecated and should be removed from your clusters. It was primarily used for managing dynamically generated Vault secrets such as database credentials, but we have switched to a new paradigm that uses the Vault Secrets Operator.
  This saves approximately 150MB of memory per node in the cluster, improves security by removing pods that needed elevated host-level permissions, and provides better ergonomics for managing dynamically generated secrets in our modules.
- The superuser_username and superuser_password outputs of kube_pg_cluster and kube_redis_sentinel have been renamed to root_username and root_password, respectively. We made this change because "superuser" implies Vault-generated credentials, which these are not.
- pf-providers-enable has been renamed to pf-tf-init as it now has expanded functionality:
  - Now influences every module in the directory tree where it is run rather than just the module in the CWD.
  - Now runs init -upgrade on every module to update provider versions and download internal submodules when performing Panfactum version upgrades.
  - The runtime speed has been improved in order to accommodate running against many modules at once.
  We have updated the upgrade guide to reflect that pf-tf-init should be run every time you upgrade the Panfactum version in an environment.
- You no longer need to manually enable providers via the providers array in each module.yaml. Our Terragrunt configuration now automatically detects which providers need to be included at runtime.
  No changes are required to take advantage of this new functionality. However, the providers Terragrunt input no longer has any functionality and the providers array can be removed from all module.yaml files. If this leaves a module.yaml empty, the entire module.yaml file can be deleted.
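Since the `providers` array no longer has any effect, one way to locate leftover entries is to search for them. The following is a minimal sketch, assuming `providers` appears as a top-level key at the start of a line in each `module.yaml`; the function name `find_provider_arrays` is hypothetical and is not part of the Panfactum CLI.

```shell
# List every module.yaml under a directory that still declares a top-level
# `providers` key. Assumption: the key starts at column 0 ("providers:").
find_provider_arrays() {
  local root="${1:-.}"
  # grep -l prints only the names of files that contain a match.
  # "|| true" keeps the function from failing when nothing matches.
  find "$root" -type f -name 'module.yaml' \
    -exec grep -lE '^providers:' {} + || true
}
```

Running `find_provider_arrays .` from the root of an environment directory prints the files to review; the array still has to be removed by hand or with your editor of choice.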
Added
- Adds common_env_from_config_maps and common_env_from_secrets inputs to all standard workload submodules to provide the capability to source environment variables from existing ConfigMaps and Secrets, respectively.
- kube_pg_cluster and kube_redis_sentinel now support using Vault-generated credentials to authenticate from other workloads. See the module documentation for more information.
Fixed
- Adds a controller node preference to pods with controller_nodes_enabled set to true. This improves resource efficiency in the cluster: controller (EKS) nodes are not automatically scaled, so they should be filled before Karpenter nodes.
edge.24-09-10
Breaking Changes
- Karpenter has updated its CRD specification, which unfortunately requires manual intervention during the upgrade process. After updating the pf_stack_version for any deployments of the kube_karpenter_node_pools module, run the following commands in the kube_karpenter_node_pools folder:

    pf-providers-enable
    terragrunt state rm kubernetes_manifest.default_node_class \
      kubernetes_manifest.spot_node_class \
      kubernetes_manifest.burstable_node_class \
      kubernetes_manifest.burstable_node_pool \
      kubernetes_manifest.burstable_arm_node_pool \
      kubernetes_manifest.spot_node_pool \
      kubernetes_manifest.spot_arm_node_pool \
      kubernetes_manifest.on_demand_arm_node_pool \
      kubernetes_manifest.on_demand_node_pool
    terragrunt apply --auto-approve
    kubectl delete nodepools burstable burstable-arm on-demand on-demand-arm spot spot-arm
    kubectl delete ec2nc spot burstable on-demand

  The kubectl delete commands may take a few minutes to complete as this will force all pods to be rescheduled from nodes created using the old CRDs to nodes created using the new CRDs.
- The ports input on kube_deployment and kube_stateful_set has been moved to a container-level field rather than a top-level field to better align with the Kubernetes API.
Added
- Adds a new submodule, kube_service, for defining Kubernetes Services that are optimized for the Panfactum Stack. Additionally, integrates kube_service into kube_deployment and kube_stateful_set for automatic Service creation.
- Adds extra_storage_classes input to the kube_aws_ebs_csi module.
Fixed
- Addressed issue in kube_pg_cluster where non-superuser credentials created by Vault would not have access to database schemas other than public.
- Addressed issue where our Terragrunt configuration would remove the version pinning for the goauthentik/authentik and alekc/kubectl infrastructure providers. This would cause issues when users ran terragrunt init -upgrade to update their lockfiles.
edge.24-09-04
Breaking Changes
- Before applying this release, the buildkit-amd64 and buildkit-arm64 StatefulSets in the buildkit namespace must be removed (if kube_buildkit is deployed).
- In preparation for our upcoming release, we cleaned up a handful of naming conventions which impact the inputs and outputs of several modules:
  - In kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, and kube_workload_utility:
    - ready_check_ prefixed fields have been changed to readiness_probe_ to better align with the actual Kubernetes API.
    - liveness_check_ prefixed fields have been changed to liveness_probe_ to better align with the actual Kubernetes API.
    - image and image_version have been replaced with image_registry, image_repository, and image_tag to provide a clearer description of each constituent part and better align with ecosystem conventions.
    - secrets has been renamed to common_secrets to better align with its counterpart, common_env.
    - pod_annotations has been renamed to extra_pod_annotations to better align with its counterpart, extra_pod_labels.
    - readonly has been renamed to read_only to better align with our casing conventions.
    - read_only_root_fs has been renamed to read_only for better consistency across modules.
    - instance_type_anti_affinity_required has been renamed to instance_type_spread_required to better reflect that the underlying mechanism is a pod topology spread constraint.
    - topology_spread_enabled has been renamed to az_spread_preferred to better reflect actual behavior.
    - topology_spread_required has been renamed to az_spread_required to better reflect actual behavior.
    - zone_anti_affinity_required has been renamed to az_anti_affinity_required to better align naming conventions with other settings that control scheduling based on availability zone.
  - Renamed Panfactum-provided priority classes to improve semantics (see docs).
  - In kube_pg_cluster and kube_redis_sentinel:
    - spot_instances_enabled, arm_instances_enabled, and burstable_instances_enabled have been changed to spot_nodes_enabled, arm_nodes_enabled, and burstable_nodes_enabled to better align with the inputs of other modules.
  - In kube_constants, a few outputs have been updated:
    - panfactum_image has been renamed to panfactum_image_repository to better align with naming conventions in other Panfactum modules.
    - panfactum_image_version has been renamed to panfactum_image_tag to better align with naming conventions in other Panfactum modules.
- We have removed a handful of options from kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, and kube_workload_utility that we would never recommend using:
  - prefer_spot_nodes_enabled, prefer_burstable_nodes_enabled, prefer_arm_nodes_enabled: These scheduling preferences are unnecessary as Karpenter will already prefer the cheapest nodes.
  - az_anti_affinity_preferred: az_spread_preferred should be used instead.
- When we introduced the concept of the enhanced_ha_enabled input, it was designed as a cost-saving switch for direct modules where users do not need to have a deep understanding of the internals. However, it has also found its way into some submodules where it has created ambiguity about module behavior, especially since its impact differs module-to-module. As a result, we have replaced the enhanced_ha_enabled input in all submodules with more granular tuning knobs that have clearer behavior. This impacts the following submodules: kube_pg_cluster, kube_redis_sentinel, kube_vault_proxy, kube_argo_event_bus, and kube_argo_event_source.
- Nodes managed by EKS Node Groups (vs Karpenter) are now tainted with controller=true:NoSchedule. We have added this taint as pods scheduled on these nodes might be disrupted regardless of their PDBs during EKS upgrades. For some workloads this could cause a disruption. Most workload submodules have a new input, controller_nodes_enabled, that can be used to allow your workloads to tolerate this taint if desired.
- Previously we were conservative about enabling certain features by default in some of our submodules in order to ensure our modules would be compatible with non-Panfactum Kubernetes clusters. However, this is a very niche use case, and we have observed that this results in extra mental overhead for our normal users to avoid missing out on the core features provided by the Panfactum Stack. As a result:
  - The following flags are now enabled by default in kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, kube_pg_cluster, kube_redis_sentinel, and kube_workload_utility:
    - spot_nodes_enabled
    - arm_nodes_enabled
    - vpa_enabled
    - panfactum_scheduler_enabled
  - The following flags are now enabled by default in kube_deployment:
    - az_spread_preferred
  - The following flags are now enabled by default in kube_stateful_set:
    - az_spread_required
    - instance_type_spread_required
  - The following inputs are now enabled by default in all modules:
    - pull_through_cache_enabled
  - The following inputs are now enabled by default in all direct modules deployed after the autoscaling section in the bootstrapping guide:
    - vpa_enabled
    - panfactum_scheduler_enabled
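Because edge.24-09-04 renames many inputs, it may help to search your configuration for the old names before upgrading. The sketch below is one possible approach under stated assumptions: the function name `find_renamed_inputs` is hypothetical, the pattern list covers only a subset of the renames in this release (generic names such as `secrets` and `image` are left out to avoid noise), and inputs are assumed to live in files named `terragrunt.hcl`.

```shell
# Old input names and prefixes taken from the edge.24-09-04 rename list (subset).
OLD_INPUTS='ready_check_|liveness_check_|image_version|pod_annotations|read_only_root_fs|instance_type_anti_affinity_required|topology_spread_enabled|topology_spread_required|zone_anti_affinity_required|spot_instances_enabled|arm_instances_enabled|burstable_instances_enabled'

# Print file:line:content for every line that still uses an old name.
find_renamed_inputs() {
  local root="${1:-.}"
  # The leading group requires the name to start a word, so new names such as
  # extra_pod_annotations are not reported as matches for pod_annotations.
  find "$root" -type f -name 'terragrunt.hcl' \
    -exec grep -HnE "(^|[^A-Za-z0-9_])(${OLD_INPUTS})" {} + || true
}
```

Matches are only candidates for review; the correct replacement for each one depends on the module it belongs to.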
Added
- Adds built-in default downward-api integrations in all our workload submodules.
- All mounted ConfigMaps and Secrets in our workload submodules are now mounted as executable to make it easier to mount scripts.
Fixed
- Updates Karpenter and EBS CSI Controller to prevent any remaining edge cases where nodes were terminated prior to EBS volumes being detached which would result in six-minute delays for rescheduling stateful pods.
- Removes the RemoveDuplicates strategy in kube_descheduler as users expect to be able to schedule multiple pods of the same controller on the same node when they set host_anti_affinity_required to false.
edge.24-08-27
Breaking Changes
- We removed the ability to disable S3 backups in kube_pg_cluster. The backups have an extremely low cost impact and significantly improve the durability of data. Moreover, the continuous WAL archiving provided by the backups improves our system's ability to automatically recover in the case of failover events.
  Ultimately, we found that the risk of misuse (resulting in unexpected data loss or downtime) significantly outweighed any potential benefits gained by providing this functionality.
Added
- Added native support for restoring from database backups to the kube_pg_cluster submodule.
- Added automatic creation of an immediate base backup to the kube_pg_cluster submodule to ensure that new databases can be recovered all the way up to their point of creation.
Fixed
- Mitigated a rare scenario where disruption in the middle of a database failover would result in the PostgreSQL databases being unable to restart without manual intervention in the kube_pg_cluster submodule.
- Fixed an issue where pf-get-repo-variables would provide the wrong directory for the root of the repository when run inside a downloaded .terragrunt-cache directory.
edge.24-08-24
Fixed
- Addressed a couple of issues with the kube_authentik module:
  - authentik_core_resources will no longer fail to apply and end up in an invalid state when first created.
  - Authentik should no longer experience any downtime during database failover events.
edge.24-08-23
Fixed
- Correctly sets PgBouncer permissions on new PostgreSQL cluster creation in kube_pg_cluster.
edge.24-08-22
Breaking Changes
- The default behavior of kube_redis_sentinel was to use both Redis AOF and RDB for persistence. Unfortunately, using AOF concurrently with RDB negates Redis' ability to do partial resynchronizations after restarts and failovers. Instead, a full copy of the entire database must be transferred from the current master to replicas on every restart. This greatly increases the time-to-recover and incurs a high network cost.
  In fact, there is arguably no benefit to AOF-based persistence with our replicated architecture as new Redis nodes will always pull their data from the running master, not from their local AOF. The only benefit would be if all Redis nodes simultaneously failed with a non-graceful shutdown (an incredibly unlikely scenario).
  As a result, we have switched the module to use only RDB for persistence, and the redis_appendfsync input has been removed. The module still provides the ability to provide custom Redis configuration, so you can re-enable AOF if you want (though we would not advise it).
- token_lifetime_seconds has been changed to token_lifetime_hours in vault_auth_oidc to avoid a perpetual diff issue present in the Vault provider.
- Removed the daily backups from kube_velero as they were undocumented and had no realistic use case.
Added
- Adds a new submodule, kube_disruption_window_controller, which can be used to specify time-based disruption windows for disruption-sensitive workloads (e.g., databases). Disruption window capabilities have also been added to kube_pg_cluster and kube_redis_sentinel.
- Adds synchronous replication support to kube_pg_cluster via pg_sync_replication_enabled.
Fixed
- Addressed issue where pg_smart_shutdown_timeout could not be set to 0 in kube_pg_cluster without CNPG resetting it to 180.
- Fixed an issue in kube_velero where stale EBS snapshots were not being deleted.
- Added stricter disruption prevention to the Velero server in kube_velero as disrupting the server in the middle of a backup operation would cause it to fail and not be resumed.
edge.24-08-15
Breaking Changes
- pg_shutdown_timeout has been renamed to pg_smart_shutdown_timeout to better indicate its purpose in kube_pg_cluster. Additionally, the shutdown and failover logic has been overhauled. The new default immediately terminates running queries when a database pod is killed, but this reduces the downtime in the failover scenario from 60-120 seconds to under 5 seconds. Please see the module documentation for more information.
Added
- Adds the concept of passthrough parameters to wf_spec.
- Makes tf_apply_dir a Workflow parameter in wf_tf_deploy so that you only need a single instance of this module per cluster.
- Adds the ability to use templateRef to compose Workflows in wf_spec.
Fixed
- Fixed the working directory in wf_tf_deploy and wf_dockerfile_build to be inside the cloned repository.
- Addressed OOM errors when using resource templates with wf_spec.
edge.24-08-13
Breaking Changes
- pg_storage_increase_percent has been changed to pg_storage_increase_gb in kube_pg_cluster. This allows for more predictable storage autoscaling and optimal resource provisioning regardless of the current storage scale.
- pg_storage_gb has been changed to pg_initial_storage_gb in kube_pg_cluster. This better indicates that this value is only used during the initial database provisioning and has no effect thereafter.
- node_vpc_id, node_subnets, and node_security_group_id have been moved from kube_karpenter to kube_karpenter_node_pools in order to simplify the logic of assigning nodes to subnets, VPCs, and security groups. Additionally, we have removed Karpenter auto-discovery tags as they are no longer necessary.
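To illustrate why a fixed increment (pg_storage_increase_gb) is more predictable than a percentage (pg_storage_increase_percent), the sketch below compares the two growth rules for a single resize step. It only demonstrates the arithmetic; the function names are hypothetical, and the actual resize logic used by kube_pg_cluster may differ.

```shell
# Size after one resize step with a percentage-based increase (integer GB).
grow_by_percent() {  # usage: grow_by_percent <current_gb> <percent>
  echo $(( $1 + $1 * $2 / 100 ))
}

# Size after one resize step with a fixed increase in GB.
grow_by_gb() {  # usage: grow_by_gb <current_gb> <increase_gb>
  echo $(( $1 + $2 ))
}

# A 20% step adds 2 GB to a 10 GB volume but 200 GB to a 1000 GB volume,
# whereas a fixed 10 GB step adds the same amount at any scale.
grow_by_percent 10 20     # prints 12
grow_by_percent 1000 20   # prints 1200
grow_by_gb 1000 10        # prints 1010
```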
Added
- Adds new enhancements to the kube_pg_cluster module:
  - Better defaults and options for memory tuning
  - Provides the ability to set arbitrary PostgreSQL parameters
  - Provides the ability to set a custom backup schedule
  - Adds support for additional schemas via the extra_schemas input
- Adds another local retry for Terragrunt when providers produce an inconsistent final plan.
- Adds check for an updated direnv version to prevent issues when setting up the local devenv.
Fixed
- Added deterministic ordering to additional resources in authentik_core_resources.
- Fixed the following bugs in pf-env-bootstrap:
  - Would use a non-existent AWS profile for the .sops.yaml file.
  - Would not install all the platform checksums in the .terraform.lock.hcl files.
- amd64 nodes are now used when bootstrapping_enabled is true in aws_eks in order to allow certain bootstrapping tests (e.g., Cilium) to run successfully.
- Restores the pf-db-tunnel command to the devenv.
- pf-get-version-hash local now properly returns local without an error code.
- Updates the Panfactum image version in kube_constants to a version that is compatible with the latest pre-built workflows.
edge.24-08-12
Breaking Changes
- Repository variables must now be defined in a panfactum.yaml file located at the root of your repository instead of in your devenv.nix. Additionally, the variable names are no longer prefixed with PF_ and are lowercase.
  For example, env.PF_REPO_NAME in devenv.nix should now be defined as repo_name in panfactum.yaml.
  This change was made to make it easier to reference these values outside of local development contexts such as within CI pipelines where devenv.nix isn't loaded.
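The renaming rule for repository variables (drop the PF_ prefix, then lowercase) can be sketched as a small helper. The function name `to_panfactum_key` is hypothetical and only illustrates the rule; PF_REPO_NAME is the one example given in this release.

```shell
# Convert an old devenv.nix variable name to its panfactum.yaml key by
# dropping the PF_ prefix (if present) and lowercasing the remainder.
to_panfactum_key() {
  local name="${1#PF_}"
  printf '%s\n' "$name" | tr '[:upper:]' '[:lower:]'
}

to_panfactum_key PF_REPO_NAME   # prints repo_name
```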
Added
- We have provided two new addons, a Workflow Engine (Argo Workflows) and an Event Bus (Argo Events).
- We have created a guide and best practices for setting up CI / CD in the Panfactum Stack.
- To support the new addons, we are upgrading the following infrastructure modules to Beta status:
  - kube_argo: For deploying the Argo controllers
  - kube_argo_event_bus: For deploying an Argo EventBus
  - kube_argo_event_source: For deploying an Argo EventSource
  - kube_argo_sensor: For deploying an Argo Sensor
  - wf_spec: For creating an Argo Workflow specification
  - wf_tf_deploy: For creating an Argo WorkflowTemplate that deploys IaC modules
  - wf_dockerfile_build: For creating an Argo WorkflowTemplate that builds container images from Dockerfiles
- Adds pf-get-repo-variables which prints a JSON payload of all repository configuration variables with the appropriate defaults set.
edge.24-07-08
Breaking Changes
- We have made a small, breaking refactor of aws_eks to reduce unnecessary options that made onboarding and maintenance more difficult:
  - Most importantly, users will no longer be able to set the instance type and count for nodes in EKS node groups. This flexibility is unnecessary since node provisioning is handled by Karpenter and not EKS. Moving forward, there are just two static configurations that are guaranteed to work in all use cases: one for before autoscaling is installed and one for after. This is controlled via the new input, bootstrap_mode_enabled (default: false).
  - control_plane_version and controller_node_kube_version have been unified into a single variable called kube_version that applies to all subsystems.
  - controller_node_subnets has been renamed to node_subnets to indicate these subnets are used for all cluster nodes, not just the EKS node groups.
  - all_nodes_allowed_security_groups has been renamed to node_security_groups to align naming conventions.
- By default, PVCs created by controllers such as StatefulSets cannot be updated through their controller as their template (volumeClaimTemplates) is immutable (a Kubernetes limitation). This poses a challenge when needing to update PVC metadata such as annotations and labels. We have built a workaround to this (kube_pvc_annotator) and incorporated it in various Panfactum modules. Unfortunately, incorporating this enhancement requires redeploying StatefulSets.
  To complete this upgrade, perform the following steps:
  1. Create a Velero backup of the cluster by running velero create backup -w <backup_name> to recover in case of mistakes.
  2. The following StatefulSets need to be deleted in this order AND with kubectl delete --cascade=orphan AND immediately restored with a subsequent terragrunt apply to their defining module:
     - The Vault StatefulSet created by kube_vault
     - The Redis cluster StatefulSet for Authentik created by kube_authentik
     - The BuildKit StatefulSets created by kube_buildkit
     - Any StatefulSets you have provisioned with kube_stateful_set
     - Any Redis cluster StatefulSets you have provisioned with kube_redis_sentinel
     As long as you use --cascade=orphan and take care to minimize the time between the kubectl delete and terragrunt apply, there will not be any downtime during this operation.
  3. After completing this operation, you need to delete the backing PVCs from each module one at a time by deleting the PVC and then deleting its bound pod. The controller will then automatically provision a new PVC with the correct labels and annotations to take advantage of the new functionality.
     After deleting each pod, ensure that a new pod is automatically provisioned and becomes healthy before proceeding to the next. As long as you proceed one at a time, this will not cause any downtime or data loss.
  4. Delete the Velero backup you created in step 1 by running velero delete backup <backup_name>.
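The delete-and-reapply pair required for each StatefulSet in edge.24-07-08 can be wrapped in a small helper so that the gap between the two commands stays short. This is a minimal sketch, not an official Panfactum utility: the function name `recreate_statefulset` and the `DRY_RUN` switch are hypothetical, and the namespace, StatefulSet name, and module directory are placeholders you must supply. By default it only prints the commands.

```shell
# Print (DRY_RUN=1, the default) or execute the delete-and-reapply pair for
# one StatefulSet.
recreate_statefulset() {  # usage: recreate_statefulset <namespace> <name> <module_dir>
  local ns="$1" name="$2" module_dir="$3"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "kubectl delete statefulset $name --namespace $ns --cascade=orphan"
    echo "(cd $module_dir && terragrunt apply)"
  else
    # Orphan the pods so they keep running while the StatefulSet is absent.
    kubectl delete statefulset "$name" --namespace "$ns" --cascade=orphan
    # Immediately restore the StatefulSet from its defining module.
    (cd "$module_dir" && terragrunt apply)
  fi
}
```

Review the dry-run output first, then rerun with `DRY_RUN=0` for one StatefulSet at a time, following the order given in the upgrade steps.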
Added
- Adds kube_fledged to the core stack. The kube-fledged controller adds the ability to pre-pull images to every node to improve pod startup times for critical or frequently used containers such as the Linkerd proxy or database images. We provide instructions for installing this module here.
- Adds the kube_pvc_annotator submodule that will provision a CronJob to run pf-set-pvc-metadata against PVCs created by immutable templates. See the module documentation for potential use cases.
- Adds persistence_backups_enabled (default: true) to kube_redis_sentinel to support disabling EBS snapshot backups.
- Adds a new common variable, node_image_cache_enabled, to Panfactum modules that can be used to enable pre-pulling images to nodes via the kube_fledged operator.
- Adds the pf-buildkit-clear-cache command for removing any BuildKit caches not being used by an active image build job.
- Adds the pf-set-pvc-metadata utility command for syncing labels and annotations across groups of PVCs.
Fixed
- Fixes handling of public ECR registries in docker-credential-panfactum.
- Fixes handling of ECR token caching in docker-credential-panfactum.
- Fixes pf-get-open-port to be platform-agnostic.
- Fixes pf-get-version-hash to work with commit hash inputs.
- Fixes image paths in the Authentik dashboard for applications provisioned by Panfactum modules.
edge.24-07-01
Breaking Changes
- The input format to aws_ecr_repos has been reformatted to support better per-repository configuration. This should not require replacing any resources, but it will require updating your Terragrunt inputs.
- The following resources will no longer be tagged with the Panfactum version and commit hash, as changes to those tags cause unnecessary delays and disruptions during updates for little added value:
  - EC2 instances in EKS node groups generated by aws_eks
  - EC2 instances serving as NAT hosts in aws_vpc
  - KMS replica keys in aws_kms_encrypt_key
  - Pods created in kube_bastion
Added
- kube_buildkit has graduated to beta and is now ready for general consumption. This is the first stack addon that can be used to extend the behavior of the core stack. Installation and usage instructions can be found here.
- aws_ecr_repos now supports custom image expiration rules and both pull and push permissions.
- aws_ecr_public_repos has been added to support creating public ECR repositories.
- Adds ARM support in kube_bastion and kube_pvc_autoresizer. All core cluster components can now be run on both amd64 and arm64 nodes, allowing for optimal cost savings.
- Changes the default securityContext.fsGroupChangePolicy to OnRootMismatch for Pods created by Panfactum submodules in order to improve PVC mounting performance.
- pf-providers-enable now ensures that .terraform.lock.hcl files have all common platform checksums.
- Adds pf-get-terragrunt-variables which can be used to derive the Terragrunt variables that would be used if Terragrunt were run in the given directory.
- Adds pf-tf-delete-locks which can be used to bulk-release Tofu state locks.
- Adds pf-sops-set-profile which will update all sops-encrypted files in the given directory to use the indicated AWS profile for KMS operations. This can be used in CI pipelines to allow the CI user to access sops-encrypted files.
- (Alpha) Adds kube_argo_sensor and kube_argo_event_source submodules for deploying these core components of the Argo Events system.
- (Alpha) Adds the kube_workflow_spec submodule to help in defining production-ready Argo Workflows.
Fixed
- kube_aws_ebs_csi has been adjusted to ensure that PVCs are detached from nodes during node shutdown, preventing unnecessary delays in moving PVCs between nodes.
- kube_core_dns no longer accidentally includes the Vault provider.
- kube_ingress_nginx will no longer unnecessarily set browser security headers on 3xx responses or responses that do not have Content-Type headers.
edge.24-06-20
Breaking Changes
- kube_karpenter has upgraded the Karpenter version to v0.37. During this release cycle, the Karpenter team moved the CRDs required by Karpenter to a dedicated Helm chart to improve the upgrade ergonomics. Unfortunately, this introduces a few one-time manual steps that you must perform to enable the migration. Specifically, the following commands must be run against your cluster before applying the latest version of kube_karpenter:

    kubectl label crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh app.kubernetes.io/managed-by=Helm --overwrite
    kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-name=karpenter-crd --overwrite
    kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-namespace=karpenter --overwrite

- kube_karpenter_node_pools has a new input, node_labels, which defines what labels will be applied to generated nodes. The standard Panfactum labeling system will no longer apply to Karpenter nodes due to this upstream issue.
- The persistence_enabled option was removed from kube_redis_sentinel. Redis is now always deployed with persistence enabled. This decision was made because the cross-AZ network costs of re-instantiating Redis nodes without PVC storage dwarf the costs of the PVC storage (by a factor of 100x). As a result, there is no benefit to not periodically saving the Redis database to a persistent disk.
  To compensate for potential performance impacts, we have exposed another input, redis_appendfsync. Setting this to "no" will achieve the same performance as having persistence disabled. However, the default setting of "everysec" is likely sufficient for the vast majority of use cases and reduces the risk of data loss.
  Unfortunately, if you were previously running with persistence_enabled set to false, you will need to delete the Redis StatefulSets in order to apply the new module.
  In particular, this impacts the kube_authentik module. Before deleting the Redis StatefulSet for Authentik, ensure your Vault token is not expired as you will not be able to re-authenticate with Authentik while the Redis StatefulSet is removed.
  Since persistence_enabled should only have been set to false in scenarios where data retention was not important, this should be considered a safe operation. However, it will introduce a minor service disruption during the replacement period.
- aws_ecr_pull_through_cache_addresses has been refactored to improve the ergonomics of using the module. It now requires an input, pull_through_cache_enabled, and will output the correct registry names regardless of whether a pull-through cache is used.
Added
- kube_deployment, kube_stateful_set, kube_cron_job, and kube_pod have graduated to Beta status. They are now safe to use.
- Adds the pf-providers-enable command that will automatically inspect the source infrastructure module and enable the required providers in a module's module.yaml.
- Adds the pf-update-iac command that will update first-party infrastructure modules in the following ways:
  - Executes the templating directives.
  - Updates the ref in sourced Panfactum submodules to the commit hash of the devenv if the # pf-update annotation is provided. See the documentation for more details.
- Adds phone number validation in aws_account.
- Adds cors_enabled (default: false) input variable to kube_vault that can enable CORS handling.
  This can be useful when building web applications that interact with Vault in client-side JavaScript. By default, this will allow CORS requests from all sibling and child domains.
Fixed
- Addresses an issue in kube_authentik that prevented the SSO login pop-up from working.
- Implements custom CORS handling logic in kube_ingress that resolves issues in the default behavior provided by the NGINX ingress controller.
- Removes invalid failure cases when using pf-get-vault-token in Terragrunt and improves failure messaging.
- Fixes an issue that occurs when the kubernetes provider is enabled but the sourced module does not use the kubectl provider.
- Fixes failure cases in pf-env-scaffold and adds more debug logging.
edge.24-06-14
Added
- Adds kube_scheduler, an alternative Kubernetes scheduler that can be used to improve bin-packing of pods on nodes in the Kubernetes cluster. This allows for better, smaller node selection and our tests show an estimated 25-33% reduction in node costs when used. We provide instructions for installing it here.
- Adds panfactum_scheduler_enabled (default: false) input to most infrastructure modules. When enabled, will use the scheduler provided by kube_scheduler instead of the less-efficient EKS scheduler.
- If panfactum_scheduler_enabled is true, the kube_descheduler will automatically remove pods from low utilization nodes to allow the kube_scheduler to bin-pack them on other nodes.
Fixed
- Addresses a bug in the previous release that left kube_karpenter not deployable.
- Addresses an issue where nodes were limited to a hard cap of 29 pods.
- Configures Kubernetes nodes to use a fixed amount of system overhead rather than one that scales unnecessarily with node size.
edge.24-06-13
Added
- Updates kube_pg_cluster with many new variables for configuring PgBouncer. New variables are prefixed with pgbouncer_.
- Adds support for path_prefix to kube_vault_proxy (@mschnee).
- Adds new enhanced_ha_enabled input to many core modules (default true). Setting this to false will allow for additional cost savings (approximately $50 / month) in exchange for introducing a small possibility of temporary outages. We estimate that setting this to false reduces availability from 99.995% to 99.9%. This can be used to decrease costs in less critical clusters (e.g., development).
- Adds a Spot Data Feed to the aws_account module.
- Adds the kube_open_cost module for calculating the cost of workloads running on Kubernetes.
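For a sense of scale, the availability figures quoted for enhanced_ha_enabled can be converted into expected downtime per year. The sketch below is plain arithmetic over a 365-day year; the function name `downtime_minutes_per_year` is hypothetical, and the result says nothing about how any outage would be distributed over time.

```shell
# Expected downtime in minutes per year for a given availability percentage.
downtime_minutes_per_year() {  # usage: downtime_minutes_per_year <availability_percent>
  awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}

downtime_minutes_per_year 99.995   # prints 26.3
downtime_minutes_per_year 99.9     # prints 525.6
```

Under these assumptions, the estimate moves from roughly half an hour to just under nine hours of downtime per year.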
Fixed
- Addressed issue in aws_vpc where NAT nodes wouldn't restart if NAT setup failed with an exit code other than 1.
- Increased the memory floor of the Authentik server in kube_authentik to avoid OOM issues.
- Updates kube_authentik to allow showing Gravatar profile images.
- Updates kube_authentik to provide the necessary Permissions-Policy headers to allow use of WebAuthn devices.
- Correctly applies pod labels in kube_aws_lb_controller.
- Removes node preference defaults from kube_workload_utility that were preventing efficient node deprovisioning.
- Adjusts the VPA recommendation overhead from 30% to 15% to improve resource utilization.
- Fixes incorrect SCIM property mapping in authentik_aws_sso.
- Aligns pod labels, affinities, topologySpreadConstraints, and tolerations in kube_linkerd to conventions used in all other modules.
edge.24-06-08
Added
-
Updates aws_vpc to support new command
pf-vpc-network-test
that will verify network connectivity properties of the instantiated VPC. This allows us to simplify an otherwise complex validation step in the bootstrapping guide. -
Adds the
pf-env-bootstrap
command that automatically bootstraps the necessary resources to begin working with IaC in an environment. This replaces the manual steps that used to be a part of the bootstrapping guide. -
Adds new
extra_inputs
terragrunt variable that allows you to pass inputs to all modules in the current scope. -
Adds arm64 NodePools and arm64 support for the core components. This reduces the cost of running the base stack by $25 - 50 / month due to significantly better price / performance ratios for arm64 instances in AWS.
-
Sets
unhealthyPodEvictionPolicy
to AlwaysAllow
for all module PDBs. This allows the system to scale up more quickly when it is under resource pressure and pods are stuck in a temporary crash loop. -
Sets maximum node lifetime to 24h to force Karpenter to try to consolidate instances at least once per day.
Fixed
-
Addressed issue where the
aws-ebs-csi-driver
DaemonSet pods would not be properly terminated by Karpenter during node shutdown. This resulted in EBS volumes not being detached and introduced an unnecessary 6min delay when moving EBS volumes between nodes. -
Replaces most usages of
kubernetes_manifest
with kubectl_manifest
to avoid manifest type-parsing issues that prevent the use of dynamic values in manifests.
edge.24-06-06
Breaking Changes
- kube_trust_manager has been deprecated as its functionality was redundant with
kube_reflector. We are keeping the module
in the repo to support backwards compatibility, but it will be removed in the future. You should perform the following steps to remove it:
- Apply this release.
- Remove any dependency blocks to it in your
terragrunt.hcl
files. - Run
terragrunt destroy
on the module to remove it. - Delete the
bundles
CRD.
Added
-
aws_registered_domains can now set the contact type for each contact.
-
Allow users to reference availability zones by single character (e.g.,
a
in addition to the full name (e.g., us-east-2a
) in the aws_vpc module. -
The manual steps needed to reset new EKS clusters to a clean slate during the bootstrapping guide have been consolidated into a single new command,
pf-eks-reset
.
Fixed
-
Addressed issue in aws_vpc that caused a temporary, harmless error to crash the
terragrunt apply
on initial bootstrapping. -
Fixed issue where Cilium test suites would fail during bootstrapping due to a NetworkPolicy blocking the kube_core_dns module.
edge.24-06-04
Breaking Changes
-
The reloader deployment must be deleted before the next apply of kube_reloader. No inputs have changed.
-
The alpha module
kube_labels
has been removed in favor of the labels provided by kube_workload_utility. -
VPC flow logs in aws_vpc are now disabled by default as they can be fairly expensive and should only be used if you have a specific use-case in mind. They can be enabled by setting
vpc_flow_logs_enabled
to true
.
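If you relied on flow logs before this release, re-enabling them might look like the sketch below. Only the inputs block of the aws_vpc deployment is shown; your existing inputs stay as they are.

```hcl
# terragrunt.hcl of an aws_vpc deployment (existing inputs and other blocks omitted)
inputs = {
  # Now defaults to false; set explicitly to keep VPC flow logs
  vpc_flow_logs_enabled = true
}
```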
Added
-
Added new
pf-env-scaffold
script that takes care of setting up the PF_ENVIRONMENTS_DIR
in the bootstrapping guide section for setting up terragrunt. -
Added kube_workload_utility to make it easier to create uniform, production-hardened Pod specs that take advantage of all capabilities included in the Panfactum stack.
-
A new standard label
panfactum.com/workload
can be used to group replicated pods for the purpose of aggregating metrics. This is now applied in all core infrastructure modules. -
Added kube_constants that export static configuration values that can be useful when creating resources that run on clusters in the Panfactum stack.
-
kube_cert_manager will now automatically delete Certificate secrets if the Certificate is deleted.
-
aws_ses_domain now takes an optional input
smtp_allowed_cidrs
that restricts what IPs can use the generated SMTP credentials. This allows users to mitigate credential exfiltration attacks. We provide an example of how to use this here. -
The Vault login UI will now have the OIDC login as the default method.
-
Terragrunt will now automatically retry on some errors up to three times before exiting the process with a failure. This should address intermittent issues such as network disruptions or race conditions.
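A minimal sketch of restricting the SMTP credentials from aws_ses_domain is shown below. It assumes smtp_allowed_cidrs takes a list of CIDR strings; the address range is a placeholder from the documentation-reserved 203.0.113.0/24 block, not a recommended value.

```hcl
# terragrunt.hcl of an aws_ses_domain deployment (other inputs and blocks omitted)
inputs = {
  # Assumed shape: a list of CIDR strings; replace with your own egress ranges
  smtp_allowed_cidrs = ["203.0.113.0/24"]
}
```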
Fixed
-
.env
files are now properly loaded into the shell environment and changes will trigger fast reloads instead of full devenv re-evaluations. -
Temporarily adds
GIT_CLONE_PROTECTION_ACTIVE=false
to the shell environment in order to address this issue. Note that this only disables new bleeding edge security features which were accidentally shipped in a broken state. -
Adjusts base resource requests of core infrastructure modules to prevent temporary OOM errors when bootstrapping before the VPA takes effect.
-
kube_authentik now respects
log_level
input. -
Sets
max_history
to 5
for all Helm charts to prevent overloading the Kubernetes API server with an ever-growing amount of historical Helm deployments.
edge.24-06-02
Breaking Changes
- Upgraded to devenv 1.0. As a part of this upgrade,
.env
file values can no longer be referenced directly inside .nix
files.
Added
- Updated kube_redis_sentinel to automatically limit client buffer size to prevent OOM issues when processing very bursty traffic.
- Added
pf-update
command that runs all the repository scaffolding commands at once.
Fixed
-
Addressed an issue that caused updates to the local devenv to take at least 10 minutes to rebuild on macOS. Rebuilds should now be 10-15x faster, but they will still take about 45 seconds at minimum. Note that this only impacts rebuilds and not normal direnv load times, which should still be instant.
This is a known limitation of upstream nix's derivation evaluation caching when using flakes. We expect this to be addressed when flakes reach stability.
-
Added missing defaults for
PF_ENVIRONMENTS_DIR
and PF_IAC_DIR
. -
Resolves an issue where devenv warnings could not be resolved during the initial bootstrapping guide.
-
Added extra validation for the terragrunt variable
extra_tags
. Invalid characters will now be replaced with .
for both keys and values for both Kubernetes labels and AWS tags. -
Fixed some core components that were using all Kubernetes labels for
labelSelector
matching rules which prevented Karpenter from autoscaling when extra_tags
was provided. This previously manifested as the error spec.requirements: Too many: #: must have at most 30 items
. -
Added extra constraints to kube_external_dns to prevent it from attempting to query zones that it isn't managing.
-
Prevented kube_external_dns from excluding parent domains of included domains.
edge.24-05-30
Breaking Changes
- The default for
vault_storage_size_gb
in kube_vault has been changed from 20
to 2
in order to improve resource utilization. If you created Vault with the old default, you will need to manually set vault_storage_size_gb
to 20
as volume sizes cannot be reduced after creation.
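For an existing Vault created under the old default, pinning the previous size might look like this sketch. Only the inputs block of the kube_vault deployment is shown.

```hcl
# terragrunt.hcl of an existing kube_vault deployment (other inputs and blocks omitted)
inputs = {
  # Keep the old default of 20; volumes cannot shrink to the new default of 2
  vault_storage_size_gb = 20
}
```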
Added
-
(Alpha) Added the Loki logging backend via kube_logging and the Alloy log collector via kube_alloy.
-
The PVC Autoresizer has been added via the kube_pvc_autoresizer module in order to automatically expand EBS volumes as they fill up. We provide the guide for deploying it here.
- Added validation for phone number format in aws_registered_domains. (@wesbragagt)
Fixed
- Resolved issue where scheduling constraints could not be resolved for components deployed before Karpenter (#41)
edge.24-05-23
Breaking Changes
-
We have removed the EKS CoreDNS addon and replaced it with the kube_core_dns module in order to provide better guarantees about the behavior of DNS in the Panfactum stack. In order to migrate:
-
Add the
dns_service_ip
input to aws_eks deployments by following this guide. Double check that the dns_service_ip
is the same IP as defined by kube-system/kube-dns
. -
Additionally, set
core_dns_addon_enabled
to true
. -
Apply the updated
aws_eks
module. -
Add the
cluster_dns_service_ip
input to your kube_karpenter_node_pools module like this, and re-apply the module. Ensure that all of your nodes have been replaced with the new configuration. -
Deploy
kube_core_dns
by following this guide. Note that this deployment will fail as the original addon service is still running and the IP is already taken. -
Delete
kube-system/kube-dns
and re-apply kube_core_dns
. Note that while the service is deleted, DNS will be temporarily unavailable in your cluster. -
Once you've validated that DNS is working in the cluster, remove the
core_dns_addon_enabled
input from the aws_eks
module and re-apply.
-
We have stabilized the label selectors in kube_pod but this requires one final label update for already-deployed Deployments. This will cause re-applies of kube_bastion to fail (and any first-party modules that rely on kube_deployment). To resolve, you must first manually delete the
bastion/bastion
deployment (and all other deployments created by kube_deployment).
- kube_pg_cluster has two
new flags,
pgbouncer_read_only_enabled
(default false
) and pgbouncer_read_write_enabled
(default true
), which will enable the r
and rw
poolers, respectively. This will enable users to better control what is deployed so as not to have idle resources. This is a breaking change as pgbouncer_read_only_enabled
is set to false
by default.
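If your workloads already connect through the read-only pooler, keeping it deployed after this release might look like the sketch below. Only the inputs block of the kube_pg_cluster deployment is shown.

```hcl
# terragrunt.hcl of a kube_pg_cluster deployment (other inputs and blocks omitted)
inputs = {
  # Now defaults to false; set to true to keep the r pooler
  pgbouncer_read_only_enabled = true

  # Defaults to true; shown only for completeness
  pgbouncer_read_write_enabled = true
}
```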
Added
-
(Alpha) We've added a monitoring stack kube_monitoring which includes HA Prometheus, the Prometheus Operator, Thanos metrics storage on S3 (with deduplication, caching, and down-sampling), the Node Exporter, kube-state-metrics, Alertmanager, and Grafana (with SSO enabled and 20+ custom dashboards).
Additionally, most modules now have an additional
monitoring_enabled
(default false
) flag that can be turned on to begin shipping data to Prometheus for viewing and querying via Grafana. -
(Alpha) kube_cilium now has a new debugging mode,
hubble_enabled
(default false
), that will capture extensive TCP-level metrics about the cluster as well as expose a debugging UI via HTTPS. -
(Alpha) kube_linkerd now deploys Linkerd Viz when
monitoring_enabled = true
. This provides a service mesh dashboard and the ability to capture and introspect raw HTTP requests sent in realtime. -
(Alpha) We've added the Argo Workflow engine to the stack via the kube_argo module. This will serve as the basis for the future, integrated CI / CD systems and can also be used to process arbitrary events from event queues such as AWS SNS/SQS and Kafka. (@jlevydev)
- A new module, kube_vault_proxy, can be used to add SSO to web assets that do not have integrated SSO. The module's SSO is configured out-of-the-box to work with the cluster's Vault instance.
-
We've included a new Kubernetes provider, kubectl, to augment the original kubernetes provider. The
kubectl
provider allows more flexibility in deploying raw Kubernetes manifests, which is required by our templating system. This provider will automatically be enabled when the kubernetes
provider is enabled, so no additional changes are required from end users. -
kube_redis_sentinel has a new flag,
lfu_cache_enabled
, that will configure the Redis cluster to automatically evict records under memory pressure based on an approximated Least Frequently Used algorithm.
- kube_ingress now takes an
extra_configuration_snippet
variable which allows for additional commands in the NGINX configuration snippet.
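As one example of the new flags in this list, turning on monitoring for kube_linkerd (which also deploys Linkerd Viz) might look like the following sketch. It assumes the alpha kube_monitoring stack is already running in the cluster; only the inputs block is shown.

```hcl
# terragrunt.hcl of a kube_linkerd deployment (other inputs and blocks omitted)
inputs = {
  # Defaults to false; true ships metrics to Prometheus and deploys Linkerd Viz
  monitoring_enabled = true
}
```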
Changed
- Added the standard Restricted Reader role to Vault instances (
rbac-restricted-reader
and updated vault_auth_oidc to take restricted_reader_groups
. Since cluster resources authenticate with SSO via Vault, this allows restricted readers to access additional cluster resources such as Grafana and Argo Workflows (albeit in a locked-down, read-only mode).
-
Disabled evictions of database pods based on max lifetimes. This improves the stability of databases deployed by Panfactum modules.
-
After completing the bootstrapping guide, we now recommend that users update their
aws_eks
cluster modules to have controller_node_count
set to 1
and controller_node_instance_types
set to ["t3a.medium"]
. This will decrease the costs of the base cluster by about 40% without impacting cluster availability or resiliency. The single remaining node is used primarily as a place for Karpenter to run (Karpenter cannot run on instances that it itself provisions). -
kube_karpenter now only deploys a single instance of Karpenter and enforces that it is run on a controller node. This reduces the overall resource utilization of this fairly heavyweight controller.
- Kubernetes labels applied via the
extra_tags
terragrunt input are now sanitized for valid characters automatically (invalid characters are replaced with .
). (@mschnee)
-
Added scheduling constraints to prevent critical workloads from scheduling all pods on the same instance type in order to minimize the possibility of disruption on events that only affect one instance type (e.g., spot node preemption).
-
Changes many other non-critical core controllers to only have a single replica when 100% uptime is not necessary in order to reduce resource utilization in the Stack.
-
Updates many controller deployments to use the Recreate deployment strategy to improve timing and efficiency of applying Panfactum upgrades.
- kube_vpa has a new
history_length_hours
(default 24
) that will control how far back it will analyze metrics for computing its recommendations.
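The post-bootstrapping recommendation for aws_eks in this list can be sketched as the fragment below. The two values come from the note above; apply them only after the bootstrapping guide is complete, and leave your other aws_eks inputs unchanged.

```hcl
# terragrunt.hcl of an aws_eks deployment (other inputs and blocks omitted)
inputs = {
  # A single controller node, kept mainly as a place for Karpenter to run
  controller_node_count          = 1
  controller_node_instance_types = ["t3a.medium"]
}
```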
Fixed
- PVCs for postgres instances were inadvertently created with duplicated entries for accessModes. This had no functional impact but confused monitoring systems. The issue has been fixed, but the fix will not retroactively adjust existing PVCs as they are immutable.
edge.24-05-15
Breaking Changes
- kube_vault now takes
vault_domain
as an input instead of environment_domains
. This change was made as having multiple domains for Vault is incompatible with using Vault as an intermediary IdP.
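Migrating to the new input might look like the sketch below. The domain is a hypothetical placeholder; use the single domain your Vault instance should be served from and remove the old environment_domains input.

```hcl
# terragrunt.hcl of a kube_vault deployment (other inputs and blocks omitted)
inputs = {
  # Replaces environment_domains; the value here is a placeholder
  vault_domain = "vault.example.com"
}
```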
Added
-
New kube_reflector module for deploying the Reflector in order to synchronize ConfigMaps and Secrets across namespaces. Created a new guide section for deploying the module as a part of the foundational Stack.
-
pg_shutdown_timeout
variable added to kube_pg_cluster to control how long the postgres instances will wait for active connections to close before shutting down.
Fixed
- Fixed an issue where simultaneous, graceful shutdown of all postgres nodes in a kube_pg_cluster would cause unnecessary downtime when the primary was running on a spot instance.
edge.24-05-12
The initial edge release of the Panfactum stack!