Edge Releases
Edge releases do not receive patches and make no backwards compatibility guarantees. Learn more here.
To use stable Panfactum releases, please see our available licenses.
To upgrade your Panfactum stack version, please follow the instructions in the upgrade guide.
Unreleased
Added
- Adds the ability to pass extra service annotations through the kube_deployment module (see the sketch below).
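A minimal sketch of how this might look in first-party IaC. The input name `extra_service_annotations` is an assumption (this entry does not name the input), and the module's other inputs are omitted, so check the kube_deployment documentation for the released interface:

```hcl
module "my_service" {
  source = "${var.pf_module_source}kube_deployment${var.pf_module_ref}"

  # ... other kube_deployment inputs (namespace, containers, etc.) omitted ...

  # Hypothetical input name: annotations to add to the Service created by the module
  extra_service_annotations = {
    "example.com/team" = "platform"
  }
}
```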
Fixed
- Fixes a bug that prevented kube_cert_manager from being deployed when `self_generated_certs_enabled` was set to `true`.
- Fixes the `aws_eks` subnet validation check that prevented module deployment in some valid scenarios.
edge.25-01-09
Added
- kube_policies now has `common_env` and `common_secrets` inputs that inject environment variables into all containers in the cluster (see the sketch below).
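A minimal sketch of what setting these inputs might look like, assuming they accept simple key/value maps like the `common_env` and `common_secrets` inputs on the workload submodules; consult the kube_policies documentation for the released schema:

```hcl
module "policies" {
  source = "${var.pf_module_source}kube_policies${var.pf_module_ref}"

  # Plain environment variables injected into every container in the cluster
  common_env = {
    LOG_FORMAT = "json"
  }

  # Environment variables injected into every container, sourced from a Secret
  common_secrets = {
    GLOBAL_API_TOKEN = var.global_api_token
  }
}
```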
Fixed
- Pins Bottlerocket OS AMIs to pre-tested versions as AWS occasionally publishes breaking AMI changes that can crash nodes in the cluster.
- Fixes the pre- and post-condition check for the `aws_eks` module when `sla_target` is set to 1.
edge.25-01-04
Breaking Changes
- This release adds some additional functionality to Vault which requires vault_auth_oidc to be upgraded before any other module.
- The `kube_rbac` and `kube_priority_classes` modules have been removed per the deprecation notice in `edge.24-12-13`.
Added
- Adds a module for deploying Grist, a next-generation spreadsheet system: kube_grist.
- Adds an alternative mechanism for creating dynamically-rotated AWS credentials for when IRSA is not an option: kube_aws_creds.
- kube_deployment and kube_stateful_set now provide native support for voluntary disruption windows.
Fixed
- Addressed issue where pods could not be created if all Kyverno admission controllers are disrupted simultaneously. As the Kyverno admission controller is itself composed of pods, this would result in a cluster deadlock that required manual intervention. This degenerate behavior has been fully resolved in this release.
- Addressed issue where the Kubernetes API server address was set incorrectly when deploying kube_cilium with wf_tf_deploy.
- Helm charts deployed by Panfactum modules will not be automatically rolled back on deployment failure, which should prevent several failure cases where manual intervention would have otherwise been necessary.
- The StatefulSets in kube_nats no longer need to be redeployed after each update of resource tags / labels.
- `pf-tunnel` now binds to `127.0.0.1` instead of `localhost` to resolve potential connectivity problems on diverse operating systems.
edge.24-12-19
Breaking Changes
- Introduces the concept of SLA Target Levels. This makes it easier to (a) know what uptime you can expect from Panfactum deployments, and (b) adjust the cost-to-availability tradeoff for entire subsections of the deployment.

  This feature comes with the following changes (see the sketch after this list):

  - Provides a new Terragrunt variable, `sla_target`, that can be used to set the target level for a particular scope (e.g., environment, region, module). It defaults to `3`.
  - The default behavior for Panfactum modules will now automatically adjust to the provided `sla_target`.
  - The `enhanced_ha_enabled` input has been removed from all modules. The previous behavior when `enhanced_ha_enabled` was set to `true` (the default) is now equivalent to setting `sla_target` to `3` (the default).
- This release upgrades the following terraform provider versions, which will need to be updated in first-party IaC:
  - `pf`: 0.0.5 -> 0.0.7
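As a sketch of the module-scope case, `sla_target` can be pinned for a single deployment through its Terragrunt inputs; setting it for an entire environment or region would instead go in that scope's configuration file. The surrounding wiring is omitted and the value shown is illustrative:

```hcl
# terragrunt.hcl for a non-critical cluster's aws_eks deployment
# (include/source blocks omitted; they follow your existing scaffolding)
inputs = {
  # Trade some availability for lower cost; defaults to 3 if unset
  sla_target = 1
}
```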
Added
- Adds support for arbitrary path rewriting in kube_ingress, kube_aws_cdn, aws_cdn, and aws_s3_public_website.
- wf_dockerfile_build now supports sourcing base images from private ECR repositories.
- Adds `not_found_path` to aws_s3_public_website to facilitate specifying the asset to load when no object exists at the requested path.
- Adds `custom_error_responses` to aws_cdn which can be used to overwrite error responses from the upstream origin.
Fixed
- Addressed conflicting PDB issue with the kube_redis_sentinel module that prevented vertical autoscaling from working.
- Standard Panfactum environment variables for Kubernetes workloads are now injected before user-defined environment variables to make them available for use in dependent variables.
- Standard Panfactum environment variables for Kubernetes workloads will no longer override user-defined environment variables.
- Addressed issue where the CRDs in kube_aws_lb_controller were not automatically upgraded.
- Fixed incorrect AWS permissions in kube_aws_lb_controller.
edge.24-12-13
Breaking Changes
- The `kube_rbac` module has been deprecated and will be removed in the next release. Please destroy any deployments of it after upgrading aws_eks.

  Kubernetes access control has now been moved to the aws_eks module using EKS access entries. This provides several benefits:

  - Kubernetes RBAC now works out-of-the-box, making cluster bootstrapping simpler.
  - Accidental lock-out is now fully prevented.
  - One fewer location where custom SSO roles need to be synchronized.
- The `kube_priority_classes` module has been consolidated with kube_policies in order to remove a superfluous bootstrapping step. Please destroy any deployments of it immediately before upgrading kube_policies.
- `eks_cluster_name` is no longer an input to most submodules as it is now dynamically resolved based on which cluster you are deploying to.
- This release upgrades the following terraform provider versions, which will need to be updated in first-party IaC:
  - `pf`: 0.0.4 -> 0.0.5
  - `authentik`: 2024.6.1 -> 2024.8.4
Changed
- Upgrades Authentik in kube_authentik to 2024.8.2 (release notes).
Fixed
- Adds correct permissions to allow users to retry specific Workflow nodes in Argo Workflows.
- Adds automatic NATS connection retries to Argo Events components.
- Addresses issue in wf_dockerfile_build where the `git_ref` could not be a branch name.
edge.24-12-11
Breaking Changes
- All terraform provider versions in Panfactum modules have been upgraded to new values, so any first-party IaC modules that utilize Panfactum submodules will need to have their provider versions upgraded as well.
- This release upgrades many components of the Panfactum Stack. Generally, none of these upgrades should require any action on your part. However, see the release notes for each component for more information:
  - Kubernetes: 1.29 -> 1.30
  - Authentik: 2024.4.2 -> 2024.6.4
  - Argo Workflows: 3.5 -> 3.6
  - Karpenter: 1.0 -> 1.1
  - Redis: 7.2 -> 7.4
  - Velero: 1.13 -> 1.15
  - VPA: 1.1 -> 1.2
  - PostgreSQL: 16.4 -> 16.6
Added
- aws_eks and kube_karpenter_node_pools can now configure each node's root volume size via `node_ebs_volume_size_gb` (see the sketch below).
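A minimal sketch of setting this on kube_karpenter_node_pools (the value is illustrative and the module's other required inputs are omitted):

```hcl
module "node_pools" {
  source = "${var.pf_module_source}kube_karpenter_node_pools${var.pf_module_ref}"

  # ... required networking and other inputs omitted ...

  # Give every provisioned node a 100 GB root EBS volume
  node_ebs_volume_size_gb = 100
}
```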
Fixed
- Addresses issue where non-HA clusters could not recover when many nodes are disrupted at once.
edge.24-12-10
Breaking Changes
- This release changes the way that public ingress TLS certificates are provisioned in order to avoid hitting rate limits on large clusters. This architectural update requires that the modules be upgraded in the following order:
  1. kube_cert_issuers
  2. kube_ingress_nginx. To avoid service disruptions, you MUST wait until all the old NGINX pods have been fully terminated before proceeding.
  3. The remainder of the modules may be updated in any order.
Fixed
- Adds the `bootstrap_cluster_creator_admin_privileges` input to aws_eks to provide backwards compatibility with clusters that were created with this field set to `true`.
- Temporary Authentik disruptions caused by PostgreSQL database failovers have been mitigated.
edge.24-12-05
Breaking Changes
- This release contains a major version upgrade to Linkerd.

  This upgrade removes the need for the privileged `proxy-init` initContainer to be injected into every pod as the initialization logic is now completed once per node. This should reduce pod startup times by 5-20 seconds and improve overall security by removing the need to run a privileged container in each pod.

  To upgrade with no downtime, you MUST update the modules in the following order:

  1. kube_kyverno
  2. kube_policies
  3. kube_cilium
  4. kube_linkerd
  5. aws_eks
  6. kube_karpenter_node_pools
  7. The remainder of the modules may be updated in any order.
- The NATS backend for kube_argo_event_bus has been replaced with our enhanced NATS module, kube_nats. This provides improved availability, security, observability, and performance.

  To apply this module, you will need to manually delete any existing `EventBus` resources in your cluster, or you will receive an error. You will also need to delete any associated `EventSource` or `Sensor` resources before deleting the `EventBus`, or the `EventBus` deletion will be blocked.

  Deleting an existing EventBus will cause any unprocessed events to be deleted. Make sure that you have no pending events before performing this upgrade.
- The `kube_fledged` and `kube_reflector` modules have been removed (they were deprecated in `edge.24-11-13`).
- The `images` input of kube_node_image_cache has been updated to take a list of image configuration options rather than a list of image strings.

  Additionally, `node_image_cached_enabled` has been removed as a top-level input from Panfactum submodules (e.g., kube_deployment) as image cache settings can now be configured on a per-container basis.
Changed
- Added support for the NATS Jetstream message broker via a new submodule, kube_nats. This release also adds NATS integration with the devShell tooling, including adding the `nats` CLI and updating `pf-db-tunnel` to support connecting to NATS clusters.
- aws_eks now launches with `arm64` nodes when `bootstrap_mode_enabled` is `true` as we have resolved the remaining issues that prevented `arm64` from being used during bootstrapping.
- aws_eks now has EKS access entries enabled.
- aws_eks now has ARC Zonal Shift enabled if running nodes in multiple subnets.
- kube_ingress_nginx now has ARC Zonal Shift enabled.
- kube_vault now schedules pods exclusively on `arm64` nodes in order to support the integration of external secret plugins.
Added
- The kube_node_image_cache_controller has been updated with a "prepull" component that automatically pulls cached images in parallel as soon as a node launches. Previously, images were pulled serially, which resulted in significant delays when many large images were cached.
- The kube_descheduler will now automatically recreate pods that were not run through the Kyverno policy engine. This provides protection in case the Kyverno admission controller is ever offline.
- Images provided to and/or used by Panfactum submodules (e.g., kube_deployment, kube_pg_cluster, etc.) are now cached by default.
- Additional annotations and labels can now be added to the controllers created via kube_deployment, kube_stateful_set, kube_daemon_set, and kube_cron_job.
- The `kyverno` CLI has been added to the devShell.
- Adds support for dynamically generated labels in wf_spec via `labels_from_parameters` and `labels_from`.
- kube_argo_event_source now creates a ServiceAccount and outputs its name. This can be used to assign AWS (or other) permissions to the EventSource pods.
- Adds the ability to configure temporary storage space size in wf_tf_deploy.
Fixed
- The kube_node_image_cache_controller will now deduplicate images that are added to the cache by kube_node_image_cache.
- We have adjusted the Kyverno settings to improve the overall stability of the mutation engine.
- Resolved slow Vault startup times for Vault databases larger than 100MB in kube_vault.
- BuildKit cache PVCs are now excluded from Velero backups as they consume a lot of storage and are safe to delete.
- Fixed root user access provisioning in kube_rbac.
- Addressed issue where the Descheduler was not replacing pods that were older than the max lifetime.
- Addressed issue where resetting one's own password via Authentik caused an unauthorized error.
- Fixed mount permissions in wf_spec.
edge.24-11-13
Breaking Changes
- We have added the Kyverno policy engine as a core part of the Panfactum Stack. Kyverno allows us to install rules onto the cluster to automatically generate, mutate, or validate resources based on a powerful, Kubernetes-native expression language. This provides several benefits:

  - Provides a unified control plane for adding functionality that previously required managing additional controllers or custom scripts.
  - Allows us to simplify several parts of our IaC modules by offloading resource management to global Kyverno policies.
  - Allows us to add Panfactum-compatible, sensible defaults to Kubernetes resources that are not created by Panfactum modules.
  - Allows users to add management logic to their clusters that was previously only possible by building and deploying custom controllers. See the example policies.

  You must install Kyverno by following this new bootstrapping guide section. Many modules now depend on Kyverno and will not function without it.
- `kube_fledged` has been removed in favor of a new node-local image caching mechanism built by Panfactum on top of Kyverno. The new mechanism has the following benefits over `kube_fledged`:

  - The node's image cache will be created immediately when a node launches, concurrently with other node setup steps.
  - Cached images will never be removed from the node's image store.
  - Overall controller performance is significantly improved, reducing the overall resource requirements for caching.
  - The caching mechanism no longer generates pods that prevent Karpenter from disrupting underutilized nodes.

  To install the new mechanism, please follow this guide. To start caching images, you may use the new kube_node_image_cache module. Additionally, we provide a new input to our submodules such as kube_deployment called `node_image_cached_enabled` that, when enabled, will automatically add the submodule's images to the node-local image cache.

  `kube_fledged` must be removed from your clusters before upgrading to the next version as it will no longer be available in the next release. It should not be removed until Kyverno is installed.
- `kube_reflector` has been removed in favor of a new syncing mechanism built by Panfactum on top of Kyverno.

  - To sync ConfigMaps, use kube_sync_config_map.
  - To sync Secrets, use kube_sync_secret.

  `kube_reflector` must be removed from your clusters before upgrading to the next version as it will no longer be available in the next release. It should not be removed until Kyverno is installed.
- Vertical pod autoscaling now works for both the PostgreSQL clusters and PgBouncer deployments created by the kube_pg_cluster submodule. The following variables have been removed:

  - `pg_memory_mb`
  - `pg_cpu_millicores`

  and the following variables have been added:

  - `pg_minimum_memory_mb`
  - `pg_maximum_memory_mb`
  - `pg_minimum_cpu_millicores`
  - `pg_maximum_cpu_millicores`
  - `pgbouncer_minimum_memory_mb`
  - `pgbouncer_maximum_memory_mb`
  - `pgbouncer_minimum_cpu_millicores`
  - `pgbouncer_maximum_cpu_millicores`

  This change also resolves issues where some values for `pg_cpu_millicores` caused a permanent reconciliation conflict.
- All pods in Panfactum clusters will now automatically tolerate the `arm64` and `spot` node taints regardless of whether they were created by Panfactum modules (this was already the default for Panfactum modules). To disable these tolerations for a specific pod, you must add the `panfactum.com/arm64-enabled = "false"` or `panfactum.com/spot-enabled = "false"` labels, respectively (see the sketch below).
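For pods that are not created by Panfactum modules, the labels go directly on the pod template. For Panfactum-managed workloads, a minimal sketch using the `extra_pod_labels` input (referenced elsewhere in this changelog) could look like the following; the rest of the module block is omitted:

```hcl
module "legacy_worker" {
  source = "${var.pf_module_source}kube_deployment${var.pf_module_ref}"

  # ... other kube_deployment inputs omitted ...

  # Opt this workload out of the cluster-wide spot and arm64 tolerations
  extra_pod_labels = {
    "panfactum.com/spot-enabled"  = "false"
    "panfactum.com/arm64-enabled" = "false"
  }
}
```

Note that Panfactum submodules also expose `spot_nodes_enabled` and `arm_nodes_enabled` inputs, which are usually the simpler way to control this for module-managed workloads.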
Changed
- We have upgraded the CNPG operator in kube_cloudnative_pg to 1.24 (up from 1.23). This adds additional stability improvements during failover events.

  After performing this upgrade, you MUST use the new kube_pg_cluster submodule as well. Old versions are no longer compatible.
- We have upgraded the default PostgreSQL version in kube_pg_cluster to 16.4 (up from 16.2). This upgrade should not require any action on your part, but be sure to pin your PostgreSQL version if you do not want to be automatically upgraded.
Added
- Adds a new submodule, kube_daemon_set, for creating Kubernetes DaemonSets.
Fixed
- Added a Kyverno rule that forces Linkerd sidecars to terminate prior to the pod's `terminationGracePeriodSeconds` to ensure that pods are not marked as "failed" by controllers such as Argo if the main container has a TCP connection leak.
- Resolved unnecessary log noise that was introduced in the last release when running Terragrunt commands.
- Adjusted the Cilium deployment to address edge cases where Cilium would not successfully launch new nodes after a complete zonal or cluster outage.
edge.24-10-25
edge.24-10-23
Breaking Changes
- The required Nix version to use the Panfactum Stack has been updated to `>= 2.23` (up from `>= 2.18`). The latest Nix versions include performance improvements required to make local development ergonomic on all operating systems. Additionally, we have added a check to the loading script (`.envrc`) to ensure that users have a compatible Nix version installed.

  If you installed Nix using the Determinate Systems installer, see these upgrade docs.
Changed
- Panfactum modules are now downloaded as gzipped tarballs from an HTTPS server rather than requiring a full git clone of the Panfactum Stack repository. This should dramatically improve the initialization speed of modules and reduce network bandwidth by over 90%. This is an internal refactor that should not have any impact on how you use Panfactum modules.
Added
- Added a new module, aws_s3_public_website, to enable users to serve files directly from an S3 bucket via CloudFront.
- aws_cdn can now handle CORS headers on behalf of the origin servers.
- aws_cdn now uses 10x more efficient CloudFront functions for request / response mutations.
Fixed
- Deploying modules that use Helm charts hosted in ECR (e.g., kube_karpenter) will now use the appropriate credentials.
- Upgraded Argo Workflows to fix some issues related to workflow timeouts being ignored.
edge.24-10-21
Breaking Changes
- In all Panfactum submodules, `instance_type_spread_required` has been renamed to `instance_type_anti_affinity_required` as we have had to replace TopologySpreadConstraints with AntiAffinity rules to work around this issue with Karpenter. This change will ensure that Karpenter will not randomly create massive nodes.
- To add further protection against Karpenter provisioning extremely large nodes, we have added two variables to kube_karpenter_node_pools, `max_node_memory_mb` and `max_node_cpu`, that limit the maximum size of nodes that can be provisioned (see the sketch below).

  The default limits are 64GB of memory and 32 CPUs. If you require nodes larger than these limits, you will need to adjust these new inputs.
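A minimal sketch of raising these limits for a cluster that genuinely needs larger nodes (values illustrative; other required inputs omitted):

```hcl
module "node_pools" {
  source = "${var.pf_module_source}kube_karpenter_node_pools${var.pf_module_ref}"

  # ... other required inputs omitted ...

  # Allow nodes up to 128 GB of memory and 64 CPUs (defaults: 64 GB / 32 CPUs)
  max_node_memory_mb = 131072
  max_node_cpu       = 64
}
```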
Fixed
- Prevents Karpenter from provisioning bare metal instances, as we have observed issues with them.
- Removes memory limits on the Cilium node agent in kube_cilium as these limits can cause Cilium to fail to launch on larger node sizes. This is due to the fact that Cilium's memory requirements increase proportionally to the size of the node, but the VPA does not take this into account when assigning limits.
- Upgrades kube_ingress_nginx so that it can run on nodes with a large number of CPU cores.
- EBS-backed PVs with many large files took a long time to mount due to this issue with Bottlerocket OS (our underlying node OS). We have added the recommended remediation and now PVs should mount nearly instantly. Note that this fix will not apply to existing PVs, only new ones.

  To apply the fix to existing PVs, you will need to manually add the following mount option to their manifests:

  ```yaml
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: XXXX
  spec:
    mountOptions:
      - context="system_u:object_r:local_t:s0"
  ```
edge.24-10-18
Breaking Changes
-
We have removed devenv from the Panfactum Stack and now use plain nix flakes to manage the local development shell (aka the "devShell"). We did not use the vast majority of the features in devenv, and its removal comes with a couple key improvements:
-
Greatly increased performance on macOS. Initial installation should now take ~ 5 minutes (down from 45+). Additionally, opening the devShell after initial installation should now be instant.
-
More control and flexibility of the Panfactum setup which will allow us to better implement future Panfactum features.
However, this does come with a few key changes that you must perform manually:
-
The syntax for your
flake.nix
has changed.Before:
{ inputs = { # Change 'nixos-23.11' to whichever cut of the nixpkgs repository # you want to use in your project. This will NOT impact the Panfactum stack at all. # For available versions, see https://github.com/NixOS/nixpkgs # We recommend using the version that is supported here: # https://search.nixos.org/packages (updated every 6 mo) pkgs.url = "github:NixOS/nixpkgs/nixos-23.11"; # Change 'main' to be the release version that you desire # Ensure that this matches the version you use for your infrastructure modules panfactum.url = "github:panfactum/stack/edge.25-01-09"; }; outputs = { self, panfactum, pkgs, ... } @ inputs: { devShells = panfactum.lib.mkDevShells { inherit pkgs; modules = [ (import ./devenv.nix )]; }; }; }
After:
{ inputs = { flake-utils.url = "github:numtide/flake-utils"; # Utility for generating flakes that are compatible with all operating systems panfactum.url = "github:panfactum/stack/edge.25-01-09"; # Make sure this matches your version of the Panfactum Stack }; outputs = { panfactum, flake-utils, ... }@inputs: flake-utils.lib.eachDefaultSystem (system: { devShell = panfactum.lib.${system}.mkDevShell { }; } ); }
-
We no longer support
devenv
syntax, so yourdevenv.nix
file and the.devenv
directory can be removed.
For alternatives to all the functionality included in devenv using our new devShell paradigm, please see our documentation.
-
-
pf-get-version-hash
has been renamed topf-get-commit-hash
to better reflect what it does (get a commit hash given an arbitrary repo and git ref). In addition, it has been updated to take named rather than positional arguments in order to align with other Panfactum scripts. Finally, we have fixed several bugs in the script to make it more resilient to various inputs. -
Removes
pgadmin4
from the devShell as it significantly increased build times and was not useful to all users. Users should have an option to pick their favorite DB clients rather than us be prescriptive.
Changed
- Upgrades kube_cilium to v1.16.3. This change brings new Cilium features, reduces per-node memory usage by 75MB, and reduces the number of errors that users can encounter during the bootstrapping guide.
- Upgrades kube_aws_ebs_csi to v1.36 in order to support Karpenter v1 disruption taints and improve node shutdown performance.
- Updates wf_dockerfile_build to support 10 concurrent image builds per module rather than just one.
Added
- Adds a `cdn_mode_enabled` boolean to the kube_vault and kube_authentik modules to enable a CDN for these services.
- Adds an `image_tag_prefix` string input to wf_dockerfile_build.
Fixed
- Fixed a handful of scheduling constraint bugs that resulted in less-than-optimal resource utilization. These improvements should result in a significant improvement to resource utilization in tiny clusters and a minor improvement in larger clusters.
- Fixed an issue where `pf_stack_version` could not be a commit hash. It can now be any valid git ref.
- Fixed an issue where `pf-wf-git-checkout` would fail when given a branch name as a git ref. This impacted both wf_tf_deploy and wf_dockerfile_build.
edge.24-10-15
Breaking Changes
- This release integrates the new Panfactum provider and removes the need to pass many different variables through the module tree.

  Additionally, we have upgraded OpenTofu to v1.8, which now supports variables in module `source` fields. To take advantage of this, we now pass two new inputs to every module by default: `pf_module_source` and `pf_module_ref`.

  This greatly simplifies the developer experience for first-party modules by removing boilerplate with no loss of functionality.

  Original:

  ```hcl
  module "namespace" {
    source = "github.com/Panfactum/stack.git//packages/infrastructure/kube_namespace?ref=c817073e165fd67a5f9af5ac2d997962b7c20367" #pf-update

    namespace = "example"

    # pf-generate: pass_vars
    pf_stack_version = var.pf_stack_version
    pf_stack_commit  = var.pf_stack_commit
    environment      = var.environment
    region           = var.region
    pf_root_module   = var.pf_root_module
    is_local         = var.is_local
    extra_tags       = var.extra_tags
    # end-generate
  }
  ```

  Simplified:

  ```hcl
  module "namespace" {
    source = "${var.pf_module_source}kube_namespace${var.pf_module_ref}"

    namespace = "example"
  }
  ```

  For more information, see the updated first-party IaC development documentation.

  This does come with a couple of breaking changes:

  - Terragrunt no longer passes the following inputs to modules by default as they can be accessed via the Panfactum provider:

    - `pf_stack_version`
    - `pf_stack_commit`
    - `environment`
    - `region`
    - `pf_root_module`
    - `is_local`

  - The templating system and `pf-update-iac` have been removed as they are no longer necessary.
- kube_ingress no longer allows `rewrite_rules` to be specified on `ingress_configs`. Instead, there is now a top-level `redirect_rules` variable that has enhanced capabilities (see the sketch after this list):

  - Can pattern match against the entire URL (`https://google.com/some/path`) instead of just the path component (`/some/path`).
  - Can specify whether a permanent or temporary HTTP redirect should be used.
- kube_ingress no longer allows `domains` to be specified on individual `ingress_configs`. Instead, `domains` is now a top-level variable. This provides better compatibility with the new CDN option and prevents confusing behavior in several edge cases. This also better matches the intent of the module: to provide routing rules for a single set of domains, not to provide routing rules for all domains in your system.
Added
- A new module, kube_aws_cdn, has been created that enables setting up a CloudFront distribution (CDN) in front of Ingress resources for improved performance and security as well as reduced server costs. kube_ingress has been updated to support CDN settings.

  Additionally, a non-Kubernetes CDN module, aws_cdn, has also been created.
- A new module, aws_dns_zones, has been created that allows you to create Route53 zones that have a non-AWS registrar.
- Adds the `acl_aws_logs_delivery_enabled` input to aws_s3_private_bucket which makes it easier to use the bucket for AWS log delivery purposes.
- Adds support for Cloudflare in kube_external_dns and kube_cert_issuers.
Changed
- `tls_1_2_enabled` now defaults to `true` in kube_ingress_nginx in order to support CDNs like CloudFront which do not yet support TLSv1.3.
Fixed
- The internal logic of aws_dns_records has been updated so that each record is managed independently of the others. This fixes an issue where adding or removing records would cause all records to be recreated. However, this update will cause all records to be recreated one last time.
- `pf-wf-git-checkout` no longer automatically appends a `.git` to the end of given repo URLs as this is incompatible with some git hosting providers (e.g., Azure DevOps). This does mean that the `repo` variable input to wf_tf_deploy and wf_dockerfile_build should be updated to include the `.git` suffix if required for cloning over HTTP.
- Pinned the helm provider version for the `kube_redis_sentinel` submodule.
edge.24-10-09
Added
- Adds a new Terragrunt variable, `pf_stack_local_path`, that can be used to deploy local copies of the Panfactum Stack modules. This can be used by developers when testing changes to Panfactum modules on personal infrastructure before submitting pull requests to the Stack repository.
Changed
- Loosened the requirements for the repo variable `repo_url` so that we can now support users on arbitrary git hosting providers (not just GitHub).
- `pf-env-bootstrap` is now idempotent, allowing it to be re-run if it fails in the middle of its initial execution.
Fixed
- Fixes the AMI / instance type mismatch when `bootstrap_mode_enabled` is enabled in the aws_eks module.
- Fixes issues that prevented bootstrapping scripts from running with the new `pf-tf-init` logic.
- Adjusts the defaults for kube_reflector so that installation does not fail in the bootstrapping guide.
edge.24-09-30
Added
- Adds a new addon for self-hosted GitHub Actions runners.
- Adds the `pf-eks-suspend` and `pf-eks-resume` commands to suspend and resume an EKS cluster.
Fixed
- Fixes an issue where voluntary disruption windows created by the kube_disruption_window_controller would only work for the `argo` namespace. They will now work in all namespaces.
edge.24-09-12
Breaking Changes
- The kube_secrets_csi module has been deprecated and should be removed from your clusters. It was primarily used for managing dynamically generated Vault secrets such as database credentials, but we have switched to a new paradigm that uses the Vault Secrets Operator.

  This saves approximately 150MB of memory per node in the cluster, improves security by removing pods that needed elevated host-level permissions, and provides better ergonomics for managing dynamically generated secrets in our modules.
- kube_pg_cluster's and kube_redis_sentinel's `superuser_username` and `superuser_password` outputs have been renamed to `root_username` and `root_password`, respectively. We made this change because "superuser" implies Vault-generated credentials, which these are not.
- `pf-providers-enable` has been renamed to `pf-tf-init` as it now has expanded functionality:

  - Now influences every module in the directory tree where it is run rather than just the module in the CWD.
  - Now runs `init -upgrade` on every module to update provider versions and download internal submodules when performing Panfactum version upgrades.
  - The runtime speed has been improved in order to accommodate running against many modules at once.

  We have updated the upgrade guide to reflect that `pf-tf-init` should be run every time you upgrade the Panfactum version in an environment.
- You no longer need to manually enable providers via the `providers` array in each `module.yaml`. Our Terragrunt configuration now automatically detects which providers need to be included at runtime.

  No changes are required to take advantage of this new functionality. However, the `providers` Terragrunt input no longer has any functionality, and the `providers` array can be removed from all `module.yaml` files. If this leaves a `module.yaml` empty, the entire `module.yaml` file can be deleted.
Added
- Adds `common_env_from_config_maps` and `common_env_from_secrets` inputs to all standard workload submodules to provide the capability to source environment variables from existing ConfigMaps and Secrets, respectively (see the sketch below).
- kube_pg_cluster and kube_redis_sentinel now support using Vault-generated credentials to authenticate from other workloads. See the module documentation for more information.
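A minimal sketch of these inputs on a workload submodule such as kube_deployment. The value shape shown (environment variable name mapped to a source object and key) is an assumption, so consult the submodule documentation for the released schema; module sourcing is shown using the convention introduced in edge.24-10-15:

```hcl
module "api" {
  source = "${var.pf_module_source}kube_deployment${var.pf_module_ref}"

  # ... other kube_deployment inputs omitted ...

  # Hypothetical shape: env var name -> existing ConfigMap and key
  common_env_from_config_maps = {
    APP_SETTINGS = {
      config_map_name = "app-settings"
      key             = "settings.json"
    }
  }

  # Hypothetical shape: env var name -> existing Secret and key
  common_env_from_secrets = {
    DB_PASSWORD = {
      secret_name = "app-db-creds"
      key         = "password"
    }
  }
}
```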
Fixed
- Adds a controller node preference to pods with `controller_nodes_enabled` set to `true`. This optimizes resource efficiency in the cluster as we should prefer to fill controller (EKS) nodes before Karpenter nodes, since controller nodes are not automatically scaled.
edge.24-09-10
Breaking Changes
- Karpenter has updated its CRD specification, which unfortunately requires manual intervention during the upgrade process. After updating the `pf_stack_version` for any deployments of the `kube_karpenter_node_pools` module, run the following commands in the `kube_karpenter_node_pools` folder:

  ```bash
  pf-providers-enable
  terragrunt state rm kubernetes_manifest.default_node_class \
    kubernetes_manifest.spot_node_class \
    kubernetes_manifest.burstable_node_class \
    kubernetes_manifest.burstable_node_pool \
    kubernetes_manifest.burstable_arm_node_pool \
    kubernetes_manifest.spot_node_pool \
    kubernetes_manifest.spot_arm_node_pool \
    kubernetes_manifest.on_demand_arm_node_pool \
    kubernetes_manifest.on_demand_node_pool
  terragrunt apply --auto-approve
  kubectl delete nodepools burstable burstable-arm on-demand on-demand-arm spot spot-arm
  kubectl delete ec2nc spot burstable on-demand
  ```

  The `kubectl delete` commands may take a few minutes to complete as this will force all pods to be rescheduled from nodes created using the old CRDs to nodes created using the new CRDs.
- The `ports` input on kube_deployment and kube_stateful_set has been moved to a container-level field rather than a top-level field to better align with the Kubernetes API.
Added
- Adds a new submodule, kube_service, for defining Kubernetes Services that are optimized for the Panfactum Stack. Additionally, integrates `kube_service` into kube_deployment and kube_stateful_set for automatic Service creation.
- Adds an `extra_storage_classes` input to the kube_aws_ebs_csi module.
Fixed
- Addressed issue in kube_pg_cluster where non-superuser credentials created by Vault would not have access to database schemas other than `public`.
- Addressed issue where our Terragrunt configuration would cause the version pinning for the `goauthentik/authentik` and `alekc/kubectl` infrastructure providers to be removed. This would cause issues when users ran `terragrunt init -upgrade` to update their lockfiles.
edge.24-09-04
Breaking Changes
- Before applying this release, the `buildkit-amd64` and `buildkit-arm64` StatefulSets in the `buildkit` namespace must be removed (if kube_buildkit is deployed).
- In preparation for our upcoming release, we cleaned up a handful of naming conventions, which impacts the inputs and outputs of several modules:

  - In kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, and kube_workload_utility:

    - `ready_check_`-prefixed fields have been changed to `readiness_probe_` to better align with the actual Kubernetes API.
    - `liveness_check_`-prefixed fields have been changed to `liveness_probe_` to better align with the actual Kubernetes API.
    - `image` and `image_version` have been replaced with `image_registry`, `image_repository`, and `image_tag` to provide a clearer description of each constituent part and better align with ecosystem conventions.
    - `secrets` has been renamed to `common_secrets` to better align with its counterpart, `common_env`.
    - `pod_annotations` has been renamed to `extra_pod_annotations` to better align with its counterpart, `extra_pod_labels`.
    - `readonly` has been renamed to `read_only` to better align with our casing conventions.
    - `read_only_root_fs` has been renamed to `read_only` for better consistency across modules.
    - `instance_type_anti_affinity_required` has been renamed to `instance_type_spread_required` to better reflect that the underlying mechanism is a pod topology spread constraint.
    - `topology_spread_enabled` has been renamed to `az_spread_preferred` to better reflect actual behavior.
    - `topology_spread_required` has been renamed to `az_spread_required` to better reflect actual behavior.
    - `zone_anti_affinity_required` has been renamed to `az_anti_affinity_required` to better align naming conventions with other settings that control scheduling based on availability zone.

  - Renamed Panfactum-provided priority classes to improve semantics (see docs).

  - In kube_pg_cluster and kube_redis_sentinel:

    - `spot_instances_enabled`, `arm_instances_enabled`, and `burstable_instances_enabled` have been changed to `spot_nodes_enabled`, `arm_nodes_enabled`, and `burstable_nodes_enabled` to better align with the inputs of other modules.

  - In kube_constants, a few outputs have been updated:

    - `panfactum_image` has been renamed to `panfactum_image_repository` to better align with naming conventions in other Panfactum modules.
    - `panfactum_image_version` has been renamed to `panfactum_image_tag` to better align with naming conventions in other Panfactum modules.
- We have removed a handful of options from kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, and kube_workload_utility that we would never recommend using:

  - `prefer_spot_nodes_enabled`, `prefer_burstable_nodes_enabled`, `prefer_arm_nodes_enabled`: These scheduling preferences are unnecessary as Karpenter will already prefer the cheapest nodes.
  - `az_anti_affinity_preferred`: `az_spread_preferred` should be used instead.
- When we introduced the concept of the `enhanced_ha_enabled` input, it was designed as a cost-saving switch for direct modules where users do not need to have a deep understanding of the internals. However, it has also found its way into some submodules where it has created ambiguity about module behavior, especially since its impact differs module-to-module. As a result, we have replaced the `enhanced_ha_enabled` input in all submodules with more granular tuning knobs that have clearer behavior. This impacts the following submodules: kube_pg_cluster, kube_redis_sentinel, kube_vault_proxy, kube_argo_event_bus, and kube_argo_event_source.
- Nodes managed by EKS Node Groups (vs Karpenter) are now tainted with `controller=true:NoSchedule`. We have added this taint as pods scheduled on these nodes might be disrupted regardless of their PDBs during EKS upgrades. For some workloads this could cause a disruption. Most workload submodules have a new input, `controller_nodes_enabled`, that can be used to allow your workloads to tolerate this taint if desired.
- Previously we were conservative about enabling certain features by default in some of our submodules in order to ensure our modules would be compatible with non-Panfactum Kubernetes clusters. However, this is a very niche use case, and we have observed that this results in extra mental overhead for our normal users to avoid missing out on the core features provided by the Panfactum Stack. As a result:

  - The following flags are now enabled by default in kube_deployment, kube_stateful_set, kube_cron_job, kube_pod, wf_spec, kube_pg_cluster, kube_redis_sentinel, and kube_workload_utility:
    - `spot_nodes_enabled`
    - `arm_nodes_enabled`
    - `vpa_enabled`
    - `panfactum_scheduler_enabled`
  - The following flags are now enabled by default in kube_deployment:
    - `az_spread_preferred`
  - The following flags are now enabled by default in kube_stateful_set:
    - `az_spread_required`
    - `instance_type_spread_required`
  - The following inputs are now enabled by default in all modules:
    - `pull_through_cache_enabled`
  - The following inputs are now enabled by default in all direct modules deployed after the autoscaling section in the bootstrapping guide:
    - `vpa_enabled`
    - `panfactum_scheduler_enabled`
Added
- Adds built-in default downward-api integrations in all our workload submodules.
- All mounted ConfigMaps and Secrets in our workload submodules are now mounted as executable to make it easier to mount scripts.
Fixed
- Updates Karpenter and EBS CSI Controller to prevent any remaining edge cases where nodes were terminated prior to EBS volumes being detached which would result in six-minute delays for rescheduling stateful pods.
- Removes the `RemoveDuplicates` strategy in kube_descheduler as users expect to be able to schedule multiple pods of the same controller on the same node when they set `host_anti_affinity_required` to `false`.
edge.24-08-27
Breaking Changes
- We removed the ability to disable S3 backups in kube_pg_cluster. The backups have an extremely low cost impact and significantly improve the durability of data. Moreover, the continuous WAL archiving provided by the backups improves our system's ability to automatically recover in the case of failover events.

  Ultimately, we found that the risk of misuse (resulting in unexpected data loss or downtime) significantly outweighed any potential benefits gained by providing this functionality.
Added
- Added native support for restoring from database backups to the kube_pg_cluster submodule.
- Added automatic creation of an immediate base backup to kube_pg_cluster to ensure that new databases can be recovered all the way up to their point of creation.
Fixed
- Mitigated a rare scenario where disruption in the middle of a database failover would result in the PostgreSQL databases being unable to restart without manual intervention in the kube_pg_cluster submodule.
- Fixed an issue where `pf-get-repo-variables` would provide the wrong directory for the root of the repository when run inside a downloaded `.terragrunt-cache` directory.
edge.24-08-24
Fixed
- Addressed a couple of issues with the kube_authentik module:
  - authentik_core_resources will no longer fail to apply and end up in an invalid state when first created.
  - Authentik should no longer experience any downtime during database failover events.
edge.24-08-23
Fixed
- Correctly sets PgBouncer permissions on new PostgreSQL cluster creation in kube_pg_cluster.
edge.24-08-22
Breaking Changes
- The default behavior of kube_redis_sentinel was to use both Redis AOF and RDB for persistence. Unfortunately, using AOF concurrently with RDB negates Redis' ability to do partial resynchronizations after restarts and failovers. Instead, a full copy of the entire database must be transferred from the current master to replicas on every restart. This greatly increases the time-to-recover as well as incurs a high network cost.

  In fact, there is arguably no benefit to AOF-based persistence with our replicated architecture as new Redis nodes will always pull their data from the running master, not from their local AOF. The only benefit would be if all Redis nodes simultaneously failed with a non-graceful shutdown (an incredibly unlikely scenario).

  As a result, we have switched the module to use only RDB for persistence, and the `redis_appendfsync` input has been removed. The module still provides the ability to provide custom Redis configuration, so you can re-enable AOF if you want (though we would not advise it).
- `token_lifetime_seconds` has been changed to `token_lifetime_hours` in vault_auth_oidc to avoid a perpetual diff issue present in the Vault provider.
- Removed the daily backups from kube_velero as they were undocumented and had no realistic use case.
Added
- Adds a new submodule, kube_disruption_window_controller, which can be used to specify time-based disruption windows for disruption-sensitive workloads (e.g., databases). Disruption window capabilities have also been added to kube_pg_cluster and kube_redis_sentinel.
- Adds synchronous replication support to kube_pg_cluster via `pg_sync_replication_enabled` (see the sketch below).
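A minimal sketch of enabling synchronous replication on a kube_pg_cluster deployment (other required inputs omitted; module sourcing is shown using the convention introduced in edge.24-10-15):

```hcl
module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  # ... other required inputs omitted ...

  # Require replicas to acknowledge writes before commit, trading write latency for durability
  pg_sync_replication_enabled = true
}
```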
Fixed
- Addressed issue where `pg_smart_shutdown_timeout` could not be set to 0 in kube_pg_cluster without having CNPG reset it to 180.
- Fixed an issue in kube_velero where stale EBS snapshots were not being deleted.
- Added stricter disruption prevention to the Velero server in kube_velero as disrupting the server in the middle of a backup operation would cause it to fail and not be resumed.
edge.24-08-15
Breaking Changes
- `pg_shutdown_timeout` has been renamed to `pg_smart_shutdown_timeout` to better indicate its purpose in kube_pg_cluster. Additionally, the shutdown and failover logic has been overhauled. The new default will immediately terminate running queries when a database pod is killed, but this serves to reduce the downtime from 60-120 seconds to < 5 seconds in the failover scenario. Please see the module documentation for more information.
Added
- Adds the concept of passthrough parameters to wf_spec.
- Makes `tf_apply_dir` a Workflow parameter in wf_tf_deploy so that you only need a single instance of this module per cluster.
- Adds the ability to use `templateRef` to compose Workflows in wf_spec.
Fixed
- Fixed the working directory in wf_tf_deploy and wf_dockerfile_build to be inside the cloned repository.
- Addressed OOM errors when using resource templates with wf_spec.
edge.24-08-13
Breaking Changes
- `pg_storage_increase_percent` has been changed to `pg_storage_increase_gb` in kube_pg_cluster. This allows for more predictable storage autoscaling and optimal resource provisioning regardless of the current storage scale.
- `pg_storage_gb` has been changed to `pg_initial_storage_gb` in kube_pg_cluster. This better indicates that this value is only used during the initial database provisioning and has no effect thereafter.
- `node_vpc_id`, `node_subnets`, and `node_security_group_id` have been moved from kube_karpenter to kube_karpenter_node_pools in order to simplify the logic of assigning nodes to subnets, VPCs, and security groups. Additionally, we have removed the Karpenter auto-discovery tags as they are no longer necessary.
Added
- Adds new enhancements to the kube_pg_cluster module (see the sketch after this list):
  - Better defaults and options for memory tuning
  - Provides the ability to set arbitrary PostgreSQL parameters
  - Provides the ability to set a custom backup schedule
  - Adds support for additional schemas via the `extra_schemas` input
- Adds another local retry for Terragrunt when providers produce an inconsistent final plan.
- Adds a check for an updated `direnv` version to prevent issues when setting up the local devenv.
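A minimal sketch of the `extra_schemas` input, assuming it accepts a list of schema names to create in addition to the default `public` schema (the inputs for arbitrary PostgreSQL parameters and the backup schedule are not named in this entry, so they are not shown; other required module inputs are omitted):

```hcl
module "database" {
  source = "${var.pf_module_source}kube_pg_cluster${var.pf_module_ref}"

  # ... other required inputs omitted ...

  # Additional schemas to provision in the database
  extra_schemas = ["analytics", "audit"]
}
```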
Fixed
- Added deterministic ordering to additional resources in authentik_core_resources.
- Fixed the following bugs in `pf-env-bootstrap`:
  - Would use a non-existent AWS profile for the `.sops.yaml` file.
  - Would not install all the platform checksums in the `.terraform.lock.hcl` files.
- `amd64` nodes are now used when `bootstrapping_enabled` is `true` in aws_eks in order to allow certain bootstrapping tests (e.g., Cilium) to run successfully.
- Restores the `pf-db-tunnel` command to the devenv.
- `pf-get-version-hash local` now properly returns `local` without an error code.
- Updates the Panfactum image version in kube_constants to a version that is compatible with the latest pre-built workflows.
edge.24-08-12
Breaking Changes
- Repository variables must now be defined in a `panfactum.yaml` file located at the root of your repository instead of in your `devenv.nix`. Additionally, the variable names are no longer prefixed with `PF_` and are lowercase.

  For example, `env.PF_REPO_NAME` in `devenv.nix` should now be defined as `repo_name` in `panfactum.yaml`.

  This change was made to make it easier to reference these values outside of local development contexts, such as within CI pipelines where `devenv.nix` isn't loaded.
Added
- We have provided two new addons, a Workflow Engine (Argo Workflows) and an Event Bus (Argo Events).
- We have created a guide and best practices for setting up CI / CD in the Panfactum Stack.
- To support the new addons, we are upgrading the following infrastructure modules to Beta status:
  - kube_argo: For deploying the Argo controllers
  - kube_argo_event_bus: For deploying an Argo EventBus
  - kube_argo_event_source: For deploying an Argo EventSource
  - kube_argo_sensor: For deploying an Argo Sensor
  - wf_spec: For creating an Argo Workflow specification
  - wf_tf_deploy: For creating an Argo WorkflowTemplate that deploys IaC modules
  - wf_dockerfile_build: For creating an Argo WorkflowTemplate that builds container images from Dockerfiles
- Adds `pf-get-repo-variables`, which prints a JSON payload of all repository configuration variables with the appropriate defaults set.
edge.24-07-08
Breaking Changes
- We have made a small, breaking refactor of aws_eks to reduce unnecessary options that made onboarding and maintenance more difficult:

  - Most importantly, users will no longer be able to set the instance type and count for nodes in EKS node groups. This flexibility is unnecessary since node provisioning is handled by Karpenter and not EKS. Moving forward, there are just two static configurations that are guaranteed to work in all use cases: one for before autoscaling is installed and one for after. This is controlled via the new input, `bootstrap_mode_enabled` (default: `false`).
  - `control_plane_version` and `controller_node_kube_version` have been unified into a single variable called `kube_version` that applies to all subsystems.
  - `controller_node_subnets` has been renamed to `node_subnets` to indicate these subnets are used for all cluster nodes, not just the EKS node groups.
  - `all_nodes_allowed_security_groups` has been renamed to `node_security_groups` to align naming conventions.
- By default, PVCs created by controllers such as StatefulSets cannot be updated through their controller as their template (`volumeClaimTemplates`) is immutable (a Kubernetes limitation). This poses a challenge when needing to update PVC metadata such as annotations and labels. We have built a workaround to this (kube_pvc_annotator) and incorporated it in various Panfactum modules. Unfortunately, incorporating this enhancement requires redeploying StatefulSets.

  To complete this upgrade, perform the following steps:

  1. Create a Velero backup of the cluster by running `velero create backup -w <backup_name>` to recover in case of mistakes.

  2. The following StatefulSets need to be deleted in this order AND with `kubectl delete --cascade=orphan` AND immediately restored with a subsequent `terragrunt apply` to their defining module:

     - The Vault StatefulSet created by `kube_vault`
     - The Redis cluster StatefulSet for Authentik created by `kube_authentik`
     - The BuildKit StatefulSets created by `kube_buildkit`
     - Any StatefulSets you have provisioned with kube_stateful_set
     - Any Redis cluster StatefulSets you have provisioned with kube_redis_sentinel

     As long as you use `--cascade=orphan` and take care to minimize the time between the `kubectl delete` and `terragrunt apply`, there will not be any downtime during this operation.

  3. After completing this operation, you need to delete the backing PVCs from each module one at a time by deleting the PVC and then deleting its bound pod. The controller will then automatically provision a new PVC with the correct labels and annotations to take advantage of the new functionality.

     After deleting each pod, ensure that a new pod is automatically provisioned and becomes healthy before proceeding to the next. As long as you proceed one at a time, this will not cause any downtime or data loss.

  4. Delete the Velero backup you created in step 1 by running `velero delete backup <backup_name>`.
Added
- Adds kube_fledged to the core stack. The kube-fledged controller adds the ability to pre-pull images to every node to improve pod startup times for critical or frequently used containers such as the Linkerd proxy or database images. We provide instructions for installing this module here.
- Adds the kube_pvc_annotator submodule that will provision a CronJob to run `pf-set-pvc-metadata` against PVCs created by immutable templates. See the module documentation for potential use cases.
- Adds `persistence_backups_enabled` (default: `true`) to kube_redis_sentinel to support disabling EBS snapshot backups.
- Adds a new common variable, `node_image_cache_enabled`, to Panfactum modules that can be used to enable pre-pulling images to nodes via the `kube_fledged` operator.
- Adds the `pf-buildkit-clear-cache` command for removing any BuildKit caches not being used by an active image build job.
- Adds the `pf-set-pvc-metadata` utility command for syncing labels and annotations across groups of PVCs.
Fixed
- Fixes handling of public ECR registries in `docker-credential-panfactum`.
- Fixes handling of ECR token caching in `docker-credential-panfactum`.
- Fixes `pf-get-open-port` to be platform-agnostic.
- Fixes `pf-get-version-hash` to work with commit hash inputs.
- Fixes image paths in the Authentik dashboard for applications provisioned by Panfactum modules.
edge.24-07-01
Breaking Changes
- The input format to aws_ecr_repos has been reformatted to support better per-repository configuration. This should not require replacing any resources, but it will require updating your Terragrunt inputs.
- The following resources will no longer be tagged with the Panfactum version and commit hash, as updating these tags causes unnecessary delays and disruptions for little added value:
  - EC2 instances in EKS node groups generated by aws_eks
  - EC2 instances serving as NAT hosts in aws_vpc
  - KMS replica keys in aws_kms_encrypt_key
  - Pods created in kube_bastion
Added
- kube_buildkit has graduated to beta and is now ready for general consumption. This is the first stack addon that can be used to extend the behavior of the core stack. Installation and usage instructions can be found here.
- aws_ecr_repos now supports custom image expiration rules and both pull and push permissions.
- aws_ecr_public_repos has been added to support creating public ECR repositories.
- Adds ARM support in kube_bastion and kube_pvc_autoresizer. All core cluster components can now run on both amd64 and arm64 nodes, allowing for optimal cost savings.
- Changes the default `securityContext.fsGroupChangePolicy` to `OnRootMismatch` for Pods created by Panfactum submodules in order to improve PVC mounting performance.
- `pf-providers-enable` now ensures that `.terraform.lock.hcl` files have all common platform checksums.
- Adds `pf-get-terragrunt-variables` which can be used to derive the Terragrunt variables that would be used if Terragrunt were run in the given directory.
- Adds `pf-tf-delete-locks` which can be used to bulk-release Tofu state locks.
- Adds `pf-sops-set-profile` which will update all sops-encrypted files in the given directory to use the indicated AWS profile for KMS operations. This can be used in CI pipelines to allow the CI user to access sops-encrypted files.
- (Alpha) Adds the kube_argo_sensor and kube_argo_event_source submodules for deploying these core components of the Argo Events system.
- (Alpha) Adds the kube_workflow_spec submodule to help in defining production-ready Argo Workflows.
Fixed
- kube_aws_ebs_csi has been adjusted to ensure that PVCs are detached from nodes during node shutdown, preventing unnecessary delays in moving PVCs between nodes.
- kube_core_dns no longer accidentally includes the Vault provider.
- kube_ingress_nginx will no longer unnecessarily set browser security headers on `3xx` responses or responses that do not have `Content-Type` headers.
edge.24-06-20
Breaking Changes
- kube_karpenter has upgraded the Karpenter version to `v0.37`. During this release cycle, the Karpenter team moved the CRDs required by Karpenter to a dedicated Helm chart to improve the upgrade ergonomics. Unfortunately, this introduces a few one-time manual steps that you must perform to enable the migration. Specifically, the following commands must be run against your cluster before applying the latest version of `kube_karpenter`:

  ```bash
  kubectl label crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh app.kubernetes.io/managed-by=Helm --overwrite
  kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-name=karpenter-crd --overwrite
  kubectl annotate crd ec2nodeclasses.karpenter.k8s.aws nodepools.karpenter.sh nodeclaims.karpenter.sh meta.helm.sh/release-namespace=karpenter --overwrite
  ```
- kube_karpenter_node_pools has a new input, `node_labels`, which defines what labels will be applied to generated nodes. The standard Panfactum labeling system will no longer apply to Karpenter nodes due to this upstream issue.
- The `persistence_enabled` option was removed from kube_redis_sentinel. Redis is now always deployed with persistence enabled. This decision was made because the cross-AZ network costs of re-instantiating Redis nodes without PVC storage dwarf the costs of the PVC storage (by a factor of 100x). As a result, there is no benefit to not periodically saving the Redis database to a persistent disk.

  To compensate for potential performance impacts, we have exposed another input, `redis_appendfsync`. Setting this to `"no"` will achieve the same performance as having persistence disabled. However, the default setting of `"everysec"` is likely sufficient for the vast majority of use cases and reduces the risk of data loss.

  Unfortunately, if you were previously running with `persistence_enabled` set to `false`, you will need to delete the Redis StatefulSets in order to apply the new module.

  In particular, this impacts the `kube_authentik` module. Before deleting the Redis StatefulSet for Authentik, ensure your Vault token is not expired as you will not be able to re-authenticate with Authentik while the Redis StatefulSet is removed.

  Since `persistence_enabled` should only have been used in scenarios where data retention was not important, this should be considered a safe operation. However, it will introduce a minor service disruption during the replacement period.
- aws_ecr_pull_through_cache_addresses has been refactored to improve the ergonomics of using the module. It now requires an input, `pull_through_cache_enabled`, and will output the correct registry names regardless of whether you are using a pull-through cache or not.
Added
- kube_deployment, kube_stateful_set, kube_cron_job, and kube_pod have graduated to Beta status. They are now safe to use.
- Adds the `pf-providers-enable` command that will automatically inspect the source infrastructure module and enable the required providers in a module's `module.yaml`.
- Adds the `pf-update-iac` command that will update first-party infrastructure modules in the following ways:
  - Executes the templating directives.
  - Updates the `ref` in sourced Panfactum submodules to the commit hash of the devenv if the `# pf-update` annotation is provided. See the documentation for more details.
- Adds phone number validation in aws_account.
- Adds a `cors_enabled` (default: `false`) input variable to kube_vault that can enable CORS handling. This can be useful when building web applications that interact with Vault in client-side JavaScript. By default, this will allow CORS requests from all sibling and child domains.
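
A minimal sketch of enabling the new CORS handling in kube_vault; the surrounding Terragrunt boilerplate is assumed and omitted:

```hcl
# Hypothetical excerpt from a kube_vault terragrunt.hcl
inputs = {
  # Allows CORS requests from sibling and child domains by default
  cors_enabled = true
}
```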
Fixed
- Addresses an issue in kube_authentik that prevented the SSO login pop-up from working.
- Implements custom CORS handling logic in kube_ingress that resolves issues in the default behavior provided by the NGINX ingress controller.
- Removes invalid failure cases when using `pf-get-vault-token` in Terragrunt and improves failure messaging.
- Fixes an issue that occurs when the `kubernetes` provider is enabled but the sourced module does not use the `kubectl` provider.
- Fixes failure cases in `pf-env-scaffold` and adds more debug logging.
edge.24-06-14
Added
- Adds kube_scheduler, an alternative Kubernetes scheduler that can be used to improve bin-packing of pods on nodes in the Kubernetes cluster. This allows for better, smaller node selection, and our tests show an estimated 25-33% reduction in node costs when used. We provide instructions for installing it here.
- Adds a `panfactum_scheduler_enabled` (default: `false`) input to most infrastructure modules. When enabled, the module will use the scheduler provided by kube_scheduler instead of the less-efficient EKS scheduler (see the sketch after this list).
- If `panfactum_scheduler_enabled` is `true`, the kube_descheduler will automatically remove pods from low-utilization nodes so that the kube_scheduler can bin-pack them onto other nodes.
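
A minimal sketch of opting a workload into the new scheduler (the module chosen and the rest of the file are assumptions; the input is available on most infrastructure modules):

```hcl
# Hypothetical excerpt from a workload module's terragrunt.hcl
inputs = {
  # Use the bin-packing kube_scheduler instead of the default EKS scheduler
  panfactum_scheduler_enabled = true
}
```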
Fixed
- Addresses a bug in the previous release that left kube_karpenter unable to be deployed.
- Addresses an issue where nodes were limited to a hard cap of 29 pods.
- Configures Kubernetes nodes to use a fixed amount of system overhead rather than an amount that scales unnecessarily with node size.
edge.24-06-13
Added
- Updates kube_pg_cluster with many new variables for configuring PgBouncer. New variables are prefixed with `pgbouncer_`.
- Adds support for `path_prefix` to kube_vault_proxy. (@mschnee)
- Adds a new `enhanced_ha_enabled` input to many core modules (default `true`). Setting this to `false` will allow for additional cost savings (approximately $50 / month) in exchange for introducing a small possibility of temporary outages; we estimate that setting this to `false` reduces availability from 99.995% to 99.9%. This can be used to decrease costs in less critical clusters (e.g., `development`). See the sketch after this list.
- Adds a Spot Data Feed to the aws_account module.
- Adds the kube_open_cost module for calculating the cost of workloads running on Kubernetes.
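
A minimal sketch of trading availability for cost with the new `enhanced_ha_enabled` input (surrounding Terragrunt boilerplate assumed):

```hcl
# Hypothetical excerpt from a core module's terragrunt.hcl in a development cluster
inputs = {
  # Defaults to true; false trades ~99.995% for ~99.9% availability to save roughly $50 / month
  enhanced_ha_enabled = false
}
```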
Fixed
- Addressed an issue in aws_vpc where NAT nodes wouldn't restart if NAT setup failed with an exit code other than `1`.
- Increased the memory floor of the Authentik server in kube_authentik to avoid OOM issues.
- Updates kube_authentik to allow showing Gravatar profile images.
- Updates kube_authentik to provide the necessary Permissions-Policy headers to allow use of WebAuthn devices.
- Correctly applies pod labels in kube_aws_lb_controller.
- Removes node preference defaults from kube_workload_utility that were preventing efficient node deprovisioning.
- Adjusts the VPA recommendation overhead from 30% to 15% to improve resource utilization.
- Fixes incorrect SCIM property mapping in authentik_aws_sso.
- Aligns pod labels, affinities, topologySpreadConstraints, and tolerations in kube_linkerd with the conventions used in all other modules.
edge.24-06-08
Added
- Updates aws_vpc to support a new command, `pf-vpc-network-test`, that will verify the network connectivity properties of the instantiated VPC. This allows us to simplify an otherwise complex validation step in the bootstrapping guide.
- Adds the `pf-env-bootstrap` command that automatically bootstraps the necessary resources to begin working with IaC in an environment. This replaces the manual steps that used to be a part of the bootstrapping guide.
- Adds a new `extra_inputs` terragrunt variable that allows you to pass inputs to all modules in the current scope.
- Adds arm64 NodePools and arm64 support for the core components. This reduces the cost of running the base stack by $25 - 50 / month due to significantly better price / performance ratios for arm64 instances in AWS.
- Sets `unhealthyPodEvictionPolicy` to `AlwaysAllow` for all module PDBs. This allows the system to scale up more quickly when it is running against resource pressure and pods become stuck in a temporary crash loop.
- Sets the maximum node lifetime to 24h to force Karpenter to try to consolidate instances at least once per day.
Fixed
- Addressed an issue where the `aws-ebs-csi-driver` DaemonSet pods would not be properly terminated by Karpenter during node shutdown. This resulted in EBS volumes not being detached and introduced an unnecessary 6-minute delay when moving EBS volumes between nodes.
- Replaces most usages of `kubernetes_manifest` with `kubectl_manifest` to avoid manifest type-parsing issues that prevent dynamic values in manifests.
edge.24-06-06
Breaking Changes
- kube_trust_manager has been deprecated as its functionality was redundant with kube_reflector. We are keeping the module in the repo to support backwards compatibility, but it will be removed in the future. You should perform the following steps to remove it:
  - Apply this release.
  - Remove any dependency blocks that reference it in your `terragrunt.hcl` files.
  - Run `terragrunt destroy` on the module to remove it.
  - Delete the `bundles` CRD.
Added
- aws_registered_domains can now set the contact type for each contact.
- Allows users to reference availability zones by a single character (e.g., `a`) in addition to the full name (e.g., `us-east-2a`) in the aws_vpc module.
- The manual steps needed to reset new EKS clusters to a clean slate during the bootstrapping guide have been consolidated into a single new command, `pf-eks-reset`.
Fixed
- Addressed an issue in aws_vpc that caused a temporary, harmless error to crash the `terragrunt apply` on initial bootstrapping.
- Fixed an issue where Cilium test suites would fail during bootstrapping due to a NetworkPolicy blocking the kube_core_dns module.
edge.24-06-04
Breaking Changes
- The reloader deployment must be deleted before the next apply of kube_reloader. No inputs have changed.
- The alpha module `kube_labels` has been removed in favor of the labels provided by kube_workload_utility.
- VPC flow logs in aws_vpc are now disabled by default as they can be fairly expensive and should only be used if you have a specific use case in mind. They can be enabled by setting `vpc_flow_logs_enabled` to `true` (see the sketch below).
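
A minimal sketch of re-enabling flow logs after this change (surrounding Terragrunt boilerplate assumed):

```hcl
# Hypothetical excerpt from an aws_vpc terragrunt.hcl
inputs = {
  # Flow logs are now off by default; enable only if you have a specific need
  vpc_flow_logs_enabled = true
}
```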
Added
- Added a new `pf-env-scaffold` script that takes care of setting up the `PF_ENVIRONMENTS_DIR` in the bootstrapping guide section for setting up terragrunt.
- Added kube_workload_utility to make it easier to create uniform, production-hardened Pod specs that take advantage of all capabilities included in the Panfactum stack.
- A new standard label, `panfactum.com/workload`, can be used to group replicated pods for the purpose of aggregating metrics. This is now applied in all core infrastructure modules.
- Added kube_constants, which exports static configuration values that can be useful when creating resources that run on clusters in the Panfactum stack.
- kube_cert_manager will now automatically delete Certificate secrets if the Certificate is deleted.
- aws_ses_domain now takes an optional input, `smtp_allowed_cidrs`, that restricts which IPs can use the generated SMTP credentials. This allows users to mitigate credential exfiltration attacks. We provide an example of how to use this here (see also the sketch after this list).
- The Vault login UI will now have the OIDC login as the default method.
- Terragrunt will now automatically retry on some errors up to three times before exiting the process with a failure. This should address intermittent issues such as network disruptions or race conditions.
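
A minimal sketch of restricting SMTP credential use with the new input (the CIDR value is a placeholder, and the surrounding Terragrunt boilerplate is assumed):

```hcl
# Hypothetical excerpt from an aws_ses_domain terragrunt.hcl
inputs = {
  # Only allow the generated SMTP credentials to be used from these IP ranges
  smtp_allowed_cidrs = ["10.0.0.0/16"] # placeholder CIDR
}
```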
Fixed
- `.env` files are now properly loaded into the shell environment, and changes will trigger fast reloads instead of full devenv re-evaluations.
- Temporarily adds `GIT_CLONE_PROTECTION_ACTIVE=false` to the shell environment in order to address this issue. Note that this only disables new bleeding-edge security features which were accidentally shipped in a broken state.
- Adjusts the base resource requests of core infrastructure modules to prevent temporary OOM errors when bootstrapping, before the VPA takes effect.
- kube_authentik now respects the `log_level` input.
- Sets `max_history` to `5` for all Helm charts to prevent overloading the Kubernetes API server with an ever-growing amount of historical Helm deployments.
edge.24-06-02
Breaking Changes
- Upgraded to devenv 1.0. As a part of this upgrade, `.env` file values can no longer be referenced directly inside `.nix` files.
Added
- Updated kube_redis_sentinel to automatically limit the client buffer size to prevent OOM issues when processing very bursty traffic.
- Added a `pf-update` command that runs all the repository scaffolding commands at once.
Fixed
- Addressed an issue that caused updates to the local devenv to take at least 10 minutes to rebuild on macOS. Rebuilds should now be 10-15x faster, but they will still take about 45 seconds at minimum. Note that this only impacts rebuilds and not normal direnv load times, which should still be instant.

  This is a known limitation of upstream nix's derivation evaluation caching when using flakes. We expect this to be addressed when flakes reach stability.
- Added missing defaults for `PF_ENVIRONMENTS_DIR` and `PF_IAC_DIR`.
- Resolves an issue where devenv warnings could not be resolved during the initial bootstrapping guide.
- Added extra validation for the terragrunt variable `extra_tags`. Invalid characters will now be replaced with `.` for both keys and values for both Kubernetes labels and AWS tags.
- Fixed some core components that were using all Kubernetes labels for `labelSelector` matching rules, which prevented Karpenter from autoscaling when `extra_tags` was provided. This previously manifested as the error `spec.requirements: Too many: #: must have at most 30 items`.
- Added extra constraints to kube_external_dns to prevent it from attempting to query zones that it isn't managing.
- Prevented kube_external_dns from excluding parent domains of included domains.
edge.24-05-30
Breaking Changes
- The default for `vault_storage_size_gb` in kube_vault has been changed from `20` to `2` in order to improve resource utilization. If you created Vault with the old default, you will need to manually set `vault_storage_size_gb` to `20`, as volume sizes cannot be reduced after creation.
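
A minimal sketch of pinning the old size for existing Vault deployments (surrounding Terragrunt boilerplate assumed):

```hcl
# Hypothetical excerpt from a kube_vault terragrunt.hcl
inputs = {
  # Keep the pre-existing 20 GB volumes; EBS volumes cannot be shrunk
  vault_storage_size_gb = 20
}
```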
Added
- (Alpha) Added the Loki logging backend via kube_logging and the Alloy log collector via kube_alloy.
- The PVC Autoresizer has been added via the kube_pvc_autoresizer module in order to automatically expand EBS volumes as they fill up. We provide the guide for deploying it here.
- Added validation for phone number format in aws_registered_domains. (@wesbragagt)
Fixed
- Resolved an issue where scheduling constraints could not be resolved for components deployed before Karpenter. (#41)
edge.24-05-23
Breaking Changes
- We have removed the EKS CoreDNS addon and replaced it with the kube_core_dns module in order to provide better guarantees about the behavior of DNS in the Panfactum stack. In order to migrate:
  - Add the `dns_service_ip` input to aws_eks deployments by following this guide. Double-check that the `dns_service_ip` is the same IP as defined by `kube-system/kube-dns`.
  - Additionally, set `core_dns_addon_enabled` to `true`.
  - Apply the updated `aws_eks` module.
  - Add the `cluster_dns_service_ip` input to your kube_karpenter_node_pools module like this, and re-apply the module. Ensure that all of your nodes have been replaced with the new configuration.
  - Deploy `kube_core_dns` by following this guide. Note that this deployment will fail as the original addon service is still running and the IP is already taken.
  - Delete `kube-system/kube-dns` and re-apply `kube_core_dns`. Note that while the service is deleted, DNS will be temporarily unavailable in your cluster.
  - Once you've validated that DNS is working in the cluster, remove the `core_dns_addon_enabled` input from the `aws_eks` module and re-apply.
- We have stabilized the label selectors in kube_pod, but this requires one final label update for already-deployed Deployments. This will cause re-applies of kube_bastion to fail (as well as any first-party modules that rely on kube_deployment). To resolve, you must first manually delete the `bastion/bastion` deployment (and all other deployments created by kube_deployment).
- kube_pg_cluster has two new flags, `pgbouncer_read_only_enabled` (default `false`) and `pgbouncer_read_write_enabled` (default `true`), which will enable the `r` and `rw` poolers, respectively. This enables users to better control what is deployed so as not to have idle resources. This is a breaking change as `pgbouncer_read_only_enabled` is set to `false` by default.
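
A minimal sketch of the new pooler flags (the values shown are illustrative; surrounding Terragrunt boilerplate assumed):

```hcl
# Hypothetical excerpt from a kube_pg_cluster terragrunt.hcl
inputs = {
  pgbouncer_read_write_enabled = true  # "rw" pooler (default true)
  pgbouncer_read_only_enabled  = false # "r" pooler (default false); enable only if you use read-only connections
}
```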
Added
- (Alpha) We've added a monitoring stack, kube_monitoring, which includes HA Prometheus, the Prometheus Operator, Thanos metrics storage on S3 (with deduplication, caching, and down-sampling), the Node Exporter, kube-state-metrics, Alertmanager, and Grafana (with SSO enabled and 20+ custom dashboards).

  Additionally, most modules now have an additional `monitoring_enabled` (default `false`) flag that can be turned on to begin shipping data to Prometheus for viewing and querying via Grafana (see the sketch after this list).
- (Alpha) kube_cilium now has a new debugging mode, `hubble_enabled` (default `false`), that will capture extensive TCP-level metrics about the cluster as well as expose a debugging UI via HTTPS.
- (Alpha) kube_linkerd now deploys Linkerd Viz when `monitoring_enabled = true`. This provides a service mesh dashboard and the ability to capture and introspect raw HTTP requests sent in real time.
- (Alpha) We've added the Argo Workflow engine to the stack via the kube_argo module. This will serve as the basis for future, integrated CI / CD systems and can also be used to process arbitrary events from event queues such as AWS SNS/SQS and Kafka. (@jlevydev)
- Adds a new module, kube_vault_proxy, that can be used to add SSO to web assets that do not have integrated SSO. The module's SSO is configured out-of-the-box to work with the cluster's Vault instance.
- We've included a new Kubernetes provider, kubectl, to augment the original kubernetes provider. The `kubectl` provider allows more flexibility in deploying raw Kubernetes manifests, which is required by our templating system. This provider will automatically be enabled whenever the `kubernetes` provider is enabled, so no additional changes are required from end users.
- kube_redis_sentinel has a new flag, `lfu_cache_enabled`, that will configure the Redis cluster to automatically evict records under memory pressure based on an approximated Least Frequently Used algorithm.
- kube_ingress now takes an `extra_configuration_snippet` variable which allows for additional commands in the NGINX configuration snippet.
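
A minimal sketch of turning on metrics shipping for a module (the flag is available on most modules; surrounding Terragrunt boilerplate assumed):

```hcl
# Hypothetical excerpt from a module's terragrunt.hcl
inputs = {
  # Ship metrics to the kube_monitoring Prometheus for viewing in Grafana
  monitoring_enabled = true
}
```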
Changed
- Added the standard Restricted Reader role to Vault instances (`rbac-restricted-reader`) and updated vault_auth_oidc to take `restricted_reader_groups`. Since cluster resources authenticate with SSO via Vault, this allows restricted readers to access additional cluster resources such as Grafana and Argo Workflows (albeit in a locked-down, read-only mode).
- Disabled evictions of database pods based on max lifetimes. This improves the stability of databases deployed by Panfactum modules.
- After completing the bootstrapping guide, we now recommend that users update their `aws_eks` cluster modules to have `controller_node_count` set to `1` and `controller_node_instance_types` set to `["t3a.medium"]` (see the sketch after this list). This will decrease the costs of the base cluster by about 40% without impacting cluster availability or resiliency. The single remaining node is used primarily as a place for Karpenter to run (Karpenter cannot run on instances that it itself provisions).
- kube_karpenter now only deploys a single instance of Karpenter and enforces that it runs on a controller node. This reduces the overall resource utilization of this fairly heavyweight controller.
- Kubernetes labels applied via the `extra_tags` terragrunt input are now sanitized for valid characters automatically (invalid characters are replaced with `.`). (@mschnee)
- Added scheduling constraints to prevent critical workloads from scheduling all pods on the same instance type in order to minimize the possibility of disruption from events that only affect one instance type (e.g., spot node preemption).
- Changes many other non-critical core controllers to only have a single replica when 100% uptime is not necessary in order to reduce resource utilization in the Stack.
- Updates many controller deployments to use the Recreate deployment strategy to improve the timing and efficiency of applying Panfactum upgrades.
- kube_vpa has a new `history_length_hours` input (default `24`) that controls how far back it will analyze metrics when computing its recommendations.
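
A minimal sketch of the post-bootstrap recommendation above (surrounding Terragrunt boilerplate assumed):

```hcl
# Hypothetical excerpt from an aws_eks terragrunt.hcl after bootstrapping is complete
inputs = {
  controller_node_count          = 1
  controller_node_instance_types = ["t3a.medium"]
}
```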
Fixed
- PVCs for postgres instances were inadvertently created with duplicated entries for accessModes. This had no functional impact but confused monitoring systems. This has been fixed, but the fix will not retroactively adjust existing PVCs as they are immutable.
edge.24-05-15
Breaking Changes
- kube_vault now takes `vault_domain` as an input instead of `environment_domains` (see the sketch below). This change was made because having multiple domains for Vault is incompatible with using Vault as an intermediary IdP.
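
A minimal sketch of the migration from `environment_domains` to `vault_domain` (the domain is a placeholder; surrounding Terragrunt boilerplate assumed):

```hcl
# Hypothetical excerpt from a kube_vault terragrunt.hcl
inputs = {
  # Previously: environment_domains = ["example.com"]
  vault_domain = "vault.example.com" # placeholder domain
}
```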
Added
- New kube_reflector module for deploying the Reflector in order to synchronize ConfigMaps and Secrets across namespaces. Created a new guide section for deploying the module as a part of the foundational Stack.
- New `pg_shutdown_timeout` variable for kube_pg_cluster to control how long the postgres instances will wait for active connections to close before shutting down.
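
A minimal sketch of the new variable (the value and its units are assumptions for illustration; surrounding Terragrunt boilerplate assumed):

```hcl
# Hypothetical excerpt from a kube_pg_cluster terragrunt.hcl
inputs = {
  # How long to wait for active connections to close before shutting down a postgres instance
  pg_shutdown_timeout = 30 # illustrative value; units assumed to be seconds
}
```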
Fixed
- Fixed an issue where simultaneous, graceful shutdown of all postgres nodes in a kube_pg_cluster would cause unnecessary downtime when the primary was running on a spot instance.
edge.24-05-12
The initial edge release of the Panfactum stack!