Deploying Workloads: The Basics

Objective

Learn how to deploy your code and workloads onto the Kubernetes clusters created by the Panfactum Stack.

Prerequisites

Guide for writing and deploying IaC modules

Overview

In the Panfactum Stack, your code will run on the Kubernetes clusters you created in the bootstrapping guide. ¹

The basic process for running first-party code in a Kubernetes cluster is as follows:

Build a container image containing the code your wish to deploy. We recommend using our BuildKit addon to build from Dockerfiles. ²

Create a first-party IaC module that defines the Kubernetes resources you wish to deploy. We provide pre-configured submodules for many of the most common Kubernetes resources:

Kubernetes Type	Submodule	Description
Namespace	kube_namespace	A logical container for Kubernetes resources. Most resources must be assigned a namespace.
Deployment	kube_deployment	Orchestrates stateless workloads such as webservers.
StatefulSet	kube_stateful_set	Orchestrates for stateful workloads such as databases.
CronJob	kube_cron_job	Periodically runs one-shot processes on a schedule such as system maintenance scripts.
DaemonSet	kube_daemon_set	Runs a pod on every tolerated node in the cluster.
Job	kube_job	Runs a one-off process as a part of a module deployment. Useful for pre-/post-deployment scripts such as DB migrations.
Argo Workflow	wf_spec	An advanced pattern available through our Workflow Engine Addon.

Deploy the IaC module by configuring it in the appropriate environment directory and running terragrunt apply.
(Optional) Connect the module deployment to our CI / CD addon.

To see a basic example in action, we recommend you take a look at the IaC for this website:

Core Concepts

Most of the controllers (e.g., Deployments, StatefulSets, etc.) create and manage Kubernetes Pods. ³ As all pods have very similar configuration requirements, we will review the core concepts here.

Containers

A Pod is a colocated group of Containers that all share the same Kubernetes permissions and linux namespaces. ⁴

Most of our controller submodules have a containers input which allows you to configure the containers that will be present in the pods that the controller creates.

There are a few required fields for every defined container:

name: A unique name for the container within the pod (example: foo)
image_registry: The domain name for a container image registry (example: docker.io)
image_repository: The image repository within the registry (example: library/nginx)
image_tag: The tag for a specific image within the registry (example: 1.27.1)
command: The command in list form to execute when starting a container from the given image (example: ["/bin/bash", "-c", "tail -f /dev/null"])

Given the above example values, a pod would be created with a container called foo that uses the image from docker.io/library/nginx:1.27.1 as the root file system and runs /bin/bash -c "tail -f /dev/null". ⁵

For public images, we strongly recommend that you use the ECR pull-through caches for the image_registry (setup as a step in the bootstrapping guide). We provide a submodule for easily referencing the various registry addresses: aws_ecr_pull_through_cache_addresses.

Init Containers

By default, every container in containers is created as a regular container. However, any container has the ability to be an init container by setting init to true on its specification.

Init containers will run to completion before normal containers are started, so they can be used to do one-time initialization work.

That said, init containers do have several drawbacks:

Init containers will block the setup of regular containers. Depending on your configuration, there may be a large delay between when an init container finishes and a regular container starts.
Resources requested by init containers are never reclaimed by Kubernetes even after the init container terminates. For example, if an init container requests 500Mi and the rest of containers only request 100Mi, the pod will always have 500Mi of memory reserved throughout its lifetime even though400Mi is never used. For more information, see this issue.
Vertical pod autoscaling does not apply to init containers, so the resources will always need to be manually adjusted. For more information, see this issue.

Because of these limitations, we do not recommend init containers for anything but the simplest use cases. Instead, consider:

performing any local initiation logic as a part of your main container; or
using a dedicated pod created through a one-shot Job to perform remote tasks like database migrations.

Liveness and Readiness Probes

Every container has the ability to define liveness and readiness probes.

We strongly recommend configuring at least the liveness probe for every regular container. This is critical in allowing the cluster to self-heal if there is an issue with a pod.

The readiness probe can be helpful if you need a mechanism to temporarily stop routing traffic to the container, but is unnecessary beyond that.

We support multiple configuration options under liveness_probe_ and readiness_probe_ prefixes (see the module docs for more information).

We also automatically create startup probes based on the configuration for the liveness probe.

Resource Management

Every container has the ability to set resource requests and limits which aid in pod-to-node scheduling and cluster stability. The two most common resources are CPU and memory which should be configured for every container. In particular,

Every container should set both CPU and memory requests as this ensures that the pod will only be assigned to a node that has those resources available.
Every container should also set a memory limit as this ensures that memory leaks will not have unmitigated impact on the other containers colocated on the same node.
CPU limits should be avoided as they are almost always unnecessary. CPU is shared across all containers in an amount proportionate to each container’s CPU request. Setting a CPU limit will cause unnecessary throttling and reduce your overall resource utilization.

In Panfactum clusters, you do not need to manually manage these settings; they are taken care of for you by the vertical pod autoscaler. ⁶ However, you can set minimums and maximums to the autoscaling ranges via minimum_memory, maximum_memory, minimum_cpu, and maximum_cpu. We generally recommend setting minimums but leaving the maximums unset.

Additionally, you can configure the memory_limit_multiplier which controls how much “extra” memory a container may use over its request before it is OOM-killed. The default multiplier is 1.3.

Security

Containers created by Panfactum submodules are secured by default in the following ways:

They are not allowed to run as root and by default run as uid 1000. This can be disabled on a per-container basis by setting run_as_root to true.
All linux capabilities are dropped by default. You can add capabilities on a per-container basis through the linux_capabilities input.
Every container has a ready-only root filesystem. If you need to write to disk, we recommend using temporary directories (see below) or a kube_stateful_set. However, you can override the security measure by setting read_only to false.
The pod security standard is set to baseline (not privileged). The privileged standard can be enabled on a part-container basis by setting privileged to true.

Pod Scheduling

The process of assigning pods to nodes is called pod scheduling. How your pods are scheduled will have a significant impact on your workload’s availability and cost to run. The Panfactum controller submodules expose the following inputs that control scheduling behavior.

Instance Spread

In all clusters, you should be aware of how pods are spread across individual instances and instance classes.

host_anti_affinity_required: If true, pods of the same controller are prevented from being scheduled on the same node. Given that we regularly terminate nodes for maintenance and scaling operations, this is enabled by default to avoid workload disruptions. We do not recommend setting this to false unless termination of all pods in a controller such as a Deployment would not cause a noticeable service disruption.
instance_type_anti_affinity_required: If true, pods of the same controller will be prevented from running on the same instance type (e.g., t4g.medium).
This provides extra resilience in the following scenarios:
- Spot disruptions: Spot disruptions often impact all spot nodes of a single instance type, so you would want to avoid having all pods scheduled on nodes of the same type if spot_nodes_enabled is true.
- Instance type failure: EC2 instances in AWS are all virtualized. It is possible that an AWS software update may impact the functionality of certain instance types while leaving other unscathed. Spreading pods across instance types avoids this failure case.

Note that enabling any of these options will increase the cost of running your workload by lowering overall resource utilization. As a result, we recommend instance_type_anti_affinity_required be enabled only in production environments and disabling host_anti_affinity_required if possible. See our high availability guide for more information.

Geographic Spread

For clusters that span multiple AWS availability zones (AZs), you should be aware of how pods created by a single controller are spread across those AZs.

az_spread_preferred: If true, pods will have a topology spread constraint that balances across availability zones with a whenUnsatifiable: ScheduleAnyway policy. This should be used for Deployments when you want to protect against a single AZ outage, but should not be used for StatefulSets as it may result in a permanent zone imbalance for StatefulSet pods. ⁷

az_spread_required: If true, pods will have a topology spread constraint that balances across availability zones with a whenUnsatifiable: DoNotSchedule policy. This should be used for StatefulSets when you want to protect against a single AZ outage. This takes precedence over az_spread_preferred.
az_anti_affinity_required: If true, no two pods of the same controller will ever be scheduled in the same AZ. This is the most extreme scheduling constraint and should not be used unless you have fewer pods than AZs and the number of pods is static.

Note that enabling any of these options will increase the cost of running your workload by lowering overall resource utilization. As a result, we recommend az_spread_preferred or az_spread_required be enabled only in production environments. See our high availability guide for more information.

Node Classes

We have several different node classes in the Panfactum stack. The default node class is an AMD64 (x86) On-Demand EC2 instance. This is what your pod will be scheduled on if it does not tolerate the taints of other node classes. However, this is also the most expensive class of nodes, often 10x more expensive than needed for most workloads.

We provide the ability to run on cheaper node classes by enabling the following:

spot_nodes_enabled: If true, pods will be allowed to run on Spot instances which can be 50-70% cheaper than On-Demand instances. The tradeoff is that spot instances can be terminated at any time with only a two-minute notice. This is enabled by default as most workloads can gracefully terminate in under two minutes, but if your workload cannot or if it cannot tolerate arbitrary pod disruptions, you should set this to false. ⁸
burstable_nodes_enabled: If true, pods will be allowed to run on Burstable instances. Burstable instances can be a good fit if the workload has low average CPU utilization with the occasional peak. In these scenarios, these instances can be 10-15% cheaper than their M-type counterparts. However, if your average CPU utilization is routinely above ~30%, these instances will actually cost more money as they are run in unlimited mode to prevent unexpected disruptions.
arm_nodes_enabled: If true, pods will be allowed to run on ARM64 instances which tend to be 20-30% cheaper than AMD64 instances. As most container images today are multi-platform, this is enabled by default. If your workload can only run on AMD64 CPU architectures, you should set this to false. ⁹

Node classes are not mutually exclusive. You can have multiple enabled at once to combine their benefits.

We also provide one additional node class: controller nodes. These nodes are special in that they are managed by AWS EKS Node Groups rather than by Karpenter. These are required because some of the core cluster utilities cannot run on nodes provisioned by Karpenter (such as Karpenter itself). By default, your workloads will not be allowed to run on them because they have special lifecycle rules that might be disruptive to your workloads. However, if you have the need, you can allow your workloads to run on these nodes by setting controller_nodes_enabled to true. Note that this will also add spot and burstable tolerations to generated pods since the controller nodes are burstable, spot instances.

Do not run critical workloads on controller nodes unless they can tolerate arbitrary, non-graceful disruptions.

Node Requirements

Sometimes you may want to guarantee that pod is scheduled on a certain type of node. You can accomplish this by setting the node_requirements input.

Panfactum Scheduler

In the bootstrapping guide, your cluster administrator deployed the Panfactum bin-packing scheduler. This enables better bin-packing of pods onto nodes, improving resource utilization and lower your cluster costs.

However, if you need to opt-out of the bin-packing scheduler, you can set panfactum_scheduler_enabled to false.

Pod Disruptions

There are two ways classes of reasons that a pod might be terminated / disrupted: involuntary and voluntary disruptions.

Involuntary Disruptions

Involuntary disruptions occur when a pod is forced to terminate. Oftentimes these disruptions cannot be predicted in advance, and you will have no ability to prevent or delay the pod termination. The most common involuntary disruptions are:

Hardware or network failures on the underlying node that prevent the pod from operating as expected.
Spot interruptions (if using spot_nodes_enabled) or other forced node shutdown scenarios; pods on the node will get approximately two minutes to terminate gracefully.
Node resource exhaustion; if all pods on a node do not correctly set their resource requests and limits, it is possible that the resources on a node can be exhausted and pods will be evicted from the pod to make room based on their priority class.
Pod preemption; if not enough resources exist in a cluster because node autoscaling is falling behind, pods with a lower priority class will be terminated to make room for pods with a higher priority class.
Rollouts; when updating a pod spec in a controller such as a Deployment, the controller will terminate the old pods according to its update strategy.

Voluntary Disruptions

Voluntary disruptions occur when a pod is terminated via the eviction API. Voluntary disruptions can be prevented by using a Pod Disruption Budget (PDB). Voluntary disruptions include:

Resource rightsizing done by the Vertical Pod Autoscaler.
Node scale-in or restarts performed by Karpenter.
Evictions executed by the Descheduler according to our self-healing and re-balancing policies.
Any other scripts that utilize the eviction API; almost all utilities that can terminate pods use the eviction API so that they respect PDBs.

All of our controller submodules such as kube_deployment will automatically create an appropriate PDB; however, there are a few configurable parameters set via module inputs:

unhealthy_pod_eviction_policy: Sets the unhealthy pod eviction policy; the Panfactum default is AlwaysAllow in order to allow the cluster to self-heal if pods end up in a CrashLoopBackOff state.
max_unavailable: Sets the maxUnavailable field of the PDB. The Panfactum default is 1 in order to allow pods to be occasionally evicted as a part of routine cluster maintenance activities. Setting this to 0 will disable voluntary evictions entirely.

Note that PDBs will NOT protect pods that have not yet reached the Running state (e.g., the pod is initializing). We have added custom logic that prevents most evictions for recently created pods to avoid disrupting pods that have not reached a healthy state, but your pods must fully initialize within five minutes to avoid potential issues.

Graceful Termination

Regardless of how a pod is terminated or disrupted, it will be given an opportunity to gracefully shutdown before being forcefully terminated. Containers in the pod will first receive a SIGTERM signal to indicate they must shut down which provides them the opportunity to complete any in-flight work such as responding to received requests or closing database connections. During this time the pod will not be able to receive new network traffic, but it will be able to initiate outbound requests.

If the pod isn’t shutdown after a short delay, the pod will then be forcibly killed as a SIGKILL signal will be sent to all running processes. The delay between the SIGTERM and SIGKILL is the terminationGracePeriodSeconds which can be set on our modules via termination_grace_period_seconds. The default for our modules is 90 seconds.

More detailed information on the pod lifecycle can be found here.

terminationGracePeriodSeconds is only respected when the underlying node is in a running state.

If the underlying node is terminating, there is an upper bound to how long a pod’s effective graceful termination period will be. That bound depends on how the node was terminated.

The node was terminated via the host operating system shutting down: For spot instances, the upper bound is 30 seconds for critical pods and 90 seconds for non-critical pods. For on-demand instances, the upper bound is 30 seconds for critical pods and 59.5 minutes for non-critical pods. Additionally, non-critical pods will terminate before critical pods due to the node shutdown sequencing.
The node was disrupted by Karpenter (via autoscaling or manual disruption): For spot instances, the upper bound for all pods is 2 minutes. For on-demand instances, the upper bound for all pods is 1 hour. However, all pods will begin terminating at once and there will be no differentiation between critical and non-critical pods. See more here and here.

Our overall recommendation is the following. If you are scheduling pods on spot instances, the terminationGracePeriodSeconds should not exceed 90 seconds. If you are scheduling pods on on-demand instances, the terminationGracePeriodSeconds should not exceed 59.5 minutes.

Additionally, if you are adding system-node-critical pods to the stack, you should ensure they contain the appropriate logic to delay shutdown until all non-system-node-critical pods have been terminated.

Accessing Configuration Values at Runtime

There are two common ways to pass configuration to your running contains: environment variables and files.

Environment Variables

To pass environment variables to your containers, each of our controller submodules provides several inputs:

common_env: A key-value mapping of plaintext values that will be set as environment variables.
common_secrets: A key-value mapping of secret values. The values will be stored in a Kubernetes secret so as not to be exposed to cluster users who only have restricted_reader access (see RBAC reference).
common_env_from_config_maps: A key-configuration mapping of environment variables that will be set to a value in an existing Kubernetes ConfigMap. ¹⁰
common_env_from_secrets: A key-configuration mapping of environment variables that will be set to a value in an existing Kubernetes Secret. ¹⁰

All containers will additionally have the following environment variables set by default: ¹¹

POD_IP: The IP address assigned to the pod.
POD_NAME: The name of the pod.
POD_NAMESPACE: The namespace of the pod.
POD_SERVICE_ACCOUNT: The name of the pod’s service account.
POD_TERMINATION_GRACE_PERIOD_SECONDS: The pod’s terminationGracePeriodSeconds field.
CONTAINER_IMAGE: The container’s image field.
CONTAINER_IMAGE_TAG: The container’s image’s tag.
CONTAINER_IMAGE_REPO: The container’s image’s repository within the registry.
CONTAINER_IMAGE_REGISTRY: The container’s image’s registry.
CONTAINER_CPU_REQUEST: The number of CPU cores requested by the container.
CONTAINER_MEMORY_REQUEST: The number of bytes of RAM requested by the container.
CONTAINER_MEMORY_LIMIT: The number of bytes of RAM allowed to be used by the container before an OOM error.
CONTAINER_EPHEMERAL_STORAGE_REQUEST: The number of bytes of ephemeral storage from the node requested by the container.
CONTAINER_EPHEMERAL_STORAGE_LIMIT: The number of bytes of ephemeral storage from the node that the container is allowed to use before the pod is evicted.
NODE_IP: The IP address assigned to the node the pod is scheduled on.
NODE_NAME: The name of the node the pod is scheduled on.

Mounted Files

To mount files inside your containers, each of our controller submodules providers two inputs:

config_map_mounts: A mapping of ConfigMap names to their mount configuration inside each container.
secret_mounts: A mapping of Secret names to their mount configuration inside each container.

Mounted Secrets and ConfigMaps need to inside the pod’s namespace. ¹² The file contents inside each container will be automatically updated if their values change in the source resource.

All containers created by Panfactum submodules will additionally have files mounted under /etc/podinfo which will contain additional metadata about the pod:

labels: The pod’s labels
annotations: The pod’s annotations

These files will be filled with newline-delimited entries with the format <key>="<value>".

Kubernetes API

In case you need to access the full pod manifest from inside one of the pod’s containers, every container created by a Panfactum submodule is authorized to read its own pod manifest via Kubernetes API (e.g., kubectl --namespace $POD_NAMESPACE get pod $POD_NAME --output yaml).

Temporary Directories

As containers are created in read-only mode, you will need to create temporary directories to be able to write to the local file system. ¹³

Our controller submodules provide an input to take care of this provisioning: tmp_directories.

All created directories are size-limited by the size_mb field on each directory configuration object.

Directories may either be node-local or EBS-backed (the default). This is controlled via the node_local boolean field. If you only need a very small amount of space, you can set this to true to improve your pod’s startup time by a few seconds. Otherwise, we recommend you use the default EBS-backed directories to avoid exhausting a node’s limited disk space.

Labels and Annotations

Pod created by Panfactum submodules will automatically be labeled with our standard resource tags.

For a specific workload, you can add additional labels and annotations via the extra_pod_labels and extra_pod_annotations inputs, respectively.

If you need to add a set of labels and/or annotations globally to all pods in a given Kubernetes cluster, you can utilize the common_pod_labels and common_pod_annotations inputs of the kube_policies module which was set up during cluster bootstrapping.

Priority Class

Kubernetes priority classes define precedence for pod preemption. Our kube_policies module installs a few additional priority classes in addition to the Kubernetes defaults.

The following priority classes are included in the Panfactum stack (ordered by precedence):

system-node-critical (2000001000): Pod is required for the node it is scheduled on to function.
system-cluster-critical (2000000000): Pod is required for the cluster itself to function.
cluster-important (100000000): Losing this pod would leave the cluster in a degraded state; some functionality would be lost.
workload-important (10000000): Disrupting this pod might leave certain workloads in a degraded state.
default (0): The default priority assigned to pods.

You can create additional priority classes via kube_priority_classes as your needs require.

Pods created by our controller submodules can have their priority classes set via the priority_class_name input.

Permissions

Permissions in Kubernetes are assigned via a pod’s Service Account this includes not just permissions to access the Kubernetes API but also permissions to access external systems such as AWS and Vault.

All our controller submodules will automatically create a dedicated Service Account for the pods managed by the controller and provide its name via the service_account_name output.

You can use the output to assign additional permissions to the pods by following our guides here.

Caching Images on Nodes

By default, a pod’s images are pulled just-in-time to the nodes where the pod is assigned. This can result in a significant delay to startup time, especially for large images.

We provide a custom controller that will automatically cache images on nodes before the pods are scheduled to ensure that they are available as soon as possible. Panfactum workload submodules already have implicit integration with this controller, and you can control the image cache settings on a per-container basis through the containers input to each workload submodule.

In particular, the container configuration object has two relevant fields:

image_prepull_enabed: Iff true, will schedule a pod to seed the node cache of new nodes immediately upon node creation. (default: true)
image_pin_enabled: Iff true, will pin the image to the node to prevent it from ever being garbage collected. (default: true)

If you are not using a Panfactum workload submodule, you can add images directly to the node cache by using the kube_node_image_cache submodule.

Custom Workload Deployments

All the above information applies specifically to the Panfactum submodules for deploying workloads (e.g., kube_deployment). However, nothing prevents you from deploying workloads to Kubernetes directly without using our modules. Using Panfactum-flavored Kubernetes does not prevent you from doing anything that you could not do using stock Kubernetes.

A common use case is to use controllers that we do not provide by default such as Argo Rollouts.

Keep in mind that if you roll your own modules for deploying resources on Kubernetes, you will be responsible for the dozens of enhancements that are built into our submodules by default. This includes everything from security to autoscaling to high availability and much more.

The exception would be services running on edge compute. Coming soon. ↩
Alternatively, you can use a prebuilt image from a third-party. ↩
In fact, our submodules utilize a shared interface for defining pods: kube_pod. ↩
Not to be confused with Kubernetes namespaces. ↩
tail -f /dev/null is used because this is a noop that prevents the command from terminating. If the container terminated, some controllers like a Deployment would continually try to restart it. ↩
This is only true if using a Panfactum-provided submodule and does not apply to Argo Workflows. ↩
Once a numbered StatefulSet pod is scheduled in a zone, it will always be scheduled in that zone. ScheduleAnyway allows for temporary imbalances to occur, but for a StatefulSet the imbalance will become permanent. ↩
Spot tolerations are added to all pods by default, even those not created by Panfactum submodules. You can add the label "panfactum.com/spot-enabled" = "false" to the pod to prevent spot tolerations from being injected. ↩
ARM64 tolerations are added to all pods by default, even those not created by Panfactum submodules. You can add the label "panfactum.com/arm-enabled" = "false" to the pod to prevent spot tolerations from being injected. ↩
Unlike in base Kubernetes, workloads will be automatically restarted if the referenced ConfigMap or Secret value changes. ↩ ↩²
The environment variables are added to all containers by default, even those not created by Panfactum submodules. You can add the label "panfactum.com/inject-env-enabled" = "false" to the pod to prevent these environment variables from being injected into its containers. ↩
If you need to copy the same Secret or ConfigMap across multiple namespaces, see the Kubernetes Reflector documentation (already installed in cluster as a part of the bootstrapping guide). ↩
Unless you are using kube_stateful_set which adds support for persistent volumes which will retain their data across pod restarts. ↩