Deploying Workloads: The Basics
Objective
Learn how to deploy your code and workloads onto the Kubernetes clusters created by the Panfactum Stack.
Prerequisites
Overview
In the Panfactum Stack, your code will run on the Kubernetes clusters you created in the bootstrapping guide. [1]
The basic process for running first-party code in a Kubernetes cluster is as follows:
- Build a container image containing the code you wish to deploy. We recommend using our BuildKit addon to build from Dockerfiles. [2]
- Create a first-party IaC module that defines the Kubernetes resources you wish to deploy. We provide pre-configured submodules for many of the most common Kubernetes resources:

  | Kubernetes Type | Submodule | Description |
  | --- | --- | --- |
  | Namespace | kube_namespace | A logical container for Kubernetes resources. Most resources must be assigned a namespace. |
  | Deployment | kube_deployment | Orchestrates stateless workloads such as webservers. |
  | StatefulSet | kube_stateful_set | Orchestrates stateful workloads such as databases. |
  | CronJob | kube_cron_job | Periodically runs one-shot processes on a schedule such as system maintenance scripts. |
  | DaemonSet | kube_daemon_set | Runs a pod on every tolerated node in the cluster. |
  | Argo Workflow | wf_spec | An advanced pattern available through our Workflow Engine Addon. |

- Deploy the IaC module by configuring it in the appropriate environment directory and running terragrunt apply.
- (Optional) Connect the module deployment to our CI / CD addon.
Core Concepts
Most of the controllers (e.g., Deployments, StatefulSets, etc.) create and manage Kubernetes Pods. [3] As all pods have very similar configuration requirements, we will review the core concepts here.
Containers
A Pod is a colocated group of Containers that all share the same Kubernetes permissions and Linux namespaces. [4]
Most of our controller submodules have a containers
input which allows you to configure the containers that will
be present in the pods that the controller creates.
There are a few required fields for every defined container:

- name: A unique name for the container within the pod (example: foo)
- image_registry: The domain name for a container image registry (example: docker.io)
- image_repository: The image repository within the registry (example: library/nginx)
- image_tag: The tag for a specific image within the registry (example: 1.27.1)
- command: The command in list form to execute when starting a container from the given image (example: ["/bin/bash", "-c", "tail -f /dev/null"])
Given the above example values, a pod would be created with a container called foo that uses the image from docker.io/library/nginx:1.27.1 as the root file system and runs /bin/bash -c "tail -f /dev/null". [5]
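As a minimal sketch, the example above would be expressed in your IaC module via the containers input (other required module inputs are omitted for brevity):

```hcl
# Illustrative containers input for a controller submodule such as kube_deployment.
containers = [
  {
    name             = "foo"
    image_registry   = "docker.io"
    image_repository = "library/nginx"
    image_tag        = "1.27.1"
    command          = ["/bin/bash", "-c", "tail -f /dev/null"]
  }
]
```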
Init Containers
By default, every container in containers
is created as a regular container. However, any container has the ability
to be an init container by setting init
to true
on its specification.
Init containers will run to completion before normal containers are started, so they can be used to do one-time initialization work.
That said, init containers do have several drawbacks:
- Init containers will block the setup of regular containers. Depending on your configuration, there may be a large delay between when an init container finishes and a regular container starts.
- Resources requested by init containers are never reclaimed by Kubernetes even after the init container terminates. For example, if an init container uses 500Mi and the rest of the containers only use 100Mi, the pod will always have 500Mi of memory reserved throughout its lifetime even though 400Mi is never used. For more information, see this issue.
- Vertical pod autoscaling does not apply to init containers, so their resources will always need to be manually adjusted. For more information, see this issue.
Because of these limitations, we do not recommend init containers for anything but the simplest use cases. Instead, consider:
- performing any local initialization logic as a part of your main container; or
- using a dedicated pod created through a one-shot Job to perform remote tasks like database migrations.
Liveness and Readiness Probes
Every container has the ability to define liveness and readiness probes.
We strongly recommend configuring at least the liveness probe for every regular container. This is critical in allowing the cluster to self-heal if there is an issue with a pod.
The readiness probe can be helpful if you need a mechanism to temporarily stop routing traffic to the container, but is unnecessary beyond that.
We support multiple configuration options under liveness_probe_
and readiness_probe_
prefixes (see the
module docs for more information).
We also automatically create startup probes based on the configuration for the liveness probe.
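As a rough sketch, probe settings live alongside the other container fields. The exact field names below (liveness_probe_type, liveness_probe_port, liveness_probe_route) are assumptions illustrating the liveness_probe_ prefix; check the module docs for the authoritative options:

```hcl
containers = [
  {
    name             = "server"
    image_registry   = "docker.io"
    image_repository = "library/nginx"
    image_tag        = "1.27.1"
    command          = ["nginx", "-g", "daemon off;"]

    # Assumed field names under the liveness_probe_ prefix; see the module docs
    # for the options that are actually supported.
    liveness_probe_type  = "HTTP"
    liveness_probe_port  = 80
    liveness_probe_route = "/healthz"
  }
]
```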
Resource Management
Every container has the ability to set resource requests and limits which aid in pod-to-node scheduling and cluster stability. The two most common resources are CPU and memory which should be configured for every container. In particular,
- Every container should set both CPU and memory requests as this ensures that the pod will only be assigned to a node that has those resources available.
- Every container should also set a memory limit as this ensures that memory leaks will not have unmitigated impact on the other containers colocated on the same node.
- CPU limits should be avoided as they are almost always unnecessary. CPU is shared across all containers in an amount proportionate to each container's CPU request. Setting a CPU limit will cause unnecessary throttling and reduce your overall resource utilization.
In Panfactum clusters, you do not need to manually manage these settings; they are taken care of for you by the vertical pod autoscaler. [6] However, you can set minimums and maximums for the autoscaling ranges via minimum_memory, maximum_memory, minimum_cpu, and maximum_cpu. We generally recommend setting minimums but leaving the maximums unset.

Additionally, you can configure the memory_limit_multiplier, which controls how much "extra" memory a container may use over its request before it is OOM-killed. The default multiplier is 1.3.
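For example, a container might pin its autoscaling minimums and memory headroom as follows. This is a sketch: it assumes these fields are set per container and that CPU is expressed in millicores and memory in MB, so confirm the units and placement in the module docs:

```hcl
containers = [
  {
    name             = "server"
    image_registry   = "docker.io"
    image_repository = "library/nginx"
    image_tag        = "1.27.1"
    command          = ["nginx", "-g", "daemon off;"]

    # Lower bounds for the vertical pod autoscaler; maximums are left unset.
    minimum_cpu    = 10
    minimum_memory = 100

    # Allow 30% headroom over the memory request before the container is OOM-killed.
    memory_limit_multiplier = 1.3
  }
]
```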
Security
Containers created by Panfactum submodules are secured by default in the following ways:
- They are not allowed to run as root and by default run as uid 1000. This can be disabled on a per-container basis by setting run_as_root to true.
- All Linux capabilities are dropped by default. You can add capabilities on a per-container basis through the linux_capabilities input.
- Every container has a read-only root filesystem. If you need to write to disk, we recommend using temporary directories (see below) or a kube_stateful_set. However, you can override this security measure by setting read_only to false.
- The pod security standard is set to baseline (not privileged). The privileged standard can be enabled on a per-container basis by setting privileged to true.
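If a workload genuinely needs looser settings, the overrides named above can be applied per container, roughly as follows (a sketch; only loosen what the workload actually requires):

```hcl
containers = [
  {
    name             = "legacy-app"
    image_registry   = "docker.io"
    image_repository = "library/nginx"
    image_tag        = "1.27.1"
    command          = ["nginx", "-g", "daemon off;"]

    # Security overrides -- each one weakens the default hardening.
    run_as_root        = true            # run as uid 0 instead of the default uid 1000
    linux_capabilities = ["NET_ADMIN"]   # add back a specific dropped capability
    read_only          = false           # allow writes to the root filesystem
  }
]
```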
Pod Scheduling
The process of assigning pods to nodes is called pod scheduling. How your pods are scheduled will have a significant impact on your workload's availability and cost to run. The Panfactum controller submodules expose the following inputs that control scheduling behavior.
Instance Spread
In all clusters, you should be aware of how pods are spread across individual instances and instance classes.
- host_anti_affinity_required: If true, pods of the same controller are prevented from being scheduled on the same node. Given that we regularly terminate nodes for maintenance and scaling operations, this is enabled by default to avoid workload disruptions. We do not recommend setting this to false unless termination of all pods in a controller such as a Deployment would not cause a noticeable service disruption.
- instance_type_anti_affinity_required: If true, pods of the same controller will be prevented from running on the same instance type (e.g., t4g.medium). This provides extra resilience in the following scenarios:
  - Spot disruptions: Spot disruptions often impact all spot nodes of a single instance type, so you would want to avoid having all pods scheduled on nodes of the same type if spot_nodes_enabled is true.
  - Instance type failure: EC2 instances in AWS are all virtualized. It is possible that an AWS software update may impact the functionality of certain instance types while leaving others unscathed. Spreading pods across instance types avoids this failure case.

Note that enabling any of these options will increase the cost of running your workload by lowering overall resource utilization. As a result, we recommend enabling instance_type_anti_affinity_required only in production environments and disabling host_anti_affinity_required if possible.
See our high availability guide for more information.
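For example, a production Deployment might enable both inputs (a sketch showing only the relevant module inputs):

```hcl
# Module inputs on a controller submodule such as kube_deployment
host_anti_affinity_required          = true  # default: never co-locate two pods of this controller on one node
instance_type_anti_affinity_required = true  # production only: also spread across instance types
```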
Geographic Spread
For clusters that span multiple AWS availability zones (AZs), you should be aware of how pods created by a single controller are spread across those AZs.
- az_spread_preferred: If true, pods will have a topology spread constraint that balances across availability zones with a whenUnsatisfiable: ScheduleAnyway policy. This should be used for Deployments when you want to protect against a single AZ outage, but should not be used for StatefulSets as it may result in a permanent zone imbalance for StatefulSet pods. [7]
- az_spread_required: If true, pods will have a topology spread constraint that balances across availability zones with a whenUnsatisfiable: DoNotSchedule policy. This should be used for StatefulSets when you want to protect against a single AZ outage. This takes precedence over az_spread_preferred.
- az_anti_affinity_required: If true, no two pods of the same controller will ever be scheduled in the same AZ. This is the most extreme scheduling constraint and should not be used unless you have fewer pods than AZs and the number of pods is static.
Note that enabling any of these options will increase the cost of running your workload by lowering overall
resource utilization. As a result, we recommend az_spread_preferred
or az_spread_required
be enabled only in production environments.
See our high availability guide for more information.
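For example (a sketch showing only the relevant module inputs):

```hcl
# For a Deployment: prefer AZ balance without ever blocking scheduling
az_spread_preferred = true

# For a StatefulSet: require AZ balance so a single-AZ outage cannot take out all replicas
# az_spread_required = true
```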
Node Classes
We have several different node classes in the Panfactum stack. The default node class is an AMD64 (x86) On-Demand EC2 instance. This is what your pod will be scheduled on if it does not tolerate the taints of other node classes. However, this is also the most expensive class of nodes, often 10x more expensive than needed for most workloads.
We provide the ability to run on cheaper node classes by enabling the following:
- spot_nodes_enabled: If true, pods will be allowed to run on Spot instances, which can be 50-70% cheaper than On-Demand instances. The tradeoff is that Spot instances can be terminated at any time with only a two-minute notice. This is enabled by default as most workloads can gracefully terminate in under two minutes, but if your workload cannot, or if it cannot tolerate arbitrary pod disruptions, you should set this to false. [8]
- burstable_nodes_enabled: If true, pods will be allowed to run on Burstable instances. Burstable instances can be a good fit if the workload has low average CPU utilization with the occasional peak. In these scenarios, these instances can be 10-15% cheaper than their M-type counterparts. However, if your average CPU utilization is routinely above ~30%, these instances will actually cost more money as they are run in unlimited mode to prevent unexpected disruptions. Note that in the Panfactum Stack, Burstable instances are also Spot instances, so setting burstable_nodes_enabled to true will implicitly set spot_nodes_enabled to true. This is because we assume workloads scheduled on Burstable instances will not require 100% guaranteed, persistent resource allocations, which almost always means they are safe for Spot instances as well.
- arm_nodes_enabled: If true, pods will be allowed to run on ARM64 instances, which tend to be 20-30% cheaper than AMD64 instances. As most container images today are multi-platform, this is enabled by default. If your workload can only run on the AMD64 CPU architecture, you should set this to false. [9]
We also provide one additional node class: controller nodes. These nodes are special in that they are managed by AWS EKS Node Groups rather than by Karpenter. These are required because some of the core cluster utilities cannot run on nodes provisioned by Karpenter (such as Karpenter itself). By default, your workloads will not be allowed to run on them because they have special lifecycle rules that might be disruptive to your workloads. However, if you have the need, you can allow your workloads to run on these nodes by setting controller_nodes_enabled to true.
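Putting the node class toggles together, a cost-optimized workload might use something like the following (a sketch of module inputs; the values shown reflect the defaults described above except where noted):

```hcl
spot_nodes_enabled       = true   # default; set to false if the workload cannot tolerate two-minute terminations
burstable_nodes_enabled  = true   # implicitly sets spot_nodes_enabled = true
arm_nodes_enabled        = true   # default; set to false for AMD64-only images
controller_nodes_enabled = false  # default; keep workloads off the EKS-managed controller nodes
```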
Node Requirements
Sometimes you may want to guarantee that a pod is scheduled on a certain type of node. You can accomplish this by setting the node_requirements input.
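As a sketch, assuming node_requirements accepts a map of node labels to acceptable values (verify the exact shape in the module docs):

```hcl
# Hypothetical example: only schedule onto ARM64 nodes from specific instance families.
node_requirements = {
  "kubernetes.io/arch"                = ["arm64"]
  "karpenter.k8s.aws/instance-family" = ["m7g", "m6g"]
}
```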
Panfactum Scheduler
In the bootstrapping guide, your cluster administrator deployed the Panfactum bin-packing scheduler. This enables better bin-packing of pods onto nodes, improving resource utilization and lowering your cluster costs.
However, if you need to opt out of the bin-packing scheduler, you can set panfactum_scheduler_enabled to false.
Pod Disruptions
There are two classes of reasons that a pod might be terminated or disrupted: involuntary and voluntary disruptions.
Involuntary Disruptions
Involuntary disruptions occur when a pod is forced to terminate. Oftentimes these disruptions cannot be predicted in advance, and you will have no ability to prevent or delay the pod termination. The most common involuntary disruptions are:
- Hardware or network failures on the underlying node that prevent the pod from operating as expected.
- Spot interruptions (if using spot_nodes_enabled) or other forced node shutdown scenarios; pods on the node will get approximately two minutes to terminate gracefully.
- Node resource exhaustion; if all pods on a node do not correctly set their resource requests and limits, it is possible that the resources on a node can be exhausted and pods will be evicted from the node to make room based on their priority class.
- Pod preemption; if not enough resources exist in a cluster because node autoscaling is falling behind, pods with a lower priority class will be terminated to make room for pods with a higher priority class.
- Rollouts; when updating a pod spec in a controller such as a Deployment, the controller will terminate the old pods according to its update strategy.
Voluntary Disruptions
Voluntary disruptions occur when a pod is terminated via the eviction API. Voluntary disruptions can be prevented by using a Pod Disruption Budget (PDB). Voluntary disruptions include:
- Resource rightsizing done by the Vertical Pod Autoscaler.
- Node scale-in or restarts performed by Karpenter.
- Evictions executed by the Descheduler according to our self-healing and re-balancing policies.
- Any other scripts that utilize the eviction API; almost all utilities that can terminate pods use the eviction API so that they respect PDBs.
All of our controller submodules such as kube_deployment will automatically create an appropriate PDB; however, there are a few configurable parameters set via module inputs:
- unhealthy_pod_eviction_policy: Sets the unhealthy pod eviction policy; the Panfactum default is AlwaysAllow in order to allow the cluster to self-heal if pods end up in a CrashLoopBackOff state.
- max_unavailable: Sets the maxUnavailable field of the PDB. The Panfactum default is 1 in order to allow pods to be occasionally evicted as a part of routine cluster maintenance activities. Setting this to 0 will disable voluntary evictions entirely.
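For example (a sketch; the values shown are the Panfactum defaults):

```hcl
unhealthy_pod_eviction_policy = "AlwaysAllow"
max_unavailable               = 1   # set to 0 to disable voluntary evictions entirely
```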
Graceful Termination
Regardless of how a pod is terminated or disrupted, it will be given an opportunity to gracefully shut down before being forcefully terminated. Containers in the pod will first receive a SIGTERM signal to indicate they must shut down, which provides them the opportunity to complete any in-flight work such as responding to received requests or closing database connections. During this time the pod will not be able to receive new network traffic, but it will be able to initiate outbound requests.

If the pod hasn't shut down after a short delay, it will be forcibly killed: a SIGKILL signal will be sent to all running processes.

The delay between the SIGTERM and SIGKILL is the terminationGracePeriodSeconds, which can be set on our modules via termination_grace_period_seconds. The default for our modules is 90 seconds.
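For example, a workload that drains long-running requests might extend the window (a sketch of the module input):

```hcl
termination_grace_period_seconds = 300  # default is 90; time between SIGTERM and SIGKILL
```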
More detailed information on the pod lifecycle can be found here.
Accessing Configuration Values at Runtime
There are two common ways to pass configuration to your running containers: environment variables and files.
Environment Variables
To pass environment variables to your containers, each of our controller submodules provides several inputs:
- common_env: A key-value mapping of plaintext values that will be set as environment variables.
- common_secrets: A key-value mapping of secret values. The values will be stored in a Kubernetes Secret so as not to be exposed to cluster users who only have restricted_reader access (see RBAC reference).
- common_env_from_config_maps: A key-configuration mapping of environment variables that will be set to a value in an existing Kubernetes ConfigMap. [10]
- common_env_from_secrets: A key-configuration mapping of environment variables that will be set to a value in an existing Kubernetes Secret. [10]
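For example (a sketch; the shapes of the common_env_from_config_maps and common_env_from_secrets configuration objects are assumptions, so confirm them in the module docs):

```hcl
common_env = {
  LOG_LEVEL = "info"
}

common_secrets = {
  API_TOKEN = var.api_token   # assumes a declared input variable; stored in a Kubernetes Secret
}

# Assumed configuration object shape for referencing existing resources.
common_env_from_config_maps = {
  FEATURE_FLAGS = {
    config_map_name = "feature-flags"
    key             = "flags.json"
  }
}

common_env_from_secrets = {
  DB_PASSWORD = {
    secret_name = "database-credentials"
    key         = "password"
  }
}
```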
All containers will additionally have the following environment variables set by default: [11]
- POD_IP: The IP address assigned to the pod.
- POD_NAME: The name of the pod.
- POD_NAMESPACE: The namespace of the pod.
- POD_SERVICE_ACCOUNT: The name of the pod's service account.
- POD_TERMINATION_GRACE_PERIOD_SECONDS: The pod's terminationGracePeriodSeconds field.
- CONTAINER_IMAGE: The container's image field.
- CONTAINER_IMAGE_TAG: The container's image's tag.
- CONTAINER_IMAGE_REPO: The container's image's repository within the registry.
- CONTAINER_IMAGE_REGISTRY: The container's image's registry.
- CONTAINER_CPU_REQUEST: The number of CPU cores requested by the container.
- CONTAINER_MEMORY_REQUEST: The number of bytes of RAM requested by the container.
- CONTAINER_MEMORY_LIMIT: The number of bytes of RAM allowed to be used by the container before an OOM error.
- CONTAINER_EPHEMERAL_STORAGE_REQUEST: The number of bytes of ephemeral storage from the node requested by the container.
- CONTAINER_EPHEMERAL_STORAGE_LIMIT: The number of bytes of ephemeral storage from the node that the container is allowed to use before the pod is evicted.
- NODE_IP: The IP address assigned to the node the pod is scheduled on.
- NODE_NAME: The name of the node the pod is scheduled on.
Mounted Files
To mount files inside your containers, each of our controller submodules provides two inputs:

- config_map_mounts: A mapping of ConfigMap names to their mount configuration inside each container.
- secret_mounts: A mapping of Secret names to their mount configuration inside each container.
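For example (a sketch; the mount configuration object shown here, with a mount_path field, is an assumption, so confirm the exact fields in the module docs):

```hcl
config_map_mounts = {
  "app-config" = {
    mount_path = "/etc/app/config"
  }
}

secret_mounts = {
  "app-tls" = {
    mount_path = "/etc/app/tls"
  }
}
```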
Mounted Secrets and ConfigMaps need to be inside the pod's namespace. [12] The file contents inside each container will be automatically updated if their values change in the source resource.
All containers created by Panfactum submodules will additionally have files mounted under /etc/podinfo which contain additional metadata about the pod:

- labels: The pod's labels
- annotations: The pod's annotations

These files will be filled with newline-delimited entries with the format <key>="<value>".
Kubernetes API
In case you need to access the full pod manifest from inside one of the pod's
containers, every container created by a Panfactum submodule is authorized to
read its own pod manifest via Kubernetes API (e.g., kubectl --namespace $POD_NAMESPACE get pod $POD_NAME --output yaml
).
Temporary Directories
As containers are created in read-only mode, you will need to create temporary directories to be able to write to the local file system. [13]
Our controller submodules provide an input to take care of this provisioning: tmp_directories. All created directories are size-limited by the size_mb field on each directory configuration object.
Directories may either be node-local or EBS-backed (the default). This is controlled
via the node_local
boolean field. If you only need a very small amount of space, you
can set this to true
to improve your pod's startup time by a few seconds. Otherwise,
we recommend you use the default EBS-backed directories to avoid exhausting a node's
limited disk space.
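For example (a sketch; the mount_path field is an assumption, while size_mb and node_local are the fields described above):

```hcl
tmp_directories = {
  "cache" = {
    mount_path = "/tmp/cache"
    size_mb    = 1024          # EBS-backed by default
  }
  "scratch" = {
    mount_path = "/tmp/scratch"
    size_mb    = 10
    node_local = true          # tiny scratch space; shaves a few seconds off pod startup
  }
}
```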
Labels and Annotations
Pods created by Panfactum submodules will automatically be labeled with our standard resource tags.
You can add additional labels and annotations via the extra_pod_labels
and extra_pod_annotations
inputs, respectively.
Priority Class
Kubernetes priority classes define precedence for pod preemption. Our kube_priority_classes module installs a few additional priority classes in addition to the Kubernetes defaults.
The following priority classes are included in the Panfactum stack (ordered by precedence):
- system-node-critical (2000001000): Pod is required for the node it is scheduled on to function.
- system-cluster-critical (2000000000): Pod is required for the cluster itself to function.
- cluster-important (100000000): Losing this pod would leave the cluster in a degraded state; some functionality would be lost.
- workload-important (10000000): Disrupting this pod might leave certain workloads in a degraded state.
- default (0): The default priority assigned to pods.
You can create additional priority classes via kube_priority_classes as your needs require.

Pods created by our controller submodules can have their priority classes set via the priority_class_name input.
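For example (a sketch of the module input):

```hcl
priority_class_name = "workload-important"
```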
Permissions
Permissions in Kubernetes are assigned via a pod's Service Account; this includes not just permissions to access the Kubernetes API but also permissions to access external systems such as AWS and Vault.
All our controller submodules will automatically create a dedicated Service Account for the pods managed by the controller and provide its name via the service_account_name output.
You can use the output to assign additional permissions to the pods by following our guides here.
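As a sketch, the output can be wired into other first-party modules or outputs that grant additional permissions (the module source below is a placeholder, and the surrounding wiring is purely illustrative):

```hcl
module "deployment" {
  source = "<kube_deployment_module_source>"  # placeholder; see the module docs for the real source

  # ... containers and other inputs ...
}

# Expose the generated Service Account name so other modules (e.g., ones granting
# AWS or Vault access) can reference it.
output "service_account_name" {
  value = module.deployment.service_account_name
}
```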
Custom Workload Deployments
All the above information applies specifically to the Panfactum submodules for deploying workloads (e.g., kube_deployment). However, nothing prevents you from deploying workloads to Kubernetes directly without using our modules. Using Panfactum-flavored Kubernetes does not prevent you from doing anything that you could do with stock Kubernetes.
A common use case is to use controllers that we do not provide by default such as Argo Rollouts.
Footnotes
1. The exception would be services running on edge compute. Coming soon.
2. Alternatively, you can use a prebuilt image from a third party.
3. In fact, our submodules utilize a shared interface for defining pods: kube_pod.
4. Not to be confused with Kubernetes namespaces.
5. tail -f /dev/null is used because this is a noop that prevents the command from terminating. If the container terminated, some controllers like a Deployment would continually try to restart it.
6. This is only true if using a Panfactum-provided submodule and does not apply to Argo Workflows.
7. Once a numbered StatefulSet pod is scheduled in a zone, it will always be scheduled in that zone. ScheduleAnyway allows for temporary imbalances to occur, but for a StatefulSet the imbalance will become permanent.
8. Spot tolerations are added to all pods by default, even those not created by Panfactum submodules. You can add the label "panfactum.com/spot-enabled" = "false" to the pod to prevent spot tolerations from being injected.
9. ARM64 tolerations are added to all pods by default, even those not created by Panfactum submodules. You can add the label "panfactum.com/arm-enabled" = "false" to the pod to prevent ARM64 tolerations from being injected.
10. Unlike in base Kubernetes, workloads will be automatically restarted if the referenced ConfigMap or Secret value changes.
11. The environment variables are added to all containers by default, even those not created by Panfactum submodules. You can add the label "panfactum.com/inject-env-enabled" = "false" to the pod to prevent these environment variables from being injected into its containers.
12. If you need to copy the same Secret or ConfigMap across multiple namespaces, see the Kubernetes Reflector documentation (already installed in the cluster as a part of the bootstrapping guide).
13. Unless you are using kube_stateful_set, which adds support for persistent volumes that retain their data across pod restarts.