Autoscaling
Background
In the Panfactum stack, there are three distinct flavors of autoscaling:
- Horizontal Pod Autoscaling (HPA): A built-in controller that adjusts the number of pods in a replicated service (e.g., Deployment, StatefulSet, etc.). This can be extended with tools like KEDA.
- Vertical Pod Autoscaling (VPA): A controller that can be installed to automatically adjust the resource requests and limits of pods based on historical usage.
- Cluster Autoscaling: A category of controllers that adjust the number of nodes running in the cluster. Karpenter provides cluster autoscaling in the Panfactum stack.
Horizontal and vertical autoscaling are opt-in and require creating specific autoscaling manifests. Cluster autoscaling occurs automatically.
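For illustration, here is a minimal sketch of what opting a workload into vertical autoscaling looks like using the upstream VerticalPodAutoscaler API (the names are hypothetical, and the Panfactum modules may generate an equivalent manifest for you):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa                # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment      # hypothetical workload to autoscale
  updatePolicy:
    updateMode: "Auto"            # let the VPA apply its recommendations
```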
Resource Utilization
A key benefit of autoscaling is maximizing resource utilization. Importantly, there are two core components of overall utilization: the ratio of consumed resources to requested resources, and the ratio of requested resources to provisioned resources.
We call the first term consumption efficiency, the second term provisioning efficiency, and their product net efficiency / overall utilization.
Why do we have the concept of requested resources in addition to consumed resources? In short, multi-node scheduling orchestrators like Kubernetes must know in advance how many resources the workload intends to use in order to ensure the workload is placed on a node with sufficient free resources available. 1
Let's run through a simple example:
- A workload requests 780 MB of memory.
- However, throughout its lifetime, the workload actually only used an average of 400 MB of memory, resulting in a 51% consumption efficiency.
- Since EC2 instances only come in discrete sizes, a 1 GB EC2 node was provisioned, resulting in 78% provisioning efficiency.
- Altogether, this produced a 40% net efficiency (51% x 78%).
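Expressed as equations using the numbers above:

```latex
\begin{aligned}
\text{consumption efficiency} &= \frac{\text{consumed}}{\text{requested}} = \frac{400\ \text{MB}}{780\ \text{MB}} \approx 51\% \\
\text{provisioning efficiency} &= \frac{\text{requested}}{\text{provisioned}} = \frac{780\ \text{MB}}{1000\ \text{MB}} = 78\% \\
\text{net efficiency} &= 51\% \times 78\% \approx 40\%
\end{aligned}
```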
We look at these numbers independently as they require different approaches to optimize.
Ultimately, best-in-class net efficiency tends to be around 66%. Any higher would likely jeopardize system stability by not allowing sufficient headroom to handle load spikes.
Consumption Efficiency
Optimizing consumption efficiency requires that resource requests closely match the average resource usage of the workload.
To achieve this, we utilize the VPA to automatically calculate and adjust requests based on historical usage. However, the VPA does not simply request the average resource utilization. Instead, it calculates requests as the 90th percentile (P90) usage x 115%. 2 3
Why? This ensures the workload is requesting enough resources to operate even under the heaviest load spikes, preventing issues such as application crashes or throttling.
Ultimately, this means that in order to optimize consumption efficiency, you must attempt to minimize the gap between the average resource usage and P90 usage.
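Put as a rough formula (a sketch of the relationship, not the VPA's exact recommender logic):

```latex
\text{request} \approx 1.15 \times P_{90}(\text{usage})
\quad\Longrightarrow\quad
\text{consumption efficiency} \approx \frac{\text{mean usage}}{1.15 \times P_{90}(\text{usage})}
```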
Let's visualize this:
As you can see, the tighter the spread of usage values, the better the consumption efficiency.
A good consumption efficiency is typically between 65-85%. This means the majority of the requested resources are being utilized, but you still have enough spare capacity to handle an occasional surge.
There are three key practices that help hit that target range:
- Ensure that a single workload performs only one task.
  For example, consider an application server with a database backend. The container that serves API requests should not also be the container that runs database migrations. Use a separate init container instead.
- Ensure that every workload input consumes similar amounts of resources.
  For example, consider an application server with a few dozen API endpoints. In this example, every incoming request is an input.
  The majority of requests are served relatively quickly. However, one endpoint processes image uploads and consumes a significant amount of CPU and memory when processing the request. Requests to this endpoint will increase the maximum resource usage without having a significant impact on the average. As a result, it should be moved to a separate workload or an async background job.
  Note that this does not mean you need a microservice architecture. You can simply deploy multiple copies of the same codebase with different flags for enabling different endpoints.
- Do not let workloads idle unless necessary.
  For some systems like public API servers, it is necessary to stay online even when not processing requests.
  However, systems that perform asynchronous tasks like processing items in a queue should not remain running when the queue is empty.
  Additionally, avoid doing resource-heavy work in always-online systems; instead, have them trigger async background jobs that can be scaled to zero when not in use (see the sketch after this list).
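As a sketch of what scale-to-zero can look like for such a queue worker, KEDA (mentioned above) can drive a Deployment's replica count to zero when its queue is empty. All names and queue details below are hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: queue-worker                 # hypothetical Deployment that processes the queue
  minReplicaCount: 0                   # scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/000000000000/example-queue  # hypothetical
        queueLength: "5"               # target messages per replica
        awsRegion: us-east-1
```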
Provisioning Efficiency
Optimizing provisioning efficiency requires striking a balance between scheduling constraints and your desired efficiency levels.
Our provisioner, Karpenter, will always attempt to select the least costly nodes that satisfy the resource requests and will even attempt to consolidate nodes when resource requests shrink. However, Karpenter will also attempt to honor the scheduling constraints used by the Kubernetes scheduler for assigning pods to nodes. Scheduling constraints include node selectors and node affinities, pod affinities and anti-affinities, topology spread constraints, and taints / tolerations.
Among other things, these constraints allow us to define a system that is highly available by
- Spreading workload pods evenly across availability zones
- Preventing pods in the same workload unit from running on the same underlying EC2 node
- Preventing pods from running on spot or burstable instances
However, each of these constraints prevents the Kubernetes scheduler from bin-packing pods as tightly as possible. For example, a topology spread constraint that keeps pods spread evenly across AZs forces capacity to be provisioned in every zone, which can leave stranded headroom on each node that a tighter packing would have avoided.
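A hard zone-spread constraint of this kind looks roughly like the following pod template excerpt (the label is hypothetical):

```yaml
topologySpreadConstraints:
  - maxSkew: 1                                # zones may differ by at most one pod
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule          # hard requirement; ScheduleAnyway makes it a preference
    labelSelector:
      matchLabels:
        app: example-app                      # hypothetical workload label
```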
Even when accounting for constraints, you might find that your provisioning efficiency is lower than expected. There are a few common misconceptions that can decrease efficiency:
- Karpenter does not run a global bin-packing optimization algorithm. When attempting to optimize provisioned resources, it only looks to see if (a) one node can be completely removed, (b) two nodes can be consolidated, or (c) one node can be replaced with one of a smaller size. While global bin-packing would result in higher efficiency, it is too computationally expensive to run continuously.
- "Preferred" scheduling constraints such as `ScheduleAnyway` (topology spread) or `preferredDuringSchedulingIgnoredDuringExecution` (pod affinities) are treated as requirements by Karpenter when it is performing its scale-up and scale-down calculations. 4 This can prevent Karpenter from triggering node removal even if pods could be scheduled on other nodes. Karpenter assumes that your preferences should be respected even if that comes at increased cost.
  As a result, do not use preferences unless you intend for the system to actively attempt to achieve your preferences even at increased cost (see the sketch after this list).
- Karpenter will not replace spot nodes with cheaper spot nodes of similar size unless there are at least 15 cheaper instance types. Spot node pricing fluctuates regularly, and replacing nodes on every price change would result in a continuous replacement loop. Instead, Karpenter waits until it is confident that spot pricing is consistently lower on other instance types before it triggers replacement.
  This can run into issues when using smaller instance types where 15 cheaper instance types may never exist. As a result, we limit node lifetimes to 24 hours to force replacement of nodes and allow a chance at cheaper instances at least once per day.
- The "allocatable" resources on each node will be smaller than the AWS list spec. For example, a `t4g.medium` instance comes with 4 GiB of memory, but only about 3.25 GiB will be available in Kubernetes. This is because about 500-750 MiB of resources are reserved for host operating system components such as the kubelet and containerd. This is further exacerbated by DaemonSets: pods that run on all Kubernetes nodes, such as the Cilium CNI.
  This overhead is relatively fixed. As a result, fewer, larger nodes will have higher provisioning efficiency than many small nodes. In the Panfactum stack, the smallest node that can be provisioned is 2 vCPU / 4 GiB for this reason.
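For reference, the sketch mentioned above: a "preferred" (soft) pod anti-affinity, which the scheduler is allowed to violate but which Karpenter still factors into its scale-up and scale-down calculations. The label is hypothetical:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # prefer not to co-locate replicas on one node
          labelSelector:
            matchLabels:
              app: example-app                  # hypothetical workload label
```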
Storage Autoscaling
There is one additional autoscaler that deserves mention: the PVC autoresizer.
PVCs are the storage volumes attached to pods. Typically, these are used to provide persistent disks for database workloads. In the Panfactum stack, we use EBS volumes for the disks by default.
The PVC autoresizer is a type of vertical autoscaler that will monitor these disks and expand them as necessary, allowing your storage capacity to grow automatically without manual intervention.
While this is convenient, there are a few important considerations to remain mindful of:
- This auto-scaling is opt-in and requires adding explicit annotations to your PVCs.
- This only allows for volume expansion, not reduction.
- Resizing can only happen once every six hours (AWS limitation); as a result, your maximum hourly data increase rate must never exceed the value for `resize.topolvm.io/increase` / 6.
Ultimately, this means you will always still be responsible for choosing an initial disk size and growth rate.
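As a sketch, opting a PVC into autoresizing looks roughly like the following. The `resize.topolvm.io/increase` annotation is the one referenced above; the other annotation names come from the upstream pvc-autoresizer project, and all values are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data                          # hypothetical PVC
  annotations:
    resize.topolvm.io/increase: "10Gi"        # grow by 10Gi per resize (at most once every six hours)
    resize.topolvm.io/threshold: "20%"        # resize when free space drops below 20%
    resize.topolvm.io/storage_limit: "100Gi"  # never grow beyond 100Gi
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi                           # you still choose the initial size
```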
Common Issues
- All autoscalers respect PodDisruptionBudgets (PDBs) and will not resize / evict pods if this would violate a PDB. As a result, ensure you design your workloads so that pods can be disrupted during regular operations without causing system downtime. Otherwise, autoscaling will not work properly.
- The VPA will not automatically scale init containers, including native sidecars (open issue).
  This is particularly impactful to the Linkerd sidecars used in the stack. While we have provided default resource parameters that should work for most workloads, you may need to manually configure the resources for the proxy for workloads that exceed the default limits. Alternatively, you can set `config.alpha.linkerd.io/proxy-enable-native-sidecar` to `false` in your pod annotations to run the proxy as a regular, non-init container (though be mindful of the tradeoffs). See the annotation sketch after this list.
- EKS node groups are not automatically scaled. We provide static sizing for these groups in the `aws_eks` module. This is required because Karpenter cannot run on nodes that it itself provisioned.
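For the Linkerd item above, both the manual proxy sizing and the native-sidecar opt-out are controlled through pod annotations; a rough sketch with illustrative values:

```yaml
metadata:
  annotations:
    # Manually size the Linkerd proxy when the defaults are too small (values are illustrative):
    config.linkerd.io/proxy-cpu-request: "100m"
    config.linkerd.io/proxy-memory-request: "128Mi"
    config.linkerd.io/proxy-memory-limit: "256Mi"
    # Or opt out of the native sidecar and run the proxy as a regular container:
    config.alpha.linkerd.io/proxy-enable-native-sidecar: "false"
```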
Footnotes
1. Requested resources are the resources the workload requested be made available before it started running, while consumed resources are the resources actually used. ↩
2. How far back the VPA looks for percentile calculations is configurable, and in the Panfactum stack we have it prioritize data from the trailing four hours. We have found this allows for a good balance between responsive scaling and system stability. ↩
3. For memory, the VPA also tracks OOM events and ensures that enough memory is provisioned to avoid future OOMs. This typically results in a higher request than the 90th percentile x 115%. ↩
4. If a preference cannot be met during scale-up, Karpenter will try again without it. ↩