Autoscaling
Objective
Deploy the necessary metrics and autoscaling components to automatically rightsize the cluster.
Background
In Kubernetes, there are three distinct flavors of autoscaling:
- Horizontal Pod Autoscaling (HPA): A built-in controller that adjusts the number of pods for a Deployment or StatefulSet. This can be extended to incorporate event-driven autoscaling with tools like KEDA.
- Vertical Pod Autoscaling (VPA): An add-on controller that can be installed to automatically adjust the resource requests and limits of pods based on historical usage.
- Cluster Autoscaling: A category of controllers that adjust the number of nodes running in the cluster. The two most popular projects are the Cluster Autoscaler and Karpenter.
The Panfactum stack makes use of all three types of autoscaling.
Deploy Metrics Server
For autoscaling to work, the Kubernetes API server must provide real-time metrics about individual container CPU and memory usage. Interestingly, this capability is not built into Kubernetes by default but rather powered by an API extension provided by the metrics-server project.
We provide a module to deploy the server: `kube_metrics_server`.
Let's deploy it now:
- Create a new directory adjacent to your `kube_linkerd` module called `kube_metrics_server`.
- Add a `terragrunt.hcl` to that directory that looks like this (a sketch is also shown after this list).
- Set `vpa_enabled` to `false`. We will enable it when we deploy the VPA.
- Add a `module.yaml` that enables the `aws`, `kubernetes`, and `helm` providers.
- Run `terragrunt apply`.
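If you have not written one of these files before, a minimal sketch of the `terragrunt.hcl` is shown below. Treat it as illustrative only: the exact `include` and `source` boilerplate should be copied from the linked example, and the `pf_stack_source` local shown here is an assumption about how your shared `panfactum.hcl` resolves module sources.

```hcl
# terragrunt.hcl -- illustrative sketch; copy the real boilerplate from the linked example

include "panfactum" {
  path   = find_in_parent_folders("panfactum.hcl")
  expose = true
}

terraform {
  # Assumes the shared panfactum.hcl exposes the module source as a local
  source = include.panfactum.locals.pf_stack_source
}

inputs = {
  vpa_enabled = false # flipped to true once the VPA is deployed
}
```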
Let's test to ensure it is working as intended:
- Open k9s (or restart it if it is already open).
- Notice that k9s is now reporting your total cluster resource utilization:
- Navigate to the pods view. Notice that all pods are now reporting CPU and memory metrics:

  The `N/A` fields for the `/R` and `/L` columns indicate that many pods have not had their resource requests (R) or limits (L) set. When we install the VPA, these will be set automatically.
- To diagnose utilization issues in the cluster, we bundle a CLI utility called `kube-capacity` for consolidating granular metrics across the entire cluster. Run `kube-capacity -uc` now:

  ```
  NODE NAMESPACE POD CONTAINER CPU REQUESTS CPU LIMITS CPU UTIL MEMORY REQUESTS MEMORY LIMITS MEMORY UTIL
  * * * * 1240m (21%) 3700m (63%) 514m (8%) 1740Mi (9%) 5334Mi (29%) 4596Mi (25%)
  ....
  ip-10-0-213-182.us-east-2.compute.internal * * * 360m (18%) 1200m (62%) 161m (8%) 530Mi (8%) 1238Mi (20%) 1420Mi (23%)
  ip-10-0-213-182.us-east-2.compute.internal cilium cilium-4d4gf * 100m (5%) 0m (0%) 25m (1%) 10Mi (0%) 0Mi (0%) 126Mi (2%)
  ip-10-0-213-182.us-east-2.compute.internal cilium cilium-4d4gf cilium-agent 0m (0%) 0m (0%) 25m (1%) 0Mi (0%) 0Mi (0%) 126Mi (2%)
  ip-10-0-213-182.us-east-2.compute.internal aws-ebs-csi-driver ebs-csi-node-ph26m * 30m (1%) 100m (5%) 2m (0%) 120Mi (1%) 768Mi (12%) 26Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal aws-ebs-csi-driver ebs-csi-node-ph26m ebs-plugin 10m (0%) 0m (0%) 1m (0%) 40Mi (0%) 256Mi (4%) 11Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal aws-ebs-csi-driver ebs-csi-node-ph26m linkerd-proxy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal aws-ebs-csi-driver ebs-csi-node-ph26m liveness-probe 10m (0%) 0m (0%) 1m (0%) 40Mi (0%) 256Mi (4%) 8Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal aws-ebs-csi-driver ebs-csi-node-ph26m node-driver-registrar 10m (0%) 0m (0%) 1m (0%) 40Mi (0%) 256Mi (4%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal cert-manager jetstack-cert-manager-7b467c7747-xhbrx * 10m (0%) 100m (5%) 2m (0%) 10Mi (0%) 10Mi (0%) 16Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal cert-manager jetstack-cert-manager-7b467c7747-xhbrx cert-manager-controller 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 13Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal cert-manager jetstack-cert-manager-7b467c7747-xhbrx linkerd-proxy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-destination-96c96755b-8thzg * 10m (0%) 100m (5%) 3m (0%) 10Mi (0%) 10Mi (0%) 45Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-destination-96c96755b-8thzg destination 0m (0%) 0m (0%) 2m (0%) 0Mi (0%) 0Mi (0%) 22Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-destination-96c96755b-8thzg linkerd-proxy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 8Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-destination-96c96755b-8thzg policy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 6Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-destination-96c96755b-8thzg sp-validator 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 10Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-identity-cc6dffdf-tt9jm * 10m (0%) 100m (5%) 1m (0%) 10Mi (0%) 10Mi (0%) 15Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-identity-cc6dffdf-tt9jm identity 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 11Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-identity-cc6dffdf-tt9jm linkerd-proxy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-proxy-injector-8497c6bd8-dwv6j * 10m (0%) 100m (5%) 1m (0%) 10Mi (0%) 10Mi (0%) 19Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-proxy-injector-8497c6bd8-dwv6j linkerd-proxy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal linkerd linkerd-proxy-injector-8497c6bd8-dwv6j proxy-injector 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 16Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal metrics-server metrics-server-6df8ffd998-65mkb * 100m (5%) 100m (5%) 4m (0%) 200Mi (3%) 10Mi (0%) 24Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal metrics-server metrics-server-6df8ffd998-65mkb linkerd-proxy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal metrics-server metrics-server-6df8ffd998-65mkb metrics-server 100m (5%) 0m (0%) 3m (0%) 200Mi (3%) 0Mi (0%) 20Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal secrets-csi secrets-csi-secrets-store-csi-driver-zch5w * 70m (3%) 400m (20%) 1m (0%) 140Mi (2%) 400Mi (6%) 26Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal secrets-csi secrets-csi-secrets-store-csi-driver-zch5w linkerd-proxy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal secrets-csi secrets-csi-secrets-store-csi-driver-zch5w liveness-probe 10m (0%) 100m (5%) 1m (0%) 20Mi (0%) 100Mi (1%) 8Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal secrets-csi secrets-csi-secrets-store-csi-driver-zch5w node-driver-registrar 10m (0%) 100m (5%) 1m (0%) 20Mi (0%) 100Mi (1%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal secrets-csi secrets-csi-secrets-store-csi-driver-zch5w secrets-store 50m (2%) 200m (10%) 1m (0%) 100Mi (1%) 200Mi (3%) 12Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal vault vault-1 * 10m (0%) 100m (5%) 41m (2%) 10Mi (0%) 10Mi (0%) 52Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal vault vault-1 linkerd-proxy 0m (0%) 0m (0%) 4m (0%) 0Mi (0%) 0Mi (0%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal vault vault-1 vault 0m (0%) 0m (0%) 38m (1%) 0Mi (0%) 0Mi (0%) 49Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal vault vault-csi-provider-4cndp * 10m (0%) 100m (5%) 3m (0%) 10Mi (0%) 10Mi (0%) 34Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal vault vault-csi-provider-4cndp linkerd-proxy 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 4Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal vault vault-csi-provider-4cndp vault-agent 0m (0%) 0m (0%) 2m (0%) 0Mi (0%) 0Mi (0%) 25Mi (0%)
  ip-10-0-213-182.us-east-2.compute.internal vault vault-csi-provider-4cndp vault-csi-provider 0m (0%) 0m (0%) 1m (0%) 0Mi (0%) 0Mi (0%) 6Mi (0%)
  ```
Deploy the Vertical Pod Autoscaler
Now that we are capturing resource utilization data, we can use the Vertical Pod Autoscaler to automatically rightsize your pods' CPU and memory requests and limits based on historical usage.
We provide a module to deploy the VPA: `kube_vpa`.
Let's deploy it now:
- Create a new directory adjacent to your `kube_metrics_server` module called `kube_vpa`.
- Add a `terragrunt.hcl` to that directory that looks like this.
- Set `vpa_enabled` to `false`. We will enable it in a moment.
- Add a `module.yaml` that enables the `aws`, `kubernetes`, and `helm` providers.
- Run `terragrunt apply`.
- Once the module has successfully deployed, return to all of the previously deployed Kubernetes modules and set `vpa_enabled` to `true`:
  - `kube_aws_ebs_csi`
  - `kube_cert_manager`
  - `kube_cilium`
  - `kube_linkerd`
  - `kube_metrics_server`
  - `kube_secrets_csi`
  - `kube_trust_manager`
  - `kube_vault`
  - `kube_vpa`
- Navigate up to the directory containing all of these modules (this should be the parent of `kube_vpa`). Run `terragrunt run-all apply` to apply all of the modules at once. Note that this might cause your Vault proxy to disconnect as pods are restarted and thus result in an incomplete apply. Simply reconnect the proxy and try the command again.
- After all these modules are updated, you should see many VPA resources in k9s (`:vpa`). After a minute or two, they should begin to provide resource estimates (a trimmed sketch of what one of these resources looks like follows this list):
- Return to the pod view. Notice that all request columns are now populated, as is the memory limit column:
  - We set requests for both CPU and memory so that the Kubernetes scheduler can accurately determine which nodes have enough resources available for pods during the pod-to-node assignment phase.
  - We set memory limits so that memory leaks in any application will not consume unbounded memory on the node and thus cause all other pods on the node to crash with OOM errors.
  - We do not set CPU limits because if a node's CPU is constrained, CPU will automatically be shared across all pods proportional to each pod's CPU request.¹
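For intuition, here is a heavily trimmed sketch of the kind of VerticalPodAutoscaler resource the modules create, expressed in HCL via the Terraform `kubernetes_manifest` resource. The resource and target names are hypothetical and the real modules set additional fields; this is only to show the shape of the objects you will see under `:vpa`.

```hcl
# Illustrative sketch only -- the Panfactum modules manage these for you.
resource "kubernetes_manifest" "vpa_example" {
  manifest = {
    apiVersion = "autoscaling.k8s.io/v1"
    kind       = "VerticalPodAutoscaler"
    metadata = {
      name      = "cert-manager" # hypothetical name
      namespace = "cert-manager"
    }
    spec = {
      # Which workload's containers to rightsize
      targetRef = {
        apiVersion = "apps/v1"
        kind       = "Deployment"
        name       = "jetstack-cert-manager"
      }
      updatePolicy = {
        updateMode = "Auto" # VPA evicts pods to apply updated requests
      }
    }
  }
}
```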
Deploy Karpenter
Historically, cluster autoscaling has almost always been handled by the cluster-autoscaler addon. However, this controller has several limitations and requires significant configuration to achieve an optimal setup for your organization.
Fortunately, a newer option has emerged: Karpenter. It provides more flexibility, better performance, and better cost optimization. Specifically:
- You do not need to choose instance types in advance. Karpenter queries your cloud provider's API for the entire list of available instance types and prices, then provisions the instance types that best fit your workloads at the lowest cost.
- If you allow it, it will prioritize spot instances, which can provide up to a 90% discount over list price. Moreover, it will periodically check whether existing workloads can be migrated to cheaper instances as prices change over time.
- It incorporates the functionality provided by the aws-node-termination-handler, removing the need to run that component altogether.
- It scales extremely well, with the ability to spin up and down hundreds of nodes at a time.
We provide an infrastructure module to deploy it: `kube_karpenter`.
Let's deploy it now:
- Create a new directory adjacent to your `kube_vpa` module called `kube_karpenter`.
- Add a `terragrunt.hcl` to that directory that looks like this.
- For the first time, you can leave `vpa_enabled` set to `true`.
- For `node_subnets`, we strongly recommend using the same subnets you used for `controller_node_subnets` in the `aws_eks` module, unless you have a specific reason not to. A sketch of the resulting inputs follows this list.
- Add a `module.yaml` that enables the `aws`, `kubernetes`, and `helm` providers.
- Run `terragrunt apply`.
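As before, the `include` and `source` boilerplate comes from the linked example; the inputs block of this `terragrunt.hcl` might look roughly like the following. The subnet names here are hypothetical placeholders; use whatever you passed as `controller_node_subnets` in `aws_eks`.

```hcl
# Illustrative inputs for kube_karpenter -- subnet names are placeholders
inputs = {
  vpa_enabled  = true
  node_subnets = [
    "PRIVATE_A", # hypothetical; mirror controller_node_subnets from aws_eks
    "PRIVATE_B",
    "PRIVATE_C",
  ]
}
```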
Deploy NodePools
Karpenter requires instructions for how to perform autoscaling, such as which AMI to use for the underlying nodes. It looks for those instructions in NodePool custom resources.
The Panfactum stack comes with two default NodePools:
- Linux spot nodes:
  - labeled with `panfactum.com/class` set to `spot`
  - highest scheduling precedence
  - assigned a taint to prevent workloads that cannot tolerate arbitrary disruption from accidentally being scheduled on one: `{ key = "spot", value = "true", effect = "NoSchedule" }` (a trimmed sketch of this NodePool follows this list)
- Linux on-demand nodes:
  - labeled with `panfactum.com/class` set to `worker`²
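To make the label and taint concrete, here is a heavily trimmed sketch of a spot NodePool in HCL via `kubernetes_manifest`. This is not the module's actual definition: the real resources also reference a node class, set disruption and consolidation policies, and more, and the exact API version depends on your Karpenter release.

```hcl
# Illustrative sketch of a spot NodePool -- the kube_karpenter_node_pools
# module manages the real (more complete) definitions.
resource "kubernetes_manifest" "spot_node_pool" {
  manifest = {
    apiVersion = "karpenter.sh/v1beta1"
    kind       = "NodePool"
    metadata   = { name = "spot" } # hypothetical name
    spec = {
      template = {
        metadata = {
          labels = { "panfactum.com/class" = "spot" }
        }
        spec = {
          # Restrict this pool to spot capacity
          requirements = [{
            key      = "karpenter.sh/capacity-type"
            operator = "In"
            values   = ["spot"]
          }]
          # Keep disruption-sensitive workloads off these nodes
          taints = [{
            key    = "spot"
            value  = "true"
            effect = "NoSchedule"
          }]
        }
      }
    }
  }
}
```

Workloads that can tolerate arbitrary disruption opt in to these nodes by carrying a matching `spot` toleration.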
Both NodePools are defined in the `kube_karpenter_node_pools` module.
Let's deploy it now:
- Create a new directory adjacent to your `kube_karpenter` module called `kube_karpenter_node_pools`.
- Add a `terragrunt.hcl` to that directory that looks like this.
- Add a `module.yaml` that enables the `kubernetes` provider.
- Run `terragrunt apply`.
Test Cluster Autoscaling
Let's verify that autoscaling now works as expected.
All deployments in the Panfactum stack are configured to disallow multiple replicas of the same pod from running on the same node.³ As a result, we can increase the number of replicas in one of our deployments and verify that Karpenter recognizes that additional nodes need to be scheduled.
- In k9s, navigate to `:deployments`.
- Highlight the `jetstack-cert-manager` deployment and press `s` to trigger the scale dialogue.
- Increase the replica count to a substantial number such as `25`.
- Immediately, Karpenter should recognize that new nodes are required to run these pods and begin to provision them. After 30-60 seconds, the nodes will be registered with Kubernetes, and you should see them under `:nodes`.
- Describe some of the new nodes by pressing `d` while highlighting one. Notice that many of these new nodes are spot instances.
- Scale `jetstack-cert-manager` back down to `2`. Karpenter will automatically take care of node cleanup and deprovisioning. This happens more slowly than scale-up because scale-down proceeds in batches to avoid cluster thrash and service disruption; however, it should complete within 15 minutes.
Next Steps
Now that autoscaling is active, we can proceed to set up inbound networking for our cluster.
Footnotes
1. In general, it is best to avoid setting CPU limits in order to allow CPU usage to spike occasionally as needed. If CPU utilization remains persistently high, cluster autoscaling will kick in to provision new nodes. ↩
2. The EKS-managed nodes created by `aws_eks` have a class of `controller`. ↩
3. This prevents service disruption if a node goes offline. ↩