Cluster Networking

Background Information

Networking is an extremely deep and broad topic. We have collated a comprehensive set of resources that we strongly recommend to every engineer looking to understand how data moves between subsystems in their cluster:

  • Basic Networking Concepts

    • Networking Fundamentals (Video Playlist): Core networking concepts at each layer of the OSI model

    • TCP/IP Illustrated (Textbook): The reference book for any core networking protocol

    • DNS Explained (Article): A series of short articles provided by Cloudflare explaining the major moving parts of DNS infrastructure.

  • Applied Networking on Linux

    • Iptables Tutorial (Online Textbook): An overview of how networking was originally designed to work on Linux systems (while this technology is not usually used anymore, the concepts described here remain pervasive when discussing Linux networking)

    • Nftables Tutorial (Article): A brief overview of how nftables replaces iptables on modern Linux distributions

    • Netfilter docs (Website): A documentation hub that provides references to just about everything you might want to know about iptables / nftables

    • eBPF Intro (Article): An overview of the new Linux technology that most low-level tooling is adopting as a replacement for static configuration tools like iptables / nftables

  • Networking on Kubernetes

    • Life of a Packet (Video): Great high-level overview of how data moves through the discrete components of the Kubernetes stack

    • A visual guide to Kubernetes networking fundamentals (Article): A condensed and visual reference guide covering the major concepts in the prior video

    • CNI Explained in 7 Minutes (Video): A brief overview of the motivations behind the container network interface (CNI) specification

    • Why replace iptables with eBPF (Article): Blog post providing an overview of the motivations for using eBPF in current-generation networking implementations

Layer 3 / 4

Layers 3 and 4 (L3/4) of the OSI model are where most software-defined networking (SDN) begins. In Kubernetes, this refers specifically to the TCP and IP protocols.

Kubernetes provides some out-of-the-box functionality for L3/4 networking. This includes two core capabilities:

  • Assigning IP addresses to pods running in the cluster
  • Assigning virtual IP addresses to Services that allow traffic to be distributed across sets of pods (a minimal example follows the diagram below)

Services in Kubernetes
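
To make the second capability concrete, here is a minimal sketch of a ClusterIP Service. The name, labels, and ports are hypothetical placeholders for illustration only, not anything shipped by the Panfactum Stack.

```yaml
# Minimal sketch of a ClusterIP Service; "my-app" and the ports are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
spec:
  type: ClusterIP        # Kubernetes assigns a virtual IP from the Service CIDR
  selector:
    app: my-app          # traffic is distributed across pods matching this label
  ports:
    - port: 80           # port exposed on the virtual IP
      targetPort: 8080   # port on the selected pods
```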

However, the built-in capabilities of Kubernetes lack both power and flexibility and don't even implement core Kubernetes APIs such as Network Policies (firewall rules for Kubernetes). This is by design. When using Kubernetes, it is expected that you will leverage Container Network Interface (CNI) plugins to enhance the core L3/4 networking stack.

While there are many options in this space, the two most popular are:

  • Cilium by Isovalent

  • Calico by Tigera

They both offer very similar sets of capabilities, but we chose Cilium for the Panfactum Stack for two key reasons:

  • For a foundational part of the stack, we strongly prioritize project longevity and license security, as switching a CNI plugin is an extremely disruptive operation. Unlike Calico, Cilium is a CNCF-graduated project, meaning that it will likely continue to be community-maintained open-source software for the foreseeable future.

  • Calico focuses on many infrastructure platforms, not just Kubernetes. This can be helpful if you have a diverse deployment environment that mixes many technologies; however, it also means Calico's support surface area is larger, and it will move more slowly as the landscape evolves. Cilium, on the other hand, is purely Kubernetes-focused and often brings new capabilities to market significantly faster (e.g., eBPF-based networking).

Cilium

Cilium takes advantage of eBPF-based networking rather than the Kubernetes default, iptables, in order to add many critical capabilities to the Kubernetes networking stack:

  • Allows for implementation of Network Policies

  • Reduces operational complexity by consolidating several runtime components that would otherwise need to be individually managed and integrated

  • Significantly improves networking performance per dollar

  • Allows for more intelligent load balancing and system resiliency

  • Adds TCP/IP observability that would otherwise be impossible

  • Adds support for cluster meshes

The Panfactum Stack's Cilium deployment (kube_cilium) has a few enhancements over the default Cilium install:

IP Address Management (IPAM)

By default, Kubernetes clusters run their own network (a discrete CIDR block) on top of the network infrastructure provided by the underlying hosts. This provides the flexibility needed to schedule many workloads (pods) on a single node.

However, EKS clusters run the AWS VPC CNI, which assigns pods IP addresses from the containing VPC. That CNI has several downsides, and Cilium provides an alternative implementation with the same functionality, so we use Cilium's implementation instead.
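
As a rough illustration, Cilium's VPC-native IPAM is enabled through its Helm values. The snippet below is a hedged sketch only: the exact keys vary by Cilium version, and it is not necessarily the configuration used by the kube_cilium module.

```yaml
# Hedged sketch: have Cilium allocate pod IPs from AWS ENIs (VPC addresses).
eni:
  enabled: true     # allocate pod IPs from AWS Elastic Network Interfaces
ipam:
  mode: eni         # use the ENI-based IPAM mode instead of the default cluster-pool
```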

Replacing kube-proxy

Despite its name, kube-proxy is not a proxy but a daemon running on each node that programs routing rules so traffic sent to the virtual IPs of Kubernetes Services reaches their backing Endpoints. Unfortunately, it is iptables-based and not natively integrated with Cilium. While Cilium is typically thought of as a CNI (focused on pod / container networking), it also offers a replacement for this part of the default Kubernetes networking setup. We use this replacement in the Panfactum Stack to bring the benefits of eBPF to all parts of the networking infrastructure.
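
For context, Cilium's kube-proxy replacement is also toggled through Helm values. The following is a hedged sketch: key names and accepted values have changed across Cilium releases, and the API server endpoint shown is a placeholder rather than a real address.

```yaml
# Hedged sketch: run Cilium with its eBPF-based kube-proxy replacement enabled.
kubeProxyReplacement: true           # let Cilium handle Service load balancing
k8sServiceHost: <api-server-host>    # placeholder: API server endpoint, needed once kube-proxy is removed
k8sServicePort: 443
```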

Service Mesh

At higher layers of the OSI model, there is an additional category of technologies that sits atop the L3/4 stack: service meshes.

Put abstractly, a service mesh is a system component that mediates network communication between workloads in your cluster. Unfortunately, the Kubernetes ecosystem lacks concrete standards for what a service mesh specifically ought to do, but some progress is being made with the Gateway API. 1

That said, the core motivation for a service mesh is to centrally implement best practices for layer 7 communications (e.g., HTTP, gRPC, etc.) in your cluster.

This not only ensures consistency but also removes from each application developer the burden of implementing best practices like mTLS, metrics reporting, etc.

How a Service Mesh works

In Kubernetes, a service mesh is typically implemented as a central controller that injects sidecar containers into each pod. All traffic entering or leaving a pod is routed through its sidecar container. The sidecar containers thus become the only containers that communicate directly with one another, creating a mesh that captures and orchestrates all network traffic.

Service mesh diagram
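
As a concrete (hedged) illustration of how injection is usually triggered, Linkerd's proxy injector watches for a well-known annotation. The namespace name below is hypothetical.

```yaml
# Hedged sketch: annotating a namespace so Linkerd's admission webhook injects
# the linkerd-proxy sidecar into pods created within it.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                    # hypothetical namespace
  annotations:
    linkerd.io/inject: enabled
```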

There are many implementations of this paradigm. The most popular ones are:

  • Linkerd: Simple and powerful, but with fewer features than the alternatives

  • Istio: Extremely complicated, but allows deep customization

  • Consul: Originally developed by HashiCorp for their Kubernetes alternative, Nomad

Do You Need A Service Mesh?

This can be a divisive topic as many Kubernetes veterans have suffered from poorly implemented service mesh deployments. Moreover, a service mesh is not absolutely essential to the functioning of most clusters.

However, we believe the complexities are overstated and the benefits are numerous. For roughly 15 minutes of additional setup time (if using the Panfactum Stack), you automatically get the following:

  • Automatic mTLS: Never worry about manually implementing HTTPS or managing certificates while also ensuring the highest level of security for all network communications.

  • Locality-aware traffic routing: Improve performance by ensuring traffic is always routed to the fastest-responding server. Significantly reduce cross-AZ network charges.

  • Circuit breaking: Reroute traffic away from unhealthy or resource-constrained pods to improve overall cluster resiliency.

  • Observability: Collect granular data about network throughput and latency that can be used to power observability tooling.

  • Tapping: Gain the ability to introspect network traffic in realtime to improve overall debugging ability.

Additionally, most service meshes come with many more advanced capabilities that are entirely optional but nice to have if you ever need to use them: per-route traffic splitting, multi-cluster networking, etc.

Ultimately, you do not need a service mesh, but you should choose to use one if you are working with Kubernetes in a professional setting. As a result, we include one in the Panfactum stack.

Linkerd2

In the Panfactum Stack, we use Linkerd as our service mesh technology. Linkerd has several benefits over the alternatives, a few of which we highlight below.

You can see our implementation in the kube_linkerd Terraform module.

Native Sidecars

Starting in v1.28, Kubernetes provides a new mechanism for running service mesh containers: native sidecars. While this is an experimental feature in Linkerd, we enable it by default as it solves many operational challenges when working with service mesh sidecars. For more information, see Linkerd's blog post on the topic.
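
For reference, the native sidecar mechanism itself is simply an init container marked as restartable. The sketch below shows the general shape; the image names are hypothetical, and this is not Linkerd's exact injected spec.

```yaml
# Hedged sketch of a native sidecar: an init container with restartPolicy: Always
# keeps running alongside the main containers and is terminated after them.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  initContainers:
    - name: proxy
      image: example.org/proxy:latest   # hypothetical sidecar image
      restartPolicy: Always             # marks this init container as a native sidecar
  containers:
    - name: app
      image: example.org/app:latest     # hypothetical application image
```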

New Release Model

In February 2024, the maintainers of Linkerd (Buoyant) announced a change to how they would be supporting Linkerd that caused some consternation, as it removed the availability of stable releases for non-paying customers.

While we are monitoring the situation, there are a few key takeaways:

  • The Linkerd project’s license has not changed and is still OSS.
  • The “edge” releases continue to be shipped and will continue to contain all of the code in the stable releases. Edge releases simply do not provide any semantic versioning, come much more frequently (weekly), and do not receive back-ported bugfixes.
  • We have updated the Panfactum Stack to use the edge release channels and perform our own stability testing. With the exception of security issues, we will only update the release versions when we release new major versions of the Panfactum Stack.
  • While mildly inconvenient, we hope that this provides the Linkerd maintainers a pathway to fund continued development and maintenance of an excellent project.

Buoyant has posted a more detailed update here.

Inbound Networking

Both Cilium and Linkerd primarily deal with intra- and inter-cluster networking. That said, a key additional concern for Kubernetes administration is how to receive and route traffic from the public internet to workloads running in the cluster.

In most Kubernetes deployments, this involves a two-layer architecture:

  • A LoadBalancer Service that provisions hardware-based load balancing (typically through a cloud provider). Public IP addresses are assigned to this load balancer, which in turn routes traffic to the private IPs of selected pods running in your cluster. These operate at L3/4 of the networking stack. 2
  • An Ingress controller, which receives that traffic and applies security protocols (e.g., TLS/HTTPS) and L7 routing rules. Typically, you have a single Ingress controller for your entire cluster that receives all inbound traffic and then routes it to other pods based on your custom routing rules (e.g., HTTP requests to /api/v1 go to foo-service and requests to /api/v2 go to bar-service; see the sketch below). 3 4
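
The routing rules mentioned above are expressed as Ingress resources. Here is a hedged sketch of that exact example; the hostname and the foo-service / bar-service names are illustrative only, and in the Panfactum Stack you would normally generate this via the kube_ingress submodule rather than writing it by hand.

```yaml
# Hedged sketch: path-based routing handled by an NGINX-backed Ingress controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-routing
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com          # hypothetical hostname
      http:
        paths:
          - path: /api/v1
            pathType: Prefix
            backend:
              service:
                name: foo-service    # requests to /api/v1 go here
                port:
                  number: 80
          - path: /api/v2
            pathType: Prefix
            backend:
              service:
                name: bar-service    # requests to /api/v2 go here
                port:
                  number: 80
```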

While Kubernetes defines these APIs, it does not provide the underlying controllers that provision and manage the real infrastructure backing these resources. It is expected that a cluster operator will install these when setting up their cluster.

In the Panfactum Stack, we use the AWS Load Balancer Controller to provision AWS Network Load Balancers for our LoadBalancer Services. This is the recommended choice for all clusters running on AWS as the AWS-provided load balancers are deeply integrated with systems like EC2 and EKS and are extremely capable.

For your Ingress controller, there are many more choices available. 5 Some service mesh technologies like Istio have Ingress support built in. Additionally, there are dozens of other ingress controllers that each add unique capabilities. The most popular are:

  • Ingress-NGINX Controller: Configures NGINX instances to back your Ingress resources. The de-facto standard choice.

  • Traefik Controller: Configures the Traefik Proxy to back your Ingress resources. A newer technology that touts its easier out-of-the-box configuration.

  • HAProxy Controller: Configures HAProxy to back your Ingress resources. Often used when support is needed for more niche networking protocols.

In the Panfactum Stack, we use the Ingress-NGINX Controller (kube_ingress_nginx) for the following reasons:

  • As another foundational part of the stack, we prioritize project longevity and license security. This is the only controller supported directly by the Kubernetes project.

  • This controller has the most public documentation and community examples, covering almost every deployment scenario imaginable.

  • The controller is simply a light wrapper around NGINX, which has been battle-tested over 20 years across millions of deployments. NGINX itself is extremely performant, featureful, and flexible.

Ultimately, inbound traffic to clusters in the Panfactum Stack looks like the following: 6

Inbound network diagram

Which traffic gets routed to which services is controlled via Ingress manifests. We provide a submodule for defining Ingress manifests for our Ingress-NGINX controller: kube_ingress.

DNS

An important element of Kubernetes networking that we have not yet covered is DNS. Surprisingly, DNS is not strictly required for Kubernetes to function properly, but it is a de-facto requirement due to the ephemeral nature of IP address management in Kubernetes. IP addresses are constantly allocated, assigned, and discarded as workload topologies change, and an automated system is required to keep track of which IPs are assigned to which workloads: that system is DNS.

While Kubernetes originally shipped its own reference implementation of an authoritative DNS resolver (kube-dns), CoreDNS has since replaced it as the default DNS server for Kubernetes. CoreDNS is extremely performant and provides one of the most stable implementations of the Kubernetes DNS-Based Service Discovery specification. This allows your workloads to connect to domains like my-service.my-namespace.svc.cluster.local.

We provide a module for deploying CoreDNS to Panfactum Stack clusters, kube_core_dns, which is set up in the bootstrapping guide. 7

It is important to note that this is a private DNS server. It is only accessible to workloads running inside your cluster; if you try to resolve addresses such as my-service.my-namespace.svc.cluster.local from outside the cluster, DNS resolution will fail.

Public DNS

In addition to internal DNS resolution, you will generally want to have public DNS records point to the public IP addresses of your load balancers so your clients can access your workloads via human-readable names.

The Kubernetes project maintains the ExternalDNS controller for this use case. After you deploy this controller to your cluster, the appropriate DNS records are automatically created in your public DNS provider whenever you create new load balancers or Ingress resources.
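
As a hedged illustration, ExternalDNS reads hostnames either from Ingress host rules or from a well-known annotation on Services; the hostname, Service name, and ports below are hypothetical.

```yaml
# Hedged sketch: ExternalDNS creates a public record pointing at the load
# balancer provisioned for this Service. Names and ports are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: public-api
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.example.com
spec:
  type: LoadBalancer
  selector:
    app: public-api
  ports:
    - port: 443
      targetPort: 8443
```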

In the Panfactum Stack, we provide the kube_external_dns module which will provision DNS records in AWS Route 53. The specific instructions for configuring this are covered in the bootstrapping guide. 8

SSH Bastion

While Ingresses are the preferred solution for managing user traffic, occasionally cluster administrators or application developers will want to connect directly to arbitrary private network resources such as databases. Using Ingresses for this traffic is not preferred for a few reasons:

  • Ingresses offer limited flexibility for what upstream systems you can connect with. 9
  • Ingress connections are typically designed for L7 traffic (HTTP/gRPC), and you often need a lower-level protocol (TCP) to establish connections with systems such as databases.
  • Ingresses typically rely only on the upstream system for authentication and authorization, which does not provide good defense-in-depth for sensitive systems like databases.

As a result, you will need an alternative means to establish ad-hoc direct network connections to private network resources. Kubernetes provides a built-in mechanism called port forwarding to accomplish this via the kubectl CLI.

While this is handy, it has limited flexibility and suffers from reliability issues for long-lived tunnels. In response to these limitations, we provide SSH bastions in the Panfactum Stack that can be leveraged for SSH tunneling: kube_bastion.

These SSH bastions are integrated with the Vault clusters running in each cluster to provide out-of-the-box SSO. The Panfactum devenv provides a convenience command, pf-tunnel, that will automatically create the necessary SSH certificates and build a tunnel that routes traffic through the bastion. 10

This command opens up tunnels that look like this:

Bastion network diagram

In the above example, a tunnel is opened on port 3000 locally, and all traffic sent to localhost:3000 is routed through the SSH bastion before being forwarded to remotehost:80 11.

For more information on SSH and why we chose this architecture over a VPN-based system, see these concept docs.

Network Security

When it comes to securing Kubernetes-based networks, there are two critical tools in an administrator's toolbox:

  • Firewall rules: Controlling which workloads are allowed to connect with each other and external systems
  • TLS: Encrypting network connections via identity-based X.509 certificate exchange

Firewall Rules

A firewall is any component on the network path that can drop packets based on preconfigured rules. In most datacenters, firewall rules are implemented in centralized hardware such as routers and L3 switches. This is called a network firewall. In AWS, this is implemented through Security Groups and Network ACLs.

Kubernetes adds the concept of host firewalls, where additional filtering rules are distributed to every individual node through the CNI plugins you have installed. This allows strong security to be maintained even in more diverse deployment scenarios.

Despite this architectural difference, deploying firewall rules in Kubernetes is a similar process to what you might find in a network firewall. However, there are a few key differences:

  • Firewall rules are implemented through Network Policies. Your CNI plugins will implement these rules in the route tables they build on every node (in the Panfactum Stack, this is handled by Cilium).

  • When using a network firewall, firewall rules can only be implemented for traffic that crosses network boundaries. 12 In Kubernetes, there is no such restriction (in fact, there is no concept of network segmentation at all). Instead, network isolation can be implemented at arbitrary levels of granularity as desired.

  • As routing decisions are made by every node rather than dedicated hardware, implementing Network Policies will consume resources on each node. That said, the overhead is typically very minimal. See Cilium's scalability report for more information.

  • By default, all traffic between all workloads on the Kubernetes network is allowed. This can be changed to deny-by-default if you prefer (a sketch of both approaches follows this list).
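
To make this concrete, here is a hedged sketch using the standard Kubernetes NetworkPolicy API: the first policy switches one namespace to deny-by-default for inbound traffic, and the second re-allows a specific flow. The namespace, labels, and port are hypothetical.

```yaml
# Hedged sketch: deny all inbound traffic within a namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules defined, so all inbound traffic is denied
---
# Hedged sketch: re-allow traffic from app=frontend pods to app=api pods on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: my-app
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```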

TLS

Once connections are allowed via Network Policies, TLS and the associated X.509 certificate infrastructure are key to ensuring the resulting connection is encrypted and secure.

There are two types of X.509 certificates commonly found inside a Kubernetes cluster: private and public. Private certificates are used by the service mesh to secure internal traffic with mTLS. Public certificates are used by the ingress controllers to establish TLS connections with remote clients.

Whether public or private, certificate provisioning and lifecycle management is a complicated task. Fortunately, cert-manager exists to take care of these complexities. While not a built-in Kubernetes component, it is the de-facto way to manage certificates in the Kubernetes ecosystem and is included in the Panfactum Stack via kube_cert_manager.
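
For orientation, certificates are requested declaratively through cert-manager's Certificate resource. The sketch below uses hypothetical names (issuer, DNS name, Secret) and is not the exact configuration shipped by the Panfactum modules.

```yaml
# Hedged sketch of a cert-manager certificate request.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-tls
  namespace: my-app
spec:
  secretName: my-app-tls          # cert-manager writes the issued key pair to this Secret
  dnsNames:
    - my-app.example.com          # hypothetical DNS name
  issuerRef:
    name: internal-ca             # hypothetical (Cluster)Issuer, e.g., backed by Vault or ACME
    kind: ClusterIssuer
```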

Private Certificate Infrastructure

In the Panfactum Stack, private certificates are provisioned by Vault's PKI secrets engine which provides a private certificate authority.

cert-manager is authorized to request and receive certificates for cluster workloads, which are then distributed by the Linkerd controllers to the various linkerd-proxy sidecar containers. In this way, mTLS is implemented for all workloads running in the Panfactum Stack.

One exception is the set of certificates provided to Kubernetes webhook services deployed on the cluster. While cert-manager still manages these certificates, they are distributed by the CA Injector and by custom code inside the relevant Panfactum modules.

Public Certificate Infrastructure

In the Panfactum Stack, public certificates are provisioned by Let's Encrypt through the DNS-01 challenge type.

This is automatically completed whenever you create a new ingress via our kube_ingress submodule.
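
For reference, a Let's Encrypt DNS-01 issuer is typically expressed as a cert-manager ClusterIssuer along the lines of the hedged sketch below. The email, secret name, and the omitted AWS credential wiring are placeholders, not the Panfactum Stack's exact setup.

```yaml
# Hedged sketch: ACME ClusterIssuer that satisfies DNS-01 challenges via Route 53.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: lets-encrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                  # hypothetical contact email
    privateKeySecretRef:
      name: lets-encrypt-account-key        # Secret holding the ACME account key
    solvers:
      - dns01:
          route53:
            region: us-east-1               # credential/IAM wiring omitted for brevity
```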

Footnotes

  1. Prior attempts such as the Service Mesh Interface have been discontinued.

  2. Why not expose your workloads directly to the public internet? There are a few reasons, but the most important is security. A special-purpose hardware load balancer is needed to protect your workloads from certain attacks such as DDoS. A dedicated load balancer is also important for availability. Workloads on Kubernetes are typically ephemeral and are expected to be terminated periodically. On the other hand, load balancers are always available and can provide a buffer between your workloads and public clients, which allows for traffic shifting without causing service disruptions.

  3. Why not connect your workloads directly to the hardware load balancer? While that is certainly possible, you will typically want a standard intermediate system that can handle the many standard concerns that you would otherwise need to solve for yourself. For example, a good Ingress controller will provide dozens of out-of-the-box capabilities such as proxy protocol support, TLS termination, rate-limiting, retries, request size limits, standard header implementations (e.g., CORS, CSP, etc.), a standard telemetry format, session stickiness, etc.

  4. In the future, Ingresses might be replaced with the upcoming Gateway API, but this is not yet something we'd recommend for use in production systems.

  5. And you can even have multiple ingress controllers in the same cluster.

  6. For more information on VPCs and subnets, see these docs.

  7. Note that EKS also deploys CoreDNS to newly provisioned clusters. However, their implementation is lacking in several areas such as in monitoring and high-availability, so we supply our own.

  8. If you host your DNS records in a system other than AWS Route 53, you will need to create your own deployment of ExternalDNS in order for DNS records to automatically sync with your cluster configuration. Fortunately, this is very easy. See the guides here.

  9. Typically, you can only connect with internal cluster Services.

  10. This command is a building block used inside some of our other commands, such as pf-db-tunnel.

  11. remotehost:80 can be any address accessible from inside the cluster. This upstream server does not need to be inside the cluster. For example, both google.com:443 and postgres.default.svc.cluster.local:5432 are valid remote addresses.

  12. Due to ARP, traffic in the same subnet / VLAN does not need to be routed through a router or L3 switch and thus cannot have network firewall rules applied.