Performance
Overview
The performance pillar focuses on the ability of your systems to deliver a responsive experience to end-users.
Optimizing performance in modern platform engineering requires addressing two distinct concerns:
- Autoscaling: Autoscaling (automatic scaling) ensures that workloads receive the compute resources necessary to execute successfully. Compute resources can include CPU shares, memory allocations, storage drives, and peripheral devices such as GPUs.
Assuming that workloads make valid resource requests,[^1] there are three types of autoscaling (a sketch of the underlying scaling math appears at the end of this overview):
- Vertical Autoscaling: Adjusting the resource requests of individual workloads based on historical resource usage. For example, increasing the amount of memory allocated to an individual server.
- Horizontal Autoscaling: Adjusting the number of workloads running based on overall system load. For example, provisioning additional webservers to handle load-balanced HTTP traffic.
- Infrastructure Autoscaling: When workloads are provisioned on discrete hosts, infrastructure autoscaling adjusts the number and type of hosts to ensure that all workload resource requests can be met.[^2]
Regardless of your system design, effective autoscaling depends on a few key assumptions:
- Each autoscaled workload unit performs homogeneous actions.[^3]
- Scaling operations do not cause service disruptions.[^4]
- Unconstrained Latency: Latency measures the time a single action takes to execute. Depending on the action, latency might include compute time, network transit time, external system response times, workload startup time, etc.
Unconstrained latency is the latency observed when a workload has received all of its requested compute resources.[^5]
We care specifically about unconstrained latency because autoscaling will generally ensure workloads receive their requested resources. However, autoscaling can only take you so far. Even when a workload receives its requested resources, there are many additional factors to consider:
- Data can only travel so quickly between different systems.
- Storage drives only read and write data at certain speeds.
- No cloud provider can provision infinitely large instances.
- Some systems cannot be scaled at all, such as a client's web browser or external vendor APIs.
- Autoscaling itself can only occur so rapidly.
- etc.
While less latency is always preferred, you must also be aware of hard limits on the maximum latency for certain actions. For example, your underlying infrastructure may require inbound HTTP requests to be processed within a certain time window to prevent them from being dropped.[^6]
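To make the autoscaling types above concrete, here is a minimal sketch (in Python, not tied to any particular platform) of the two calculations most autoscalers perform: sizing a resource request from historical usage (vertical) and computing a replica count from load ratios (horizontal). The percentile, headroom value, and sample data are illustrative assumptions; the replica formula mirrors the ratio-based approach used by common horizontal autoscalers such as the Kubernetes HPA.

```python
import math


def recommend_memory_request_mb(usage_samples_mb: list[float], headroom: float = 0.15) -> float:
    """Vertical autoscaling sketch: derive a memory request from historical usage.

    Uses a simple nearest-rank P90 of observed usage plus a safety headroom.
    Both the percentile and the 15% headroom are illustrative assumptions.
    """
    ordered = sorted(usage_samples_mb)
    p90 = ordered[round(0.9 * (len(ordered) - 1))]
    return p90 * (1 + headroom)


def desired_replicas(current_replicas: int, current_load: float, target_load: float) -> int:
    """Horizontal autoscaling sketch: ratio-based replica calculation.

    desired = ceil(current_replicas * current_load / target_load), the approach
    used by common horizontal autoscalers (e.g., the Kubernetes HPA).
    """
    return math.ceil(current_replicas * (current_load / target_load))


# Example: 4 replicas averaging 80% CPU utilization against a 50% target -> 7 replicas.
print(desired_replicas(4, 0.80, 0.50))
# Example: size a memory request from (fabricated) historical usage samples in MB.
print(recommend_memory_request_mb([512, 580, 610, 620, 655, 700]))
```

In practice these calculations are paired with cooldowns and rate limits so that scaling itself does not introduce instability, since autoscaling can only occur so rapidly.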
Motivation
This pillar provides organizational value for the following reasons:
- Small differences in performance (< 100ms) can dramatically impact conversion rates (ref).
- Perceived performance can dramatically impact user satisfaction (ref).
- Performance impacts SEO and discoverability (ref).
- Poor performance can lead to serious system issues:
  - Actions exceeding certain latency thresholds might be terminated prior to completion (e.g., dropped HTTP requests).
  - Improperly configured autoscaling can lead to resource contention that causes component crashes (e.g., process OOMs).
  - Improperly configured autoscaling can lead to load-shedding (e.g., incoming actions such as HTTP requests cannot be processed at all).
- Performant design decisions reduce the amount of future rework necessary as usage increases over time.[^7]
Benchmark
Metrics
These are common measurements that can help an organization assess its performance in this pillar. They are intended to be assessed within the context of performance on the key platform metrics, not in a standalone manner.
Indicator | Business Impact | Ideal Target |
---|---|---|
P90 Time to First Byte (TTFB) | Represents the minimum time any client-server round-trip will take, which affects the performance of all network requests. | < 200ms |
P90 Largest Contentful Paint (LCP) | Represents how long your client applications take to render their main content on launch, a metric highly correlated with conversion and retention rates. | < 2.5s |
P90 Interaction to Next Paint (INP) | Represents the perceived responsiveness of your client applications, a metric highly correlated with user satisfaction. | < 200ms |
P90 Blocking Time | Interactions with long blocking times are perceived by users as application freezes or crashes. | < 50ms |
P99 Network Request Latency | Long-lived network requests can be dropped by the default configurations of most systems, causing application errors. | < 10s |
P90 Node Launch Time | New infrastructure nodes should launch relatively quickly to deal with autoscaling requests. | < 60s |
P90 Container / Pod Startup Time | New servers should launch extremely quickly to deal with autoscaling requests. | < 10s |
P50 Excess Compute Capacity | Systems should maintain excess capacity to handle sudden changes in load. | ~ 30% |
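As a rough illustration of how the percentile targets above might be evaluated, the sketch below computes a nearest-rank percentile from raw latency samples and compares it against a threshold. The sample values and the `ttfb_ms` name are fabricated for illustration; in practice these figures would come from your observability stack.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; adequate for illustrating threshold checks."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[index]


# Fabricated TTFB samples in milliseconds; real values would come from monitoring.
ttfb_ms = [120.0, 95.0, 180.0, 240.0, 150.0, 110.0, 170.0, 130.0, 160.0, 145.0]

p90_ttfb = percentile(ttfb_ms, 90)
status = "OK" if p90_ttfb < 200 else "investigate"
print(f"P90 TTFB = {p90_ttfb:.0f}ms (target < 200ms): {status}")
```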
Organization Goals
These are common goals to help organizations improve their performance on the key platform metrics. While each goal represents a best practice, the level of impact and optimal approach will depend on your organization's unique context.
Category | Code | Goal |
---|---|---|
Autoscaling | P.A.1 | All workloads define their resource requests (at minimum for CPU, memory, and storage). |
| | P.A.2 | Workload resource requests and allocations are automatically updated based on historical usage. |
| | P.A.3 | Workloads horizontally autoscale once they hit the maximum allowable resource request. |
| | P.A.4 | Infrastructure automatically scales based on the number of workloads and workload resource requests. |
| | P.A.5 | Autoscaled workload units perform homogeneous actions. |
| | P.A.6 | Workload scaling does not cause service disruptions. |
| | P.A.7 | Workload scaling occurs in time to handle assigned load. |
System Design | P.D.1 | Public endpoints are deployed behind a CDN. |
| | P.D.2 | Deployed CDNs cache static data. |
| | P.D.3 | Network requests perform locality-aware routing (choosing the closest upstream servers). |
| | P.D.4 | Network requests are load-balanced using a load-aware algorithm (i.e., not simple round-robin). |
| | P.D.5 | Databases are configured with appropriate indices. |
| | P.D.6 | Database read queries are distributed across all available read replicas (when applicable). |
| | P.D.7 | Expensive, online database queries are precomputed and cached. |
| | P.D.8 | Network requests avoid multiple round-trips whenever possible. |
| | P.D.9 | Network requests have reasonable timeouts (< 10 seconds) to enable quick retries in case of unhealthy upstreams (see the sketch after this table). |
| | P.D.10 | First-party code avoids unnecessarily high algorithmic time complexity. |
Monitoring | P.M.1 | Organization maintains a mechanism to simulate artificial load and regularly tests system capacity. |
| | P.M.2 | System monitors are deployed to detect sudden performance regressions in deployed code. |
| | P.M.3 | System monitors are deployed to detect actions that exceed predefined latency thresholds. |
| | P.M.4 | System monitors are deployed to detect system resource contention. |
| | P.M.5 | Observability platform supports online code profiling. |
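As a sketch of goal P.D.9, the snippet below bounds each network request with an explicit timeout and retries quickly against unhealthy upstreams. It assumes the Python `requests` library; the attempt count is illustrative, and a production version would typically add backoff and jitter.

```python
import requests  # assumed HTTP client; any client that supports explicit timeouts works


def fetch_with_retry(url: str, attempts: int = 3, timeout_s: float = 10.0) -> requests.Response:
    """Bounded timeout plus quick retries so one unhealthy upstream cannot
    stall the caller indefinitely (goal P.D.9)."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout_s)
            response.raise_for_status()  # treat 4xx/5xx responses as errors
            return response
        except requests.RequestException:
            if attempt == attempts:
                raise  # out of retries; surface the underlying error
    raise RuntimeError("attempts must be >= 1")
```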
Footnotes
[^1]: Workloads should not make requests that cannot be fulfilled, such as asking for instance sizes that your infrastructure provider cannot provision.
[^2]: In Kubernetes, this is referred to as cluster autoscaling.
[^3]: In other words, a single workload cannot be deployed to handle actions with vastly different performance characteristics. For example, you cannot use the same workload for both OLTP and OLAP.
[^4]: For example, when scaling down an HTTP service, requests are properly drained before the service exits.
[^5]: If a workload does not make an appropriate resource request, we must first address that issue before beginning to measure unconstrained latency. Latency in a resource-constrained scenario means we need to examine the autoscaling functionality.
[^6]: This is the case when using Kubernetes, for example.
[^7]: While premature optimization is the root of all evil, planning for future capacity needs is a practical and necessary engineering practice. A good rule of thumb is to ensure that your system design can handle 10x the current load with ease. Anything beyond that requires a more serious cost-benefit analysis.