Performance
Overview
The performance pillar focuses on the ability of your systems to deliver a responsive experience to end-users.
Optimizing performance in modern platform engineering requires addressing two distinct concerns:
- Autoscaling: Autoscaling (automatic scaling) ensures that workloads receive the compute resources necessary to execute successfully. Compute resources can include CPU shares, memory allocations, storage drives, and peripheral devices such as GPUs.
Assuming that workloads make valid resource requests,[^1] there are three types of autoscaling (a sketch of the underlying scaling math appears at the end of this overview):
- Vertical Autoscaling: Adjusting the resource requests of individual workloads based on historical resource usage. For example, increasing the amount of memory allocated to an individual server.
- Horizontal Autoscaling: Adjusting the number of workloads running based on overall system load. For example, provisioning additional webservers to handle load-balanced HTTP traffic.
- Infrastructure Autoscaling: When workloads are provisioned on discrete hosts, infrastructure autoscaling adjusts the number and type of hosts to ensure that all workload resource requests can be met.[^2]
Regardless of your system design, effective autoscaling depends on a few key assumptions:
- Each autoscaled workload unit performs homogeneous actions.[^3]
- Scaling operations do not cause service disruptions.[^4]
- Unconstrained Latency: Latency measures the time a single action takes to execute. Depending on the action, latency might include compute time, network transit time, external system response times, workload startup time, etc.
Unconstrained latency is the latency observed when a workload has received all of its requested compute resources.[^5]
We care specifically about unconstrained latency because autoscaling will generally ensure workloads receive their requested resources. However, autoscaling can only take you so far. Even when a workload receives its requested resources, there are many additional factors to consider:
- Data can only travel so quickly between different systems.
- Storage drives only read and write data at certain speeds.
- No cloud provider can provision infinitely large instances.
- Some systems cannot be scaled at all, such as a client's web browser or external vendor APIs.
- Autoscaling itself can only occur so rapidly.
- etc.
While less latency is always preferred, you must also be aware of hard limits on the maximum latency for certain actions. For example, your underlying infrastructure may require inbound HTTP requests to be processed within a certain time window to prevent them from being dropped.[^6]
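To make the autoscaling types above concrete, here is a minimal sketch (in Python, not tied to any particular platform) of the two calculations most autoscalers perform: sizing a resource request from historical usage (vertical) and computing a replica count from load ratios (horizontal). The percentile, headroom value, and sample data are illustrative assumptions; the replica formula mirrors the ratio-based approach used by common horizontal autoscalers such as the Kubernetes HPA.

```python
import math


def recommend_memory_request_mb(usage_samples_mb: list[float], headroom: float = 0.15) -> float:
    """Vertical autoscaling sketch: derive a memory request from historical usage.

    Uses a simple nearest-rank P90 of observed usage plus a safety headroom.
    Both the percentile and the 15% headroom are illustrative assumptions.
    """
    ordered = sorted(usage_samples_mb)
    p90 = ordered[round(0.9 * (len(ordered) - 1))]
    return p90 * (1 + headroom)


def desired_replicas(current_replicas: int, current_load: float, target_load: float) -> int:
    """Horizontal autoscaling sketch: ratio-based replica calculation.

    desired = ceil(current_replicas * current_load / target_load), the approach
    used by common horizontal autoscalers (e.g., the Kubernetes HPA).
    """
    return math.ceil(current_replicas * (current_load / target_load))


# Example: 4 replicas averaging 80% CPU utilization against a 50% target -> 7 replicas.
print(desired_replicas(4, 0.80, 0.50))
# Example: size a memory request from (fabricated) historical usage samples in MB.
print(recommend_memory_request_mb([512, 580, 610, 620, 655, 700]))
```

In practice these calculations are paired with cooldowns and rate limits so that scaling itself does not introduce instability, since autoscaling can only occur so rapidly.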
Motivation
This pillar provides organizational value for the following reasons:
- Small differences in performance (< 100ms) can dramatically impact conversion rates (ref).
- Perceived performance can dramatically impact user satisfaction (ref).
- Performance impacts SEO and discoverability (ref).
- Poor performance can lead to serious system issues:
  - Actions exceeding certain latency thresholds might be terminated prior to completion (e.g., dropped HTTP requests).
  - Improperly configured autoscaling can lead to resource contention that causes component crashes (e.g., process OOMs).
  - Improperly configured autoscaling can lead to load-shedding (e.g., incoming actions such as HTTP requests cannot be processed at all).
- Performant design decisions reduce the amount of future rework necessary as usage increases over time.[^7]
Benchmark
Metrics
These are common measurements that can help an organization assess its performance in this pillar. They are intended to be assessed within the context of performance on the key platform metrics, not in a standalone manner.
Indicator | Business Impact | Ideal Target |
---|---|---|
P90 Time to First Byte (TTFB) | Represents the minimum time any client-server round-trip will take, which affects the performance of all network requests. | < 200ms |
P90 Largest Contentful Paint (LCP) | Represents how long your client applications take to render their main content on launch, a metric highly correlated with conversion and retention rates. | < 2.5s |
P90 Interaction to Next Paint (INP) | Represents the perceived responsiveness of your client applications, a metric highly correlated with user satisfaction. | < 200ms |
P90 Blocking Time | Interactions with long blocking times are perceived by users as application freezes or crashes. | < 50ms |
P99 Network Request Latency | Long-lived network requests can be dropped by the default configurations of most systems, causing application errors. | < 10s |
P90 Node Launch Time | New infrastructure nodes should launch relatively quickly to deal with autoscaling requests. | < 60s |
P90 Container / Pod Startup Time | New servers should launch extremely quickly to deal with autoscaling requests. | < 10s |
P50 Excess Compute Capacity | Systems should maintain excess capacity to handle sudden changes in load. | ~ 30% |
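As a rough illustration of how the percentile targets above might be evaluated, the sketch below computes a nearest-rank percentile from raw latency samples and compares it against a threshold. The sample values and the `ttfb_ms` name are fabricated for illustration; in practice these figures would come from your observability stack.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; adequate for illustrating threshold checks."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[index]


# Fabricated TTFB samples in milliseconds; real values would come from monitoring.
ttfb_ms = [120.0, 95.0, 180.0, 240.0, 150.0, 110.0, 170.0, 130.0, 160.0, 145.0]

p90_ttfb = percentile(ttfb_ms, 90)
status = "OK" if p90_ttfb < 200 else "investigate"
print(f"P90 TTFB = {p90_ttfb:.0f}ms (target < 200ms): {status}")
```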
Organization Goals
These are common goals to help organizations improve their performance on the key platform metrics. While each goal represents a best practice, the level of impact and optimal approach will depend on your organization's unique context.
Category | Code | Goal |
---|---|---|
Autoscaling | P.A.1 | All workloads define their resource requests (at minimum for CPU, memory, and storage). |
| | P.A.2 | Workload resource requests and allocations are automatically updated based on historical usage. |
| | P.A.3 | Workloads horizontally autoscale once they hit the maximum allowable resource request. |
| | P.A.4 | Infrastructure automatically scales based on the number of workloads and workload resource requests. |
| | P.A.5 | Autoscaled workload units perform homogeneous actions. |
| | P.A.6 | Workload scaling does not cause service disruptions. |
| | P.A.7 | Workload scaling occurs in time to handle assigned load. |
System Design | P.D.1 | Public endpoints are deployed behind a CDN. |
| | P.D.2 | Deployed CDNs cache static data. |
| | P.D.3 | Network requests perform locality-aware routing (choosing the closest upstream servers). |
| | P.D.4 | Network requests are load-balanced using a load-aware algorithm (i.e., not simple round-robin). |
| | P.D.5 | Databases are configured with appropriate indices. |
| | P.D.6 | Database read queries are distributed across all available read replicas (when applicable). |
| | P.D.7 | Expensive, online database queries are precomputed and cached. |
| | P.D.8 | Network requests avoid multiple round-trips whenever possible. |
| | P.D.9 | Network requests have reasonable timeouts (< 10 seconds) to enable quick retries in case of unhealthy upstreams (see the sketch after this table). |
| | P.D.10 | First-party code avoids unnecessarily high algorithmic time complexity. |
Monitoring | P.M.1 | Organization maintains a mechanism to simulate artificial load and regularly tests system capacity. |
| | P.M.2 | System monitors are deployed to detect sudden performance regressions in deployed code. |
| | P.M.3 | System monitors are deployed to detect actions that exceed predefined latency thresholds. |
| | P.M.4 | System monitors are deployed to detect system resource contention. |
| | P.M.5 | Observability platform supports online code profiling. |
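As a sketch of goal P.D.9, the snippet below bounds each network request with an explicit timeout and retries quickly against unhealthy upstreams. It assumes the Python `requests` library; the attempt count is illustrative, and a production version would typically add backoff and jitter.

```python
import requests  # assumed HTTP client; any client that supports explicit timeouts works


def fetch_with_retry(url: str, attempts: int = 3, timeout_s: float = 10.0) -> requests.Response:
    """Bounded timeout plus quick retries so one unhealthy upstream cannot
    stall the caller indefinitely (goal P.D.9)."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout_s)
            response.raise_for_status()  # treat 4xx/5xx responses as errors
            return response
        except requests.RequestException:
            if attempt == attempts:
                raise  # out of retries; surface the underlying error
    raise RuntimeError("attempts must be >= 1")
```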
Footnotes
[^1]: Workloads should not make requests that cannot be fulfilled, such as asking for instance sizes that your infrastructure provider cannot provision.
[^2]: In Kubernetes, this is referred to as cluster autoscaling.
[^3]: In other words, a single workload cannot be deployed to handle actions with vastly different performance characteristics. For example, you cannot use the same workload for both OLTP and OLAP.
[^4]: For example, when scaling down an HTTP service, requests are properly drained before the service exits.
[^5]: If a workload does not make an appropriate resource request, we must first address that issue before beginning to measure unconstrained latency. Latency in a resource-constrained scenario means we need to examine the autoscaling functionality.
[^6]: This is the case when using Kubernetes, for example.
[^7]: While premature optimization is the root of all evil, planning for future capacity needs is a practical and necessary engineering practice. A good rule of thumb is to ensure that your system design can handle 10x the current load with ease. Anything beyond that requires a more serious cost-benefit analysis.