Performance

Overview

The performance pillar focuses on the ability of your systems to deliver a responsive experience to end-users.

Optimizing performance in modern platform engineering requires addressing two distinct problems:

  • Autoscaling: Autoscaling (automatic scaling) ensures that workloads receive the compute resources necessary to execute successfully. Compute resources can include CPU shares, memory allocations, storage drives, and peripheral devices such as GPUs.

    Assuming that workloads make valid resource requests, 1 there are three types of autoscaling:

    • Vertical Autoscaling: Adjusting the resource requests of individual workloads based on historical resource usage. For example, increasing the amount of memory allocated to an individual server.
    • Horizontal Autoscaling: Adjusting the number of workloads running based on overall system load. For example, provisioning additional webservers to handle load-balanced HTTP traffic (a sketch of this scaling calculation follows this list).
    • Infrastructure Autoscaling: When workloads are provisioned on discrete hosts, infrastructure autoscaling adjusts the number and type of hosts to ensure that all workload resource requests can be met. 2

    Regardless of your system design, effective autoscaling depends on a few key assumptions:

    • Workloads define the resources they need before they begin executing.
    • Scaled workloads are homogeneous in nature. 3
    • Workload scaling (both up and down) does not cause service disruptions. 4
    • Adjustments to workload scale can occur before workloads become resource constrained.
  • Unconstrained Latency: Latency measures the time a single action takes to execute. Depending on the action, latency might include compute time, network transit time, external system response times, workload startup time, etc.

    Unconstrained latency measures latency when a workload has received all its requested compute resources. 5

    We care specifically about unconstrained latency because autoscaling will generally ensure workloads receive their requested resources. However, autoscaling can only take you so far. Even when a workload receives its requested resources, there are many additional factors to consider:

    • Data can only travel so quickly between different systems.
    • Storage drives only read and write data at certain speeds.
    • No cloud provider can provision infinitely large instances.
    • Some systems may not be scalable at all, such as a client's web browser or external vendor APIs.
    • Autoscaling itself can only occur so rapidly.
    • etc.

    While lower latency is always preferred, you must also be aware of hard limits on the maximum latency for certain actions. For example, your underlying infrastructure may require inbound HTTP requests to be processed within a certain time window to prevent them from being dropped. 6
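
Returning to the horizontal autoscaling case above, the following is a minimal sketch of the proportional scaling rule used by autoscalers such as the Kubernetes HPA: the desired replica count is the current count scaled by the ratio of observed to target utilization, clamped to a configured range. The function and parameter names here are illustrative rather than taken from any particular tool.

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Return the replica count a horizontal autoscaler would request.

    Mirrors the proportional rule used by autoscalers such as the
    Kubernetes HPA: scale the current replica count by the ratio of
    observed load to the configured target, then clamp to the allowed range.
    """
    raw = current_replicas * (current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, math.ceil(raw)))

# Example: 4 replicas running at 90% of their CPU request against a
# 60% utilization target -> scale out to 6 replicas.
print(desired_replicas(4, current_utilization=0.90, target_utilization=0.60))
```

Note that this only decides how many replicas to run; the assumptions listed above (homogeneous workloads, disruption-free scaling, and timely adjustments) determine whether acting on that decision actually improves performance.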

Motivation

This pillar provides organizational value for the following reasons:

  • Small differences in performance (< 100ms) can dramatically impact conversion rates (ref).

  • Perceived performance can dramatically impact user satisfaction (ref).

  • Performance impacts SEO and discoverability (ref).

  • Poor performance can lead to serious system issues:

    • Actions exceeding certain latency thresholds might be terminated prior to completion (e.g., dropped HTTP requests).
    • Improperly configured autoscaling can lead to resource contention that causes component crashes (e.g., process OOMs).
    • Improperly configured autoscaling can lead to load-shedding (e.g., incoming actions such as HTTP requests cannot be processed at all).
  • Performant design decisions reduce the amount of future rework necessary as usage increases over time. 7

Benchmark

Metrics

These are common measurements that can help an organization assess its performance in this pillar. They are intended to be assessed within the context of performance on the key platform metrics, not used in a standalone manner.

| Indicator | Business Impact | Ideal Target |
| --- | --- | --- |
| P90 Time to First Byte (TTFB) | Represents the minimum time any client-server round-trip will take, which impacts the performance of all network requests. | < 200ms |
| P90 Largest Contentful Paint (LCP) | Represents the amount of time your client applications take to launch, a metric highly correlated with conversion and retention rates. | < 2.5s |
| P90 Interaction to Next Paint (INP) | Represents the perceived responsiveness of your client applications, a metric highly correlated with user satisfaction. | < 200ms |
| P90 Blocking Time | User interactions with long blocking times are perceived by users as application freezes or crashes. | < 50ms |
| P99 Network Request Latency | Long-lived network requests can occasionally be dropped under the default configurations in most systems, causing application errors. | < 10s |
| P90 Node Launch Time | New infrastructure nodes should launch relatively quickly to handle autoscaling requests. | < 60s |
| P90 Container / Pod Startup Time | New servers should launch extremely quickly to handle autoscaling requests. | < 10s |
| P50 Excess Compute Capacity | Systems should maintain excess capacity to handle sudden changes in load. | ~ 30% |
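
Most of the indicators above are percentile measurements (P50, P90, P99) computed over a window of latency samples. The sketch below shows one common way to compute them, the nearest-rank method; the sample values and names are illustrative only, and real monitoring systems typically derive these from histograms or sketches rather than raw samples.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples provided")
    ordered = sorted(samples)
    rank = math.ceil((pct / 100) * len(ordered))
    return ordered[rank - 1]

# Example: request latencies (in milliseconds) gathered over a measurement window.
latencies_ms = [87.0, 92.0, 110.0, 95.0, 140.0, 2300.0, 101.0, 98.0, 105.0, 99.0]
print(f"P50: {percentile(latencies_ms, 50)} ms")  # typical request
print(f"P90: {percentile(latencies_ms, 90)} ms")  # slow requests
print(f"P99: {percentile(latencies_ms, 99)} ms")  # tail latency / outliers
```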

Organization Goals

These are common goals to help organizations improve their performance on the key platform metrics. While each goal represents a best practice, the level of impact and optimal approach will depend on your organization's unique context.

| Category | Code | Goal |
| --- | --- | --- |
| Autoscaling | P.A.1 | All workloads define their resource requests (at minimum for CPU, memory, and storage). |
| | P.A.2 | Workload resource requests and allocations are automatically updated based on historic usage. |
| | P.A.3 | Workloads horizontally autoscale once they hit the maximum allowable resource request. |
| | P.A.4 | Infrastructure automatically scales based on the number of workloads and their resource requests. |
| | P.A.5 | Autoscaled workload units perform homogeneous actions. |
| | P.A.6 | Workload scaling does not cause service disruptions. |
| | P.A.7 | Workload scaling occurs in time to handle the assigned load. |
| System Design | P.D.1 | Public endpoints are deployed behind a CDN. |
| | P.D.2 | Deployed CDNs cache static data. |
| | P.D.3 | Network requests perform locality-aware routing (choosing the closest upstream servers). |
| | P.D.4 | Network requests are load-balanced using a load-aware algorithm (i.e., not simple round-robin). |
| | P.D.5 | Databases are configured with appropriate indices. |
| | P.D.6 | Database read queries are distributed across all available read replicas (when applicable). |
| | P.D.7 | Expensive, online database queries are precomputed and cached. |
| | P.D.8 | Network requests avoid multiple round-trips whenever possible. |
| | P.D.9 | Network requests have reasonable timeouts (< 10 seconds) to enable quick retries in case of unhealthy upstreams. |
| | P.D.10 | First-party code avoids algorithms with O(N²) or worse time complexity. |
| Monitoring | P.M.1 | Organization maintains a mechanism to simulate artificial load and regularly tests system capacity. |
| | P.M.2 | System monitors are deployed to detect sudden performance regressions in deployed code. |
| | P.M.3 | System monitors are deployed to detect actions that exceed predefined latency thresholds. |
| | P.M.4 | System monitors are deployed to detect system resource contention. |
| | P.M.5 | Observability platform supports online code profiling. |
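
As one concrete illustration of goal P.D.9, the hedged sketch below issues a network request with a bounded timeout and a small number of quick retries. It assumes the third-party requests library and a hypothetical internal endpoint; production code would typically add jittered backoff and a retry budget, and would only retry idempotent requests.

```python
import time

import requests  # third-party HTTP client (pip install requests)

UPSTREAM_URL = "https://api.internal.example.com/health"  # hypothetical endpoint

def fetch_with_retries(url: str, timeout_s: float = 10.0, attempts: int = 3):
    """GET a URL with a bounded timeout so unhealthy upstreams fail fast,
    retrying a few times with a short linear backoff between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=timeout_s)
            response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
            return response
        except requests.RequestException as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(0.5 * (attempt + 1))  # simple linear backoff
    raise last_error

if __name__ == "__main__":
    print(fetch_with_retries(UPSTREAM_URL).status_code)
```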

Footnotes

  1. Workloads should not make requests that cannot be fulfilled, such as asking for instance sizes that your infrastructure provider cannot provision.

  2. In Kubernetes, this is referred to as cluster autoscaling.

  3. In other words, a single workload cannot be deployed to handle actions with vastly different performance characteristics. For example, you cannot use the same workload for both OLTP and OLAP.

  4. For example, when scaling down an HTTP service, requests are properly drained before the service exits.

  5. If a workload does not make an appropriate resource request, we must first address that issue before beginning to measure unconstrained latency. Latency in a resource-constrained scenario means we need to examine the autoscaling functionality.

  6. This is the case when using Kubernetes, for example.

  7. While premature optimization is the root of all evil, planning for future capacity needs is a practical and necessary engineering practice. A good rule of thumb is to ensure that your system design can handle 10x the current load with ease. Anything beyond that requires a more serious cost-benefit analysis.