Observability
Overview
Observability focuses on the active collection, parsing, and indexing of live system data for the purposes of providing system insights, alerting, and auditability. Seven types of data are typically collected:
- Metrics: Metrics provide point-in-time measures of various system properties that are aggregated into time-series data stores. The vast majority of services will expose metrics in the industry-standard Prometheus or OpenMetrics data format (see the metrics sketch after this list).
- Traces: Also known as "distributed traces," traces record the paths taken by data as it propagates through a system. Traces are composed of "spans" which track how long various system components spent processing the data. Spans record a location identifier, a start time, an end time, and other optional metadata. Tracing provides two key capabilities (see the tracing sketch after this list):
  - the ability to quickly identify issues that occur at the granularity of specific actions such as individual HTTP requests
  - aggregated span metrics that can expose problems with specific system components such as performance issues or elevated error rates
- Logs: Logs are the human-readable information output from a system component. Logs can either be in plaintext or a structured format such as JSON for easier parsing. Most system components output logs to stdout / stderr or a specific log file. An observability platform will collect logs from each component, parse them (adding additional structure and metadata), and store them in a data store optimized for text searching (see the logging sketch after this list).

  Of the first three data types (metrics, traces, and logs), logs allow for the most verbosity and flexibility; however, they are also the most expensive to manage and store. Logs often have an associated "log level" (e.g., info, warn, error, etc.) that can help automated systems identify which are the most important. In distributed systems, they can also contain trace and / or span IDs to correlate verbose logs with specific traces.
- Errors: Application code can often generate unhandled errors.[1] Error tracking involves deploying a global wrapper that captures all unhandled errors in order to send them to a central monitoring platform (see the error-capture sketch after this list).[2]

  While "error" has language-specific meaning, the vast majority of languages have the concept of an Error / Exception object that contains a stack trace. This stack trace can be used to pinpoint the exact line of code that caused the issue, often with the help of uploaded source maps.[3] Error tracking can often also capture other helpful data about the local state of the system when the error was generated.
- Application Profiles: Many languages execute on a runtime (e.g., JVM, V8, etc.) that can provide performance data from the perspective of individual functions or methods. This includes granular information on memory and CPU statistics as well as language-specific information such as runtime locks. Profiling is typically deployed when optimizing performance-critical code paths, often for the purpose of cost savings.[4]
- Synthetic Tests: Synthetic tests are automated programs run against live systems for the purpose of collecting data in a controlled manner. These can be as simple as uptime tests, such as repeatedly issuing an HTTP request against an endpoint to ensure it is online. They can also be complex end-to-end tests run against production as part of a continuous testing paradigm. This is explained further in our downtime visibility guide.
- Real User Monitoring (RUM): For systems that have a client-side UI, a common practice is to embed monitoring scripts into the distributed binaries / bundles in order to track various aspects of the user experience. These tools can capture not only basic data such as load times, errors, and logs, but also more advanced information such as the entire user session. This can be helpful not just for debugging but also to inform product development.
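
For example, the Prometheus exposition format is typically produced by an instrumentation library rather than by hand. Below is a minimal sketch using the Python `prometheus_client` library; the metric names, labels, and port are illustrative assumptions, not a required schema.

```python
from prometheus_client import Counter, Histogram, start_http_server
import random, time

# Hypothetical metrics; name them to match your own conventions.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

if __name__ == "__main__":
    # Expose a /metrics endpoint that any Prometheus-compatible scraper can poll.
    start_http_server(8000)
    while True:
        with LATENCY.time():  # records the duration of the block as an observation
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.labels(method="GET", status="200").inc()
```

Scraping `http://localhost:8000/metrics` would then return plaintext lines such as `http_requests_total{method="GET",status="200"} 42.0`.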
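
For tracing, spans are usually created through an instrumentation SDK such as OpenTelemetry. The sketch below uses the OpenTelemetry Python API; the span names and attributes are illustrative assumptions, and a real deployment would also configure the SDK with an exporter so spans actually reach the observability platform.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # illustrative instrumentation scope name

def handle_request(order_id: str) -> None:
    # Parent span: one unit of work (e.g., an individual HTTP request).
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)  # optional metadata on the span
        # Child span: a sub-step whose duration we want to measure separately.
        with tracer.start_as_current_span("query_database"):
            pass  # stand-in for the actual database call

handle_request("order-123")
```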
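
For structured logging, each log line is emitted as a single JSON object carrying a level and, where available, the active trace / span IDs. This is a minimal sketch using the Python standard library; the field names and hard-coded IDs are assumptions for illustration, as a real service would pull the IDs from the active tracing context.

```python
import json, logging, sys, time

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object for easier parsing downstream."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Hypothetical correlation fields; in practice these come from the
            # active tracing context rather than hard-coded values.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("charge retried", extra={"trace_id": "4bf92f35", "span_id": "00f067aa"})
```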
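
Error tracking SDKs generally work by installing a global hook that serializes any unhandled exception (including its stack trace) and ships it to a central collector. The sketch below shows the idea with Python's `sys.excepthook`; the collector URL is a placeholder, and real SDKs capture considerably more context (release, user, breadcrumbs, etc.).

```python
import json, sys, traceback, urllib.request

COLLECTOR_URL = "https://errors.example.com/api/ingest"  # placeholder endpoint

def report_unhandled(exc_type, exc_value, exc_tb):
    """Serialize the unhandled exception and ship it to a central error tracker."""
    payload = json.dumps({
        "type": exc_type.__name__,
        "message": str(exc_value),
        "stacktrace": traceback.format_exception(exc_type, exc_value, exc_tb),
    }).encode()
    try:
        req = urllib.request.Request(COLLECTOR_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=2)
    except OSError:
        pass  # never let error reporting crash the process further
    # Preserve the default behavior (print the traceback to stderr).
    sys.__excepthook__(exc_type, exc_value, exc_tb)

sys.excepthook = report_unhandled  # install the global wrapper
```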
A good observability platform will provide a unified mechanism for collecting, searching, analyzing, and monitoring each of the above data types. Once the data is collected, the most critical capabilities include:
- Querying: All data should be easily accessible using a standard, well-documented query language. Retrieval of indexed data should happen nearly instantly.
- Correlations: Data of one type should automatically link to data of another type. For example, traces should link to logs generated during that trace.
- Alerting: The system should provide robust capabilities for defining automated monitors that trigger alerts when certain criteria are met (e.g., error rates spike). The system should be able to send alerts to arbitrary destinations as well as an integrated on-call platform (a simplified monitor sketch follows this list).
- Dashboards: Most users will not be masters of navigating the observability data. The platform should provide the ability to generate predefined dashboards to aid in top-level status reporting and user-driven debugging.
- Archiving: Some types of observability data will need to be shipped to long-term storage for audit and compliance purposes; the system should easily support this archival functionality.
- Cost Efficiency: Observability platforms tend to be one of the largest organizational costs, and pricing can differ dramatically between vendors. At the same time, poorly implemented platforms can cost even more in second-order costs such as lost engineering time or unidentified user experience issues. Ensure you choose a platform that minimizes the sum of both the platform and second-order costs.
- Access Controls: Often your observability platform will not only contain sensitive information about your users and your system at large, but it will also be used to provide an audit log and active alerting for security issues. The platform should contain a robust framework for assigning granular capabilities to its users.
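
To make the alerting capability concrete, a monitor is essentially a query evaluated on a schedule with a threshold attached. The sketch below expresses that idea in plain Python; in practice this logic lives inside the observability platform as an alerting rule, and the monitor name and threshold here are placeholder assumptions.

```python
import json

ERROR_RATE_THRESHOLD = 0.05  # fire when more than 5% of requests fail (placeholder value)

def evaluate_error_rate_monitor(total_requests: int, failed_requests: int) -> dict | None:
    """Evaluate one monitor interval; return an alert payload if the threshold is breached."""
    if total_requests == 0:
        return None
    error_rate = failed_requests / total_requests
    if error_rate <= ERROR_RATE_THRESHOLD:
        return None
    # In a real platform this payload would be routed to an on-call / webhook destination.
    return {
        "monitor": "api-error-rate",  # hypothetical monitor name
        "severity": "high",
        "message": f"Error rate {error_rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}",
    }

# Example: 1,000 requests in the last window, 80 of which failed -> alert payload returned.
print(json.dumps(evaluate_error_rate_monitor(total_requests=1000, failed_requests=80)))
```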
Motivation
This pillar provides organizational value for the following reasons:
- Improves the ability to address production issues by
  - reducing the time needed to detect problems (MTTD)
  - reducing the time needed to diagnose and repair problems (MTTR)
- Improves development velocity by
  - reducing the time spent on debugging
  - improving engineer confidence in system resiliency; coupled with automated rollbacks, observability tooling allows engineers to build and deploy features more rapidly
- Informs product development by
  - allowing for a data-driven approach to issue prioritization
  - providing critical insights into user behavior during application use
- Reduces operational costs by
  - highlighting inefficient code that bloats infrastructure costs
  - reducing the load placed on customer support teams
  - reducing the necessity for stringent QA test suites
- Improves the user experience by
  - reducing the frequency of system problems and providing faster resolution for those that do occur
  - highlighting UX degradations (e.g., application slowness) that would not otherwise be caught or reported
  - building trust by providing visibility into the operational health of systems
- Provides the necessary infrastructure and processes to meet compliance requirements such as audit logging
Benchmark
Metrics
These are common measurements that can help an organization assess their performance in this pillar. These are intended to be assessed within the context of performance on the key platform metrics, not used in a standalone manner.
Indicator | Business Impact | Ideal Target |
---|---|---|
Percent of KPIs measured | Simply having the ability to measure platform KPIs is a core capability of the observability platform | 100% |
Mean time to detect (MTTD) | Detect issues before they become problems for end users | < 5 min |
Mean time to repair (MTTR) | Limit the time developers spend debugging and the impact of production issues on end users (measures only engineering time spent) | < 1 day |
Percent of services with tracing enabled | Tracing is a key observability data type | 100% |
Percent of services with error tracking enabled | Errors are a key observability data type | 100% |
Percent of system logs collected | Logs are a key observability data type | 100% |
Percent of system metrics collected | Metrics are a key observability data type | 100% |
Percent of client applications with RUM enabled | RUM is a key observability data type | 100% |
Percent of critical application flows monitored with synthetic tests | Improves downtime visibility | 100% |
Percent of observed bugs that were first identified by an automated system monitor | Issues detected by the automated monitoring system will be resolved significantly faster than those that are manually identified | 100% |
Number of active alerts | If alerts are continually firing without being actively addressed, engineers will become desensitized and will miss new, critical issues. | 0 |
Percent of alerts that are triaged | Every system problem detected by the observability platform should be triaged. This serves to provide organizational awareness on important issues or refine the system to generate fewer false-positives. | 100% |
Organization Goals
These are common goals to help organizations improve their performance on the key platform metrics. While each goal represents a best practice, the level of impact and optimal approach will depend on your organization's unique context.
| Category | Code | Goal |
| --- | --- | --- |
| Basics | O.B.1 | Can collect metrics in Prometheus format. |
| | O.B.2 | Can collect traces in OTEL format. |
| | O.B.3 | Can collect and index logs. |
| | O.B.4 | Can collect errors. |
| | O.B.5 | Can collect application profiles. |
| | O.B.6 | Can consume and display RUM data. |
| | O.B.7 | Can easily correlate different data types. |
| | O.B.8 | A web UI exists for the observability platform. |
| | O.B.9 | Can query arbitrary data and return results in less than five seconds. |
| | O.B.10 | Can create automated monitors. |
| | O.B.11 | Can issue alerts to arbitrary response platforms. |
| | O.B.12 | Can create shared dashboards. |
| | O.B.13 | Can implement federated authentication for access to the observability platform. |
| | O.B.14 | Can provide access controls that can limit the scope of accessible data for each user. |
| | O.B.15 | Can provide means to export data to a cold storage archive. |
| | O.B.16 | Can configure observability tooling via an IaC tool. |
| | O.B.17 | Captured data is tagged in some unified manner (e.g., `environment`, `service`, etc.). |
| | O.B.18 | Ingested data is made available for querying quickly (< 10 seconds). |
| Control Plane [5] | O.C.1 | Logs and metrics captured for all components of orchestration systems such as Kubernetes (e.g., `kubelet`, `api-server`, etc.). |
| | O.C.2 | Logs and metrics captured for all control-plane workloads. |
| | O.C.3 | Logs and metrics captured for all managed services, for example via AWS CloudWatch. |
| | O.C.4 | Logs and metrics captured for all version control systems. |
| | O.C.5 | Logs and metrics captured for all CI / CD pipelines. |
| | O.C.6 | Flow logs captured from system routers, for example via AWS VPC Flow Logs. |
| | O.C.7 | Instrumentation implemented for network interfaces, for example with Cilium. |
| | O.C.8 | Tracing implemented for all control plane components that interact with the data path, for example with Ingress NGINX or Linkerd Proxy. |
| Servers | O.S.1 | Servers include active monitoring of node health. |
| | O.S.2 | Server syslogs captured. |
| Kubernetes [6] | O.K.1 | Pod metrics exposed and captured via metrics-server. |
| | O.K.2 | Cluster state exposed and captured via kube-state-metrics. |
| | O.K.3 | All Kubernetes events captured. |
| Databases | O.Z.1 | Database queries are captured in traces. |
| | O.Z.2 | Database query profiling is enabled. |
| | O.Z.3 | Database metrics are captured. |
| Backend Applications | O.E.1 | Traces are captured for all applications. |
| | O.E.2 | Logs are captured for all applications. |
| | O.E.3 | Logs are exposed in a standard JSON format across all applications. |
| | O.E.4 | Log levels are implemented and can be dynamically adjusted via environment variables. |
| | O.E.5 | Logs are correlated to traces whenever possible. |
| | O.E.6 | Error tracking is enabled. |
| | O.E.7 | Application sourcemaps (or equivalent) are uploaded to the observability platform. |
| | O.E.8 | Application profiling can be enabled / disabled via environment variables and profiles are captured. |
| Frontend Applications | O.F.1 | User actions that create network requests create trace IDs. |
| | O.F.2 | Logs are captured for all applications. |
| | O.F.3 | Logs are exposed in a standard JSON format. |
| | O.F.4 | Logs are correlated to traces whenever possible. |
| | O.F.5 | Error / crash tracking is enabled. |
| | O.F.6 | Application sourcemaps (or equivalent) are uploaded to the observability platform. |
| | O.F.7 | Load times are captured. |
| | O.F.8 | Performance and error metrics on user-initiated actions are captured (e.g., button clicks). |
| | O.F.9 | Can capture session replays. |
| | O.F.10 | Session replays can be linked to errors that occurred during the session. |
| | O.F.11 | Captured data includes activated feature flags, if applicable. |
| | O.F.12 | User frustration signals are universally captured (rage clicks, dead clicks, etc.). |
| Uptime Tests | O.T.1 | Basic uptime tests are continually (every ~10 seconds) run against every endpoint from outside the private network, if applicable (a minimal probe sketch follows this table). |
| | O.T.2 | Basic uptime tests are continually (every ~10 seconds) run against every endpoint from inside the private network, if applicable. |
| | O.T.3 | End-to-end tests are frequently run for every critical application flow and success status is captured. |
| | O.T.4 | End-to-end tests do not pollute user metrics. |
| | O.T.5 | The results of uptime tests are reported on a central dashboard that is updated in realtime. |
| Monitors | O.M.1 | Can alert when new errors occur (without alerting on every error). |
| | O.M.2 | Can alert when application latency on individual data paths exceeds predefined latency objectives. |
| | O.M.3 | Can alert when error rates on individual data paths exceed predefined error rate objectives. |
| | O.M.4 | Can alert when logs of severity error or higher are generated. |
| | O.M.5 | Can alert when significant deviations from normal traffic patterns occur (e.g., DoS attacks). |
| | O.M.6 | Applications include code that can manually trigger an alert. |
| | O.M.7 | Can alert on new user frustration signals from client-side applications. |
| | O.M.8 | Can alert on significant deviations in resource utilization (e.g., new memory leak, etc.). |
| | O.M.9 | Can alert when synthetic tests fail. |
| | O.M.10 | All monitors have a person or team directly responsible for responding when triggered. |
| Alerts | O.A.1 | Alerts are automatically bucketed into predefined severity categories. |
| | O.A.2 | A standard triaging process exists for evaluating alerts; the triaging process is linked in the alert. |
| | O.A.3 | Severe alerts are routed to an on-call system. |
| | O.A.4 | Alerts can be muted once triaged. |
| | O.A.5 | Alerts include links to helpful debugging dashboards and / or queries. |
| | O.A.6 | All active alerts are exposed in a centrally available location. |
| Dashboards | O.D.1 | All services include a performance dashboard that highlights overall metrics and potential performance problems. |
| | O.D.2 | All applications include an error dashboard that highlights overall metrics and the most prevalent issues. |
| | O.D.3 | The overall platform KPIs have a top-level dashboard. |
| | O.D.4 | A shared system status dashboard exists highlighting any downtime or degradations occurring across the entire system. |
| Archival | O.S.1 | Data not regularly utilized is sent to an inexpensive, long-term storage location. |
| | O.S.2 | Archived data can be easily restored. |
| | O.S.3 | Archived data is immutable and tamper-resistant. |
| Access Control | O.X.1 | Access within the observability platform follows the organization's standards for role-based access control (RBAC). |
| | O.X.2 | All access to any control plane system is logged within the observability platform. |
| | O.X.3 | Access logs are automatically monitored for suspicious activity. |
| | O.X.4 | Access logs are archived indefinitely. |
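
To make goals O.T.1 / O.T.2 concrete, a basic uptime test is simply a probe that repeatedly requests an endpoint and records whether it responded and how quickly. The following is a minimal sketch in Python; the target URL and interval are placeholders, and a production setup would run probes from a synthetic-testing service and feed the results into metrics and monitors.

```python
import time
import urllib.request
from urllib.error import URLError

TARGET_URL = "https://app.example.com/healthz"  # placeholder endpoint
INTERVAL_SECONDS = 10

def probe_once(url: str) -> None:
    """Issue a single HTTP request and record status plus latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            latency_ms = (time.monotonic() - start) * 1000
            print(f"up status={resp.status} latency_ms={latency_ms:.0f}")
    except (URLError, OSError) as exc:
        latency_ms = (time.monotonic() - start) * 1000
        print(f"down error={exc} latency_ms={latency_ms:.0f}")

if __name__ == "__main__":
    while True:
        probe_once(TARGET_URL)
        time.sleep(INTERVAL_SECONDS)
```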
Footnotes
1. As opposed to "handled" errors which application developers have built specific mitigations for.
2. Normally, you can also manually submit errors via the error tracking API.
3. Or another language-specific equivalent such as debug symbols.
4. User-impacting performance problems are usually addressed by analyzing traces instead of this much more complex and granular data.
5. The control plane is composed of all management services and is the part of the network that controls how user / application data is handled (e.g., a Kubernetes API server). In contrast, systems that handle user / application data directly are a part of the data plane.
6. Only required if using Kubernetes.