Observability

Overview

Observability focuses on the active collection, parsing, and indexing of live system data to provide system insights, alerting, and auditability. Seven types of data are typically collected:

  • Metrics: Metrics provide point-in-time measures of various system properties that are aggregated into time-series data stores. The vast majority of services will expose metrics in the industry-standard Prometheus or OpenMetrics data format.

  • Traces: Also known as "distributed traces," traces record the paths taken by data as it propagates through a system. Traces are composed of "spans" which track how long various system components spent processing the data. Spans record a location identifier, a start time, an end time, and other optional metadata. Tracing provides two key capabilities:

    • the ability to quickly identify issues that occur at the granularity of specific actions such as individual HTTP requests

    • aggregated span metrics that can expose problems with specific system components such as performance issues or elevated error rates

  • Logs: Logs are the human-readable information output from a system component. Logs can either be in plaintext or a structured format such as JSON for easier parsing. Most system components output logs to stdout / stderr or a specific log file. An observability platform will collect logs from each component, parse them (add additional structure and metadata), and store them in a data store optimized for text searching.

    Compared to metrics and traces, logs allow for the most verbosity and flexibility; however, they are also the most expensive to manage and store. Often logs have an associated "log level" (e.g., info, warn, error, etc.) that can help automated systems identify which are the most important. In distributed systems, they can also contain trace and / or span IDs to correlate verbose logs with specific traces (see the instrumentation sketch after this list).

  • Errors: Application code can often generate unhandled errors.[1] Error tracking involves deploying a global wrapper that captures all unhandled errors and sends them to a central monitoring platform.[2]

    While an "error" has language-specific meaning, the vast majority of languages have the concept of an Error / Exception object that contains a stack trace. This stack trace can be used to pinpoint the exact line of code that caused the issue, often with the help of uploaded source maps.[3] Error tracking can often also capture other helpful data about the local state of the system when the error was generated (a minimal capture hook is sketched after this list).

  • Application Profiles: Many languages execute on a runtime (e.g., JVM, V8, etc.) that can provide performance data from the perspective of individual functions or methods. This includes granular information on memory and CPU statistics as well as language-specific information such as runtime locks. Profiling is typically deployed when optimizing performance-critical code paths, often for the purpose of cost savings.[4]
  • Synthetic Tests: Synthetic tests are automated programs run against live systems for the purpose of collecting data in a controlled manner. These can be as simple as uptime tests such as repeatedly issuing an HTTP request against an endpoint to ensure it is online. These can also be complex end-to-end tests run against production as part of a continuous testing paradigm (a minimal probe loop is sketched after the capabilities list below). This is explained further in our downtime visibility guide.

  • Real User Monitoring (RUM): For systems that have a client-side UI, a common practice is to embed monitoring scripts into the distributed binaries / bundles in order to track various aspects of the user experience. These tools can capture not only basic data such as load times, errors, and logs, but also more advanced information such as the entire user session. This can be helpful not just for debugging but also to inform product development.
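
To make the first three data types concrete, below is a minimal sketch (in Python) of a service instrumented for metrics, traces, and logs at once: it exposes a Prometheus-format counter and histogram, wraps a unit of work in an OpenTelemetry span, and emits a structured JSON log line carrying the trace and span IDs for correlation. The package choices (prometheus_client, the OpenTelemetry SDK), metric names, service name, and port are illustrative assumptions rather than prescribed values.

```python
import json
import logging
import sys

# Assumed dependencies: prometheus_client and the OpenTelemetry SDK.
from prometheus_client import Counter, Histogram, start_http_server
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Metrics: point-in-time measures scraped from a /metrics endpoint.
REQUESTS = Counter("http_requests_total", "Total HTTP requests handled", ["path"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["path"])

# Traces: each unit of work becomes a span with an identifier, start time, and end time.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # stdout for the sketch
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

# Logs: structured JSON on stdout so a collector can parse and index them.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout-service")


def handle_request(path: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        ctx = span.get_span_context()
        REQUESTS.labels(path=path).inc()
        with LATENCY.labels(path=path).time():
            pass  # real request handling would go here
        # Embedding the trace / span IDs lets the platform correlate this log with the trace.
        log.info(json.dumps({
            "level": "info",
            "msg": "request handled",
            "path": path,
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        }))


if __name__ == "__main__":
    start_http_server(9090)  # serves metrics in the Prometheus exposition format
    handle_request("/login")
```

In a real deployment, the console exporter would be swapped for an exporter pointed at the collection pipeline, and the JSON log formatting would usually come from a logging library rather than being hand-rolled.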

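The "global wrapper" described above under Errors can be as small as a process-level hook for uncaught exceptions. The sketch below uses Python's sys.excepthook to forward the exception type, message, and stack trace to a central collector; the ingestion URL is a hypothetical placeholder, and in practice a vendor SDK would install an equivalent (and more capable) hook for you.

```python
import json
import sys
import traceback
import urllib.request

ERROR_ENDPOINT = "https://errors.example.com/ingest"  # hypothetical collector URL


def report_unhandled(exc_type, exc_value, exc_tb):
    """Forward any uncaught exception (with its stack trace) to the tracking service."""
    payload = {
        "type": exc_type.__name__,
        "message": str(exc_value),
        "stack": traceback.format_exception(exc_type, exc_value, exc_tb),
    }
    try:
        req = urllib.request.Request(
            ERROR_ENDPOINT,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=2)
    except Exception:
        pass  # never let error reporting take down the application
    # Preserve the default behavior so the crash is still visible locally.
    sys.__excepthook__(exc_type, exc_value, exc_tb)


sys.excepthook = report_unhandled
```
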
A good observability platform will provide a unified mechanism for collecting, searching, analyzing, and monitoring each of the above data types. Once the data is collected, the most critical capabilities include:

  • Querying: All data should be easily accessible using a standard, well-documented query language. Retrieval of indexed data should happen nearly instantly.

  • Correlations: Data of one type should automatically link to data of another type. For example, traces should link to logs generated during that trace.

  • Alerting: The system should provide robust capabilities for defining automated monitors that trigger alerts when certain criteria are met (e.g., error rates spike). The system should be able to send alerts to arbitrary destinations as well as an integrated on-call platform.

  • Dashboards: Most users will not be experts at navigating raw observability data. The platform should provide the ability to create predefined dashboards to aid in top-level status reporting and user-driven debugging.

  • Archiving: Some types of observability data will need to be shipped to long-term storage for audit and compliance purposes; the system should easily support this archival functionality.

  • Cost Efficiency: Observability platforms tend to be one of the largest organizational costs and pricing can differ dramatically between vendors. At the same time, poorly implemented platforms can cost even more in second-order costs such as lost engineering time or unidentified user experience issues. Ensure you choose a platform that minimizes the sum of both the platform and second-order costs.

  • Access Controls: Often your observability platform will not only contain sensitive information about your users and your system at large, but it will also be used to provide an audit log and active alerting for security issues. The platform should contain a robust framework for assigning granular capabilities to its users.
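
As a worked example of the synthetic-test and alerting capabilities described above, the sketch below probes an endpoint every ten seconds and posts to an alert webhook after two consecutive failures. The target URL, webhook URL, and failure threshold are illustrative assumptions; a production setup would run probes from multiple locations and route alerts through an on-call platform.

```python
import time
import urllib.error
import urllib.request

# Hypothetical values: the endpoint being probed and the webhook that receives alerts.
TARGET_URL = "https://app.example.com/healthz"
ALERT_WEBHOOK = "https://alerts.example.com/hook"
PROBE_INTERVAL_SECONDS = 10


def probe(url: str) -> bool:
    """Return True if the endpoint answers successfully within 5 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False


def send_alert(message: str) -> None:
    """Deliver a plain-text alert payload to the configured webhook."""
    req = urllib.request.Request(ALERT_WEBHOOK, data=message.encode(), method="POST")
    try:
        urllib.request.urlopen(req, timeout=5)
    except urllib.error.URLError:
        pass  # the probe loop should keep running even if alert delivery fails


if __name__ == "__main__":
    consecutive_failures = 0
    while True:
        if probe(TARGET_URL):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # Require two consecutive failures to avoid alerting on transient blips.
            if consecutive_failures == 2:
                send_alert(f"uptime check failing for {TARGET_URL}")
        time.sleep(PROBE_INTERVAL_SECONDS)
```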

Motivation

This pillar provides organizational value for the following reasons:

  • Improves the ability to address production issues by

    • reducing the latency to issue identification (MTTD)
    • reducing the latency to issue resolution (MTTR)
    • providing necessary context without which remediation would be impossible
  • Improves development velocity by

    • reducing the time spent on debugging
    • improving engineer confidence in system resiliency; coupled with automated rollbacks, observability tooling allows engineers to build and deploy features more rapidly
  • Informs product development by

    • allowing for a data-driven approach to issue prioritization
    • providing critical insights into user behavior during application use
  • Reduces operational costs by

    • highlighting inefficient code that bloats infrastructure costs
    • reducing the load placed on customer support teams
    • reducing the necessity for stringent QA test suites
  • Improves the user experience by

    • reducing the frequency of system problems and providing faster resolution for those that do occur
    • highlighting UX degradations (e.g., application slowness) that would not otherwise be caught or reported
    • building trust by providing visibility into the operational health of systems
  • Provides the necessary infrastructure and processes to meet certain compliance standards such as audit logging

Benchmark

Metrics

These are common measurements that can help an organization assess its performance in this pillar. They are intended to be assessed within the context of performance on the key platform metrics, not used in a standalone manner.

Indicator | Business Impact | Ideal Target
Percent of KPIs measured | Simply having the ability to measure platform KPIs is a core capability of the observability platform | 100%
Mean time to detect (MTTD) | Detect issues before they become problems for end users | < 5 min
Mean time to repair (MTTR) | Limit the time developers spend debugging and the impact of production issues on end users (measures only engineering time spent) | < 1 day
Percent of services with tracing enabled | Tracing is a key observability data type | 100%
Percent of services with error tracking enabled | Errors are a key observability data type | 100%
Percent of system logs collected | Logs are a key observability data type | 100%
Percent of system metrics collected | Metrics are a key observability data type | 100%
Percent of client applications with RUM enabled | RUM is a key observability data type | 100%
Percent of critical application flows monitored with synthetic tests | Improves downtime visibility | 100%
Percent of observed bugs that were first identified by an automated system monitor | Issues detected by the automated monitoring system will be resolved significantly faster than those that are manually identified | 100%
Number of active alerts | If alerts are continually firing without being actively addressed, engineers will become desensitized and will miss new, critical issues | 0
Percent of alerts that are triaged | Every system problem detected by the observability platform should be triaged; this provides organizational awareness of important issues or refines the system to generate fewer false positives | 100%

Organization Goals

These are common goals to help organizations improve their performance on the key platform metrics. While each goal represents a best practice, the level of impact and optimal approach will depend on your organization's unique context.

Category | Code | Goal
Basics | O.B.1 | Can collect metrics in Prometheus format.
 | O.B.2 | Can collect traces in OTEL format.
 | O.B.3 | Can collect and index logs.
 | O.B.4 | Can collect errors.
 | O.B.5 | Can collect application profiles.
 | O.B.6 | Can consume and display RUM data.
 | O.B.7 | Can easily correlate different data types.
 | O.B.8 | A web UI is available for the observability platform.
 | O.B.9 | Can query arbitrary data and return results in less than five seconds.
 | O.B.10 | Can create automated monitors.
 | O.B.11 | Can issue alerts to arbitrary response platforms.
 | O.B.12 | Can create shared dashboards.
 | O.B.13 | Can implement federated authentication for access to the observability platform.
 | O.B.14 | Can provide access controls that can limit the scope of accessible data for each user.
 | O.B.15 | Can provide means to export data to a cold storage archive.
 | O.B.16 | Can configure observability tooling via an IaC tool.
 | O.B.17 | Captured data is tagged in some unified manner (e.g., environment, service, etc.).
 | O.B.18 | Ingested data is made available for querying quickly (< 10 seconds).
Control Plane [5] | O.C.1 | Logs and metrics captured for all components of orchestration systems such as Kubernetes (e.g., kubelet, api-server, etc.).
 | O.C.2 | Logs and metrics captured for all control-plane workloads.
 | O.C.3 | Logs and metrics captured for all managed services, for example via AWS CloudWatch.
 | O.C.4 | Logs and metrics captured for all version control systems.
 | O.C.5 | Logs and metrics captured for all CI / CD pipelines.
 | O.C.6 | Flow logs captured from system routers, for example via AWS VPC Flow Logs.
 | O.C.7 | Instrumentation implemented for network interfaces, for example with Cilium.
 | O.C.8 | Tracing implemented for all control plane components that interact with the data path, for example with Ingress NGINX or Linkerd Proxy.
Servers | O.S.1 | Servers include active monitoring of node health.
 | O.S.2 | Server syslogs captured.
Kubernetes [6] | O.K.1 | Pod metrics exposed and captured via metrics-server.
 | O.K.2 | Cluster state exposed and captured via kube-state-metrics.
 | O.K.3 | All Kubernetes events captured.
Databases | O.Z.1 | Database queries are captured in traces.
 | O.Z.2 | Database query profiling is enabled.
 | O.Z.3 | Database metrics are captured.
Backend Applications | O.E.1 | Traces are captured for all applications.
 | O.E.2 | Logs are captured for all applications.
 | O.E.3 | Logs are exposed in a standard JSON format across all applications.
 | O.E.4 | Log levels are implemented and can be dynamically adjusted via environment variables.
 | O.E.5 | Logs are correlated to traces whenever possible.
 | O.E.6 | Error tracking is enabled.
 | O.E.6 | Application sourcemaps (or equivalent) are uploaded to the observability platform.
 | O.E.7 | Application profiling can be enabled / disabled via environment variables and profiles are captured.
Frontend Applications | O.F.1 | User actions that create network requests generate trace IDs.
 | O.F.2 | Logs are captured for all applications.
 | O.F.3 | Logs are exposed in a standard JSON format.
 | O.F.5 | Logs are correlated to traces whenever possible.
 | O.F.6 | Error / crash tracking is enabled.
 | O.F.6 | Application sourcemaps (or equivalent) are uploaded to the observability platform.
 | O.F.7 | Load times are captured.
 | O.F.8 | Performance and error metrics on user-initiated actions are captured (e.g., button clicks).
 | O.F.9 | Can capture session replays.
 | O.F.10 | Session replays can be linked to errors that occurred during the session.
 | O.F.11 | Captured data includes activated feature flags, if applicable.
 | O.F.12 | User frustration signals are universally captured (rage clicks, dead clicks, etc.).
Uptime Tests | O.T.1 | Basic uptime tests are continually (every ~10 seconds) run against every endpoint from outside the private network, if applicable.
 | O.T.2 | Basic uptime tests are continually (every ~10 seconds) run against every endpoint from inside the private network, if applicable.
 | O.T.3 | End-to-end tests are frequently run for every critical application flow and success status is captured.
 | O.T.4 | End-to-end tests do not pollute user metrics.
 | O.T.5 | The results of uptime tests are reported on a central dashboard that is updated in real time.
Monitors | O.M.1 | Can alert when new errors occur (not alert on every error).
 | O.M.2 | Can alert when application performance on individual data paths exceeds predefined latency objectives.
 | O.M.3 | Can alert when error rates on individual data paths exceed predefined error-rate objectives.
 | O.M.4 | Can alert when logs of severity error or higher are generated.
 | O.M.5 | Can alert when significant deviations from normal traffic patterns occur (e.g., DoS attacks).
 | O.M.6 | Applications include code that can manually trigger an alert.
 | O.M.7 | Can alert on new user frustration signals from client-side applications.
 | O.M.8 | Can alert on significant deviations in resource utilization (e.g., new memory leak, etc.).
 | O.M.9 | Can alert when synthetic tests fail.
 | O.M.10 | All monitors have a person or team directly responsible for responding when triggered.
Alerts | O.A.1 | Alerts are automatically bucketed into predefined severity categories.
 | O.A.2 | A standard triaging process exists for evaluating alerts; the triaging process is linked in the alert.
 | O.A.3 | Severe alerts are routed to an on-call system.
 | O.A.4 | Alerts can be muted once triaged.
 | O.A.5 | Alerts include links to helpful debugging dashboards and / or queries.
 | O.A.6 | All active alerts are exposed in a centrally available location.
Dashboards | O.D.1 | All services include a performance dashboard that highlights overall metrics and potential performance problems.
 | O.D.2 | All applications include an error dashboard that highlights overall metrics and the most prevalent issues.
 | O.D.3 | The overall platform KPIs have a top-level dashboard.
 | O.D.4 | A shared system status dashboard exists highlighting any downtime or degradations occurring across the entire system.
Archival | O.S.1 | Data not regularly utilized is sent to an inexpensive, long-term storage location.
 | O.S.2 | Archived data can be easily restored.
 | O.S.3 | Archived data is immutable and tamper-resistant.
Access Control | O.X.1 | Access within the observability platform follows the organization's standards for role-based access control (RBAC).
 | O.X.2 | All access to any control plane system is logged within the observability platform.
 | O.X.3 | Access logs are automatically monitored for suspicious activity.
 | O.X.4 | Access logs are archived indefinitely.

Footnotes

  1. As opposed to "handled" errors which application developers have built specific mitigations for.

  2. Normally, you can also manually submit errors via the error tracking API.

  3. Or another language-specific equivalent such as debug symbols.

  4. User-impacting performance problems are usually addressed by analyzing traces instead of this much more complex and granular data.

  5. The control plane is composed of all management services and is the part of the network that controls how user / application data is handled (e.g., a Kubernetes API server). In contrast, systems that handle user / application data directly are a part of the data plane.

  6. Only required if using Kubernetes.