Deploying Workloads: High Availability
Objective
Learn how to meet various SLA levels for your deployed workloads.
SLA Levels
The Panfactum stack is designed for high availability. However, extremely high uptime naturally leads to higher costs as greater redundancy and stricter scheduling constraints are required.
We provide the means to tune your infrastructure to choose the right balance between high availability and cost for each of your clusters and workloads. For example, you might prioritize cost savings in development / test environments while prioritizing availability in production environments.
Panfactum modules will adjust their default behavior based on the sla_target
Terragrunt variable. 1 2
That said, there are a handful of actions you must take to achieve any target in addition to setting the sla_target
variable. We detail those tuning requirements below.
Level 1: > 99.9% Uptime
Achieving this level will result in on average less than 45 minutes of workload downtime per month. 3
The Panfactum Stack is designed to achieve this level of availability regardless of how workloads are deployed; however, there are a handful of items that you must do as a cluster operator. These items are all free to do, so they should always be done as a best practice.
Required Tuning:
- Set
sla_target
to1
. - Set up liveness probes for your workloads to ensure that they will be restarted if they enter a failure state.
- Ensure that all workloads implement request draining. 4
- Ensure that all workloads can gracefully terminate in < 90 seconds.
Level 2: > 99.99% Uptime
Achieving this level will result in less than 4 minutes of workload downtime per month. It will result in approximately 50-150% higher costs than Level 1. 5
Required Tuning:
- Set
sla_target
to2
. - Either:
- (Cheaper) Ensure that all workloads that connect to Panfactum-provided databases connect via their corresponding HA mechanism. 6
- (Expensive) Disable disruptions in the relevant Panfactum-provided database modules.
- Everything else in Level 1.
Level 3: > 99.999% Uptime
Achieving this level will result in virtually no downtime (excepting a full regional AWS outage 7). It will result in approximately 30% higher costs than Level 2.
Required Tuning:
- Set
sla_target
to3
. - Either:
- Implement client-side retries for all network requests. 8
- Use Linkerd to implement automatic retries in the service-mesh layer.
- Everything else in Level 2.
Setting the SLA Target
FAQ
Can I really achieve high uptime while running workloads exclusively on spot instances?
You might be tempted to avoid scheduling instances on spot instances in order to achieve maximum uptime. This is unnecessary.
By following the above recommendations, it is entirely possible to schedule all your workloads on spot instances while still achieving 99.999% uptime. This is one of the benefits that Panfactum provides that allows you to save 90% or more on your infrastructure costs.
In general, spot interruptions are relatively infrequent. While the interruption rates differ by region, availability zone, and instance type, we generally only observe about one interruption per week per five instances in the cluster. Moreover, we have been able to successfully run some spot instances for multiple weeks without interruption.
Ultimately, we only recommend using on-demand instances for workloads that cannot gracefully terminate in under 90 seconds. In practice, this scenario tends to be exceptionally rare.
Footnotes
-
These recommendations are based on AWS's SLAs and assume that Amazon will achieve their SLA targets. Ultimately, if AWS itself suffers a complete failure, it is impossible to mitigate all the associated impacts. ↩
-
This default behavior can always be overridden by providing explicit inputs to any given module. ↩
-
This is an average metric. If you deploy to a single AZ, an AZ outage might knock your system offline for several hours. This is extremely rare, but certainly can and does happen. Additionally, even at Level 1, for most months you can expect to have about 99.99% uptime. ↩
-
When workload receives a
SIGTERM
, it should aim to finish processing requests it has already received rather than dropping them. ↩ -
This is highly variable and depends on the specific mix of resources that you are deploying on the cluster. Generally, the more databases that you are running, the higher the cost differential will be as data will need to be replicated across multiple AZs for resiliency. This means not only doubling the storage costs but also typically results in significant inter-AZ network fees. ↩
-
For example, use PgBouncer for PostgreSQL use PgBouncer or Sentinel for Redis. ↩
-
The general advice here is to avoid setting up your core systems in
us-east-1
as this is the only region with a history of full regional outages. ↩ -
Ensure you implement an exponential backoff algorithm to avoid thundering herd issues. ↩