High Availability (HA): Engineering for Resilience

**High Availability (HA)** is the property of a system designed to sustain an agreed level of operational performance, usually uptime, over a long period. In practice it applies redundancy across distributed components so that the system tolerates hardware failures, network partitions, and software bugs.

1. The Mathematics of "Nines"

Availability ($A$) is commonly expressed in terms of Mean Time Between Failures (MTBF, taken here as the average uptime between failures) and Mean Time To Repair (MTTR):

$$ A = \frac{MTBF}{MTBF + MTTR} $$
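
As a worked example, using assumed figures that do not come from any real system: a service that runs for 1,000 hours between failures on average and takes 1 hour to restore has

$$ A = \frac{1000}{1000 + 1} \approx 0.9990 = 99.90\% $$

Halving the repair time to 0.5 hours, with the same MTBF, gives

$$ A = \frac{1000}{1000 + 0.5} \approx 0.9995 = 99.95\% $$

so shortening MTTR raises availability even when failures are no less frequent.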

Industry standards describe availability in "nines":

* **Three Nines (99.9%)**: ~8.77 hours of downtime per year. Typical for internal tools.

* **Four Nines (99.99%)**: ~52.6 minutes of downtime per year. Standard for commercial SaaS.

* **Five Nines (99.999%)**: ~5.26 minutes of downtime per year. Required for telecom and critical financial infrastructure.
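
The downtime figures above follow directly from the availability percentage. The short Python sketch below shows the arithmetic; it assumes an average year of 365.25 days, and the function name `downtime_per_year_hours` is chosen here for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year, including leap days


def downtime_per_year_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


for target in (99.9, 99.99, 99.999):
    hours = downtime_per_year_hours(target)
    print(f"{target}% -> {hours:.2f} h ({hours * 60:.2f} min) per year")
```

Each additional nine divides the budget by ten: roughly 8.77 hours, then 52.6 minutes, then 5.26 minutes.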

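Each additional nine cuts the permitted downtime by a factor of ten, and cost and architectural complexity rise steeply with it. At the higher tiers, manual or reactive recovery is too slow to stay within the budget, so designs typically move toward automated failover and active-active topologies.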

2. RTO and RPO: The Dual Metrics of Disaster Recovery

When a failure occurs, recovery is measured against two distinct objectives, which are often written into Service Level Objectives (SLOs):

* **Recovery Time Objective (RTO)**: The maximum acceptable delay between the interruption of service and the restoration of service. (How long can we be down?)

* **Recovery Point Objective (RPO)**: The maximum acceptable amount of data loss measured in time. (How much data can we lose?)

For example, a system with synchronous replication might achieve an RPO of 0 (no acknowledged write is lost) while still having an RTO of 5 minutes, the time allowed for failure detection and leader election to complete.
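
One way to make the two objectives concrete is to check a past incident against them. The following is a minimal sketch, not a standard API: the `RecoveryObjectives` and `Incident` types and the `evaluate` function are hypothetical names introduced for illustration, and `recovery_point` is assumed to be the point in time up to which data could be restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerated outage duration
    rpo: timedelta  # maximum tolerated window of lost data


@dataclass(frozen=True)
class Incident:
    recovery_point: datetime    # data up to this time was recovered
    outage_start: datetime      # service became unavailable
    service_restored: datetime  # service became available again


def evaluate(incident: Incident, objectives: RecoveryObjectives) -> dict:
    """Compare the observed downtime and data-loss window with the objectives."""
    downtime = incident.service_restored - incident.outage_start
    # Everything written between the recovery point and the outage is lost.
    data_loss_window = max(incident.outage_start - incident.recovery_point, timedelta(0))
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Synchronous replication: nothing is lost, and failover takes 4 minutes.
failure_time = datetime(2024, 1, 1, 12, 0, 0)
objectives = RecoveryObjectives(rto=timedelta(minutes=5), rpo=timedelta(0))
incident = Incident(
    recovery_point=failure_time,
    outage_start=failure_time,
    service_restored=failure_time + timedelta(minutes=4),
)
print(evaluate(incident, objectives))
```

With asynchronous replication, `recovery_point` would trail `outage_start` by the replication lag, and the same check would report an RPO of 0 as missed.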

3. Core HA Topologies

A. Active-Passive (Cold/Warm Standby)

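One primary node handles all traffic. A secondary node stands by and takes over if the primary fails. A warm standby runs continuously and receives replicated data, often asynchronously; a cold standby must first be started and restored before it can serve traffic.

* **Pros**: Simple to implement. Only one node accepts writes at a time, so there are no concurrent write conflicts to resolve.

* **Cons**: The standby's compute sits idle. Failover is comparatively slow (higher RTO) because the passive node must boot or be promoted to the leader role. With asynchronous replication, writes not yet replicated are lost on failover (RPO greater than 0). Split-brain is still possible if the standby is promoted while the old primary is alive, so the old primary must be fenced.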
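
The failover decision on the passive node is often driven by heartbeats. The sketch below illustrates the idea under stated assumptions: `StandbyMonitor` is a hypothetical class, heartbeats are assumed to arrive through some transport that calls `record_heartbeat`, and the clock is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable


class StandbyMonitor:
    """Runs on the passive node and decides when to take over from the primary."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_heartbeat = clock()
        self.role = "passive"

    def record_heartbeat(self) -> None:
        """Call whenever a heartbeat from the primary arrives."""
        self.last_heartbeat = self.clock()

    def check(self) -> str:
        """Promote this node if the primary has been silent for too long."""
        silent_for = self.clock() - self.last_heartbeat
        if self.role == "passive" and silent_for > self.timeout_seconds:
            # A missed heartbeat may mean a network partition rather than a
            # dead primary. A real system would fence the old primary before
            # promoting; that step is omitted in this sketch.
            self.role = "active"
        return self.role
```

The timeout is a direct trade-off: a shorter value lowers the RTO but raises the risk of promoting the standby during a brief network hiccup.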

B. Active-Active (Multi-Primary)

Multiple nodes handle traffic simultaneously. State is synchronized across all nodes, and concurrent updates are reconciled by a conflict-resolution mechanism such as last-writer-wins or CRDTs (Conflict-free Replicated Data Types).

* **Pros**: Near-zero RTO. Traffic is load-balanced, utilizing all hardware.

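* **Cons**: Considerably more complex. Concurrent writes accepted by different nodes must be reconciled, and asynchronous multi-primary replication gives up linearizability: clients can briefly observe different values on different nodes.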
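
To illustrate how a CRDT lets several active nodes accept writes without coordination, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and node names are illustrative, and a production system would also need to ship the state between nodes, which is left out here.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by element-wise maximum."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount: int = 1) -> None:
        """Record a local write; each node only ever touches its own slot."""
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state; safe to repeat and to reorder."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# Two active nodes accept writes concurrently, then exchange state.
node_a, node_b = GCounter("a"), GCounter("b")
node_a.increment(2)
node_b.increment(3)
node_a.merge(node_b)
node_b.merge(node_a)
print(node_a.value(), node_b.value())
```

Because the merge is commutative, associative, and idempotent, both replicas converge on the same total regardless of the order or number of merges. Before the merge, however, each node reports only its own partial count, which is the linearizability trade-off noted above.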

4. Modern Resilience Patterns

* **Cell-Based Architecture**: Partitioning the system into isolated "cells", each serving a subset of the traffic, to bound the blast radius of a failure. (See [Cell-Based Architecture](CellBasedArchitecture).)

* **Redundancy at Every Layer**: From N+1 power supplies in the data center to deployment across multiple availability zones (multi-AZ) and active-active global load balancing (e.g., Anycast IP routing).
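
As a rough illustration of how cell isolation bounds the blast radius, the sketch below pins each customer to exactly one cell by hashing the customer ID. The cell names and the `cell_for` function are hypothetical, and hashing with a modulo is only one possible assignment scheme.

```python
import hashlib
from typing import Sequence

CELLS = ("cell-0", "cell-1", "cell-2", "cell-3")  # hypothetical cell names


def cell_for(customer_id: str, cells: Sequence[str] = CELLS) -> str:
    """Map a customer to one cell, deterministically."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


# The same customer always lands in the same cell, so an outage in one
# cell affects only the customers assigned to it.
print(cell_for("customer-42"))
```

A modulo-based mapping reshuffles most customers when the number of cells changes, so a system that adds cells over time would more likely use an explicit assignment table or consistent hashing.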

---

**See Also:**

- [Cell-Based Architecture](CellBasedArchitecture)

- [Leader and Followers](LeaderAndFollowers)

- [CAP Theorem](CapTheorem)