High Availability (HA): Engineering for Resilience

High Availability (HA) is the property of a system designed to meet an agreed level of operational performance, usually uptime, over a sustained period. In practice it means applying redundancy across distributed components so that the system survives hardware failures, network partitions, and software bugs.

1. The Mathematics of "Nines"

Availability ($A$) is formally defined by Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR):

$$ A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} $$
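
As a worked example with assumed figures (illustrative only, not taken from a real system), a service that fails on average once every 999 hours and takes 1 hour to repair has:

$$ A = \frac{999}{999 + 1} = 0.999 $$

Halving MTTR to 0.5 hours gives $999 / 999.5 \approx 0.9995$, exactly the same result as doubling MTBF to 1998 hours ($1998 / 1999$). Because only the ratio of the two matters, faster repair improves availability as much as rarer failure does.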

By convention, availability targets are expressed in "nines":

Three Nines (99.9%): ~8.77 hours of downtime per year. Typical for internal tools.
Four Nines (99.99%): ~52.6 minutes of downtime per year. Standard for commercial SaaS.
Five Nines (99.999%): ~5.26 minutes of downtime per year. Required for telecom and critical financial infrastructure.
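
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365.25-day average year (the function and constant names are illustrative):

```python
# Minutes in an average year, counting leap years: 365.25 * 24 * 60 = 525,960.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction (e.g. 0.999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, a in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    minutes = downtime_minutes_per_year(a)
    print(f"{label}: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Each added nine divides the budget by ten, which is why the step from four to five nines leaves only about five minutes per year for every incident combined.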

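Each additional nine cuts the downtime budget by a factor of ten, and cost and architectural complexity rise steeply with it. At the higher tiers, manual or reactive recovery is too slow to fit within the budget, so the design has to move toward automated failover and, ultimately, active-active topologies.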

2. RTO and RPO: The Dual Metrics of Disaster Recovery

When a failure occurs, recovery is measured against two distinct Service Level Objectives (SLOs):

Recovery Time Objective (RTO): The maximum acceptable delay between the interruption of service and the restoration of service. (How long can we be down?)
Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. (How much data can we lose?)

The two are independent. For example, a system with synchronous replication might have an RPO of 0 (no acknowledged write is lost) but an RTO of 5 minutes (the time needed to detect the failure and complete a leader election).
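
To make the distinction concrete, the sketch below scores a single incident against both objectives. The `Incident` type and its field names are hypothetical, and it makes a simplifying assumption: every write made between the last replicated write and the start of the outage is treated as lost.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_replicated_write: datetime  # newest write that reached the replica
    outage_start: datetime           # when the service stopped serving traffic
    service_restored: datetime       # when the service was serving traffic again


def recovery_time(incident: Incident) -> timedelta:
    """Actual downtime, to be compared against the RTO."""
    return incident.service_restored - incident.outage_start


def data_loss_window(incident: Incident) -> timedelta:
    """Actual span of lost writes, to be compared against the RPO."""
    return max(incident.outage_start - incident.last_replicated_write, timedelta(0))


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    return recovery_time(incident) <= rto and data_loss_window(incident) <= rpo


incident = Incident(
    last_replicated_write=datetime(2024, 1, 1, 11, 59, 30),
    outage_start=datetime(2024, 1, 1, 12, 0, 0),
    service_restored=datetime(2024, 1, 1, 12, 4, 0),
)
print(recovery_time(incident))     # 0:04:00
print(data_loss_window(incident))  # 0:00:30
```

With an RTO of 5 minutes this incident passes on recovery time, but it fails an RPO of 0 because 30 seconds of writes never reached the replica.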

3. Core HA Topologies

A. Active-Passive (Cold/Warm Standby)

One primary node handles all traffic. A secondary node serves no traffic and receives replicated state from the primary, typically asynchronously.

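Pros: Simple to implement. With only one writer at a time, split-brain is less likely, provided the old primary is fenced off during failover.
Cons: The standby's compute sits largely unused. Failover is slow (high RTO) because the passive node must boot up or be promoted to leader, and with asynchronous replication the most recent writes may be lost (non-zero RPO).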
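
A minimal sketch of the failover decision in an active-passive pair, assuming a heartbeat-timeout detector. The `FailoverMonitor` class is hypothetical, and timestamps are passed in explicitly so the logic is easy to follow and test:

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    heartbeat_timeout: float     # seconds of silence before the primary is presumed dead
    last_heartbeat: float = 0.0  # timestamp of the most recent heartbeat from the primary
    active: str = "primary"      # node currently serving traffic

    def record_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Promote the standby once the primary has been silent for too long."""
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for > self.heartbeat_timeout:
            # A production system would fence the old primary before this step
            # so that two nodes never accept writes at once (split-brain).
            self.active = "standby"
        return self.active


monitor = FailoverMonitor(heartbeat_timeout=3.0)
monitor.record_heartbeat(now=10.0)
print(monitor.check(now=12.0))  # primary (2 s of silence, within the timeout)
print(monitor.check(now=13.5))  # standby (3.5 s of silence, timeout exceeded)
```

The timeout is a floor on RTO: a short value speeds up detection but risks promoting the standby during a brief network blip, while a long value delays every real failover.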

B. Active-Active (Multi-Primary)

Multiple nodes handle traffic simultaneously. State is synchronized across all nodes, often with conflict-resolution mechanisms such as CRDTs (Conflict-free Replicated Data Types).

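Pros: Near-zero RTO, because the surviving nodes are already serving traffic. Load is balanced across all hardware.
Cons: Extremely complex. Concurrent writes to different nodes must be reconciled, so such systems typically give up strict linearizability in favor of eventual consistency.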
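
To illustrate how a CRDT lets active nodes accept writes independently and still converge, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class is illustrative and not taken from any particular library:

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> number of increments observed from that node

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        """Take the per-node maximum; commutative, associative, and idempotent."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# Two active nodes accept writes concurrently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

Because the merge is a per-node maximum, replicas can exchange state in any order and any number of times and still agree. The trade-off is that a read on one node may briefly miss increments made on another, which is the loss of linearizability noted above.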

4. Modern Resilience Patterns

Cell-Based Architecture: Partitioning the system into isolated "cells" to strictly bound the blast radius of a failure. (See Cell-Based Architecture).
Redundancy at Every Layer: From N+1 power supplies in the data center to multi-AZ (availability zone) deployments and active-active global load balancing (e.g., Anycast IP routing).
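
The blast-radius idea behind cells can be sketched as a deterministic tenant-to-cell mapping. This is a minimal illustration that assumes hash-based assignment; the function names are hypothetical, and real deployments often use an explicit mapping table instead so that tenants do not move when the number of cells changes.

```python
import hashlib


def cell_for(tenant_id: str, num_cells: int) -> int:
    """Map a tenant to a cell using a stable hash (the same input always gives the same cell)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_tenants(tenants: list[str], failed_cell: int, num_cells: int) -> list[str]:
    """Tenants whose traffic is impacted if one cell fails; all others are untouched."""
    return [t for t in tenants if cell_for(t, num_cells) == failed_cell]


tenants = [f"tenant-{i}" for i in range(1000)]
impacted = affected_tenants(tenants, failed_cell=0, num_cells=10)
print(f"{len(impacted)} of {len(tenants)} tenants impacted by losing cell 0")
```

With ten cells and an even spread, losing one cell affects roughly a tenth of tenants rather than all of them, provided the cells share no common dependency.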

See Also: