LeaderElectionAlgorithms
Leader election ensures that exactly one node in a cluster holds the authority to coordinate operations (e.g., write-serialization, task scheduling). Failure to maintain a unique leader results in **Split-Brain**, where multiple nodes claim authority, leading to state divergence and data corruption.
Why Leaders Matter
1. **Write Serialization:** Single-master systems (PostgreSQL, MySQL) require a leader to order transactions.
2. **Coordination:** Schedulers (Kubernetes, Nomad) need one source of truth for task placement.
3. **Efficiency:** It is faster to delegate to a leader than to run a full consensus round (Paxos/Raft) for every single read or write.
Consensus Algorithms: The Modern Standard
Modern systems use consensus-based election to prevent split-brain via a **Quorum ($N/2 + 1$)**.
1. Raft
Raft is designed for understandability and implements a strict leader hierarchy.
- **Election Timeout:** A follower becomes a **Candidate** if it hasn't heard a leader heartbeat.
- **Term:** A monotonically increasing counter. A candidate with a higher term wins.
- **Log Matching:** A candidate cannot be elected unless its log is as up-to-date as the majority.
2. Paxos
The foundation of distributed consensus. Mathematically exhaustive but complex to implement.
- **Proposer/Acceptor:** Nodes negotiate a proposal number.
- **Zab (Zookeeper):** A Paxos variant optimized for high-throughput broadcast.
Simple Election Patterns (Non-Consensus)
| Algorithm | Mechanism | Pros | Cons |
|---|---|---|---|
| **Bully** | Highest Node ID wins. | Simple. | Flapping if highest ID is unstable. |
| **Ring** | Nodes pass election tokens in a circle. | Deterministic. | High latency in large rings. |
| **Leases** | Leader holds a TTL-based lock in a store (etcd/Consul). | Easy to integrate. | Dependency on the lock store. |
Guarding Against Split-Brain: Fencing
When a leader is suspected dead and a new one is elected, the old leader might still be alive (e.g., during a long GC pause). **Fencing** prevents the zombie leader from causing harm.
- **Fencing Tokens:** Every election increments a token. The backend (Storage/DB) only accepts writes from the highest known token.
- **STONITH (Shoot The Other Node In The Head):** Physically power cycling the suspected failed node.
Comparison: Leader Election vs. Gossip
| Metric | Leader Election | [Gossip Protocol](GossipProtocol) |
|---|---|---|
| **Consistency** | Strong (Linearizable) | Eventual |
| **Coordination** | High (Stop-the-world) | Low (Peer-to-peer) |
| **Network Cost** | $O(N^2)$ during election | $O(N)$ constant |
| **Use Case** | Shared State, Transactions | Membership, Metrics |
Implementation Strategy
- **Don't implement Raft from scratch.** Use proven libraries (etcd `raft`, Hashicorp `raft`).
- **Use external coordinators** (etcd, Zookeeper) for application-level leadership rather than building it into every microservice.
- **Set aggressive heartbeats but conservative election timeouts** to minimize flapping.