Database Sharding and Consistent Hashing
As datasets exceed the storage and throughput limits of a single machine, systems must employ **Sharding**—the horizontal partitioning of data across a cluster of nodes. The primary technical challenge of sharding is deciding which piece of data lives on which node while maintaining the ability to scale elastically.
1. Sharding Strategies
Range-Based Sharding
Data is divided into continuous ranges based on a **Shard Key** (e.g., Users A–M on Node 1, N–Z on Node 2).
* **Strength:** Highly efficient for range queries.
* **Weakness:** Prone to **Hotspots**. If all new users start with the letter 'Z', Node 2 will be overloaded while Node 1 sits idle.
Directory-Based Sharding
A central "Lookup Service" tracks the mapping of keys to shards.
* **Strength:** Total flexibility; individual rows can be moved between nodes easily.
* **Weakness:** The directory becomes a performance bottleneck and a single point of failure.
Hash-Based Sharding (Naive)
Uses a simple modulo formula: `node_id = hash(key) % N`, where $N$ is the number of nodes.
* **The Modulo Problem:** If you add or remove a node (changing $N$), the mapping for nearly every key in the system changes, triggering a catastrophic cluster-wide data migration.
2. The Consistent Hashing Solution
Consistent Hashing solves the "Modulo Problem" by decoupling keys from the number of physical nodes.
The Hash Ring
1. **Ring Mapping:** Both data keys and node IDs are hashed onto a logical circle (the **Hash Ring**) ranging from $0$ to $2^{n}-1$.
2. **Assignment:** To find the location of a key, you hash it to a point on the ring and travel **clockwise** until you hit the first node. That node "owns" the key.
3. **Elasticity:** When a node is added, it only "steals" a small arc of keys from its immediate neighbor. On average, only **$1/N$** of the data must be moved.
Virtual Nodes (vNodes)
To prevent uneven data distribution (where one node owns a larger slice of the ring than others), each physical server is hashed multiple times to different locations.
* **Benefit:** If a node fails, its load is balanced across multiple other nodes in the cluster rather than overwhelming a single neighbor.
3. Comparison Summary
| Feature | Naive Modulo | Consistent Hashing |
| :--- | :--- | :--- |
| **Scaling Cost** | **Extreme** (Remap all data) | **Minimal** (Remap $1/N$ data) |
| **Complexity** | Low | High (Ring management) |
| **Data Balance** | Uniform (Fixed) | Uniform (via vNodes) |
| **Industry Standard**| Legacy / Small scale | **Cassandra, DynamoDB, Riak** |
See Also
* [Distributed Systems Hub](DistributedSystemsHub) — Scaling foundations.
* [Majority Quorum](MajorityQuorum) — Managing consistency across shards.
* [Leader and Followers](LeaderAndFollowers) — Using replication within a shard.