Database Sharding and Consistent Hashing

As datasets exceed the storage and throughput limits of a single machine, systems must employ **Sharding**—the horizontal partitioning of data across a cluster of nodes. The primary technical challenge of sharding is deciding which piece of data lives on which node while maintaining the ability to scale elastically.

1. Sharding Strategies

Range-Based Sharding

Data is divided into continuous ranges based on a **Shard Key** (e.g., Users A–M on Node 1, N–Z on Node 2).

* **Strength:** Highly efficient for range queries.

* **Weakness:** Prone to **Hotspots**. If all new users start with the letter 'Z', Node 2 will be overloaded while Node 1 sits idle.

Directory-Based Sharding

A central "Lookup Service" tracks the mapping of keys to shards.

* **Strength:** Total flexibility; individual rows can be moved between nodes easily.

* **Weakness:** The directory becomes a performance bottleneck and a single point of failure.

Hash-Based Sharding (Naive)

Uses a simple modulo formula: `node_id = hash(key) % N`, where $N$ is the number of nodes.

* **The Modulo Problem:** If you add or remove a node (changing $N$), the mapping for nearly every key in the system changes, triggering a catastrophic cluster-wide data migration.

2. The Consistent Hashing Solution

Consistent Hashing solves the "Modulo Problem" by decoupling keys from the number of physical nodes.

The Hash Ring

1. **Ring Mapping:** Both data keys and node IDs are hashed onto a logical circle (the **Hash Ring**) ranging from $0$ to $2^{n}-1$.

2. **Assignment:** To find the location of a key, you hash it to a point on the ring and travel **clockwise** until you hit the first node. That node "owns" the key.

3. **Elasticity:** When a node is added, it only "steals" a small arc of keys from its immediate neighbor. On average, only **$1/N$** of the data must be moved.

Virtual Nodes (vNodes)

To prevent uneven data distribution (where one node owns a larger slice of the ring than others), each physical server is hashed multiple times to different locations.

* **Benefit:** If a node fails, its load is balanced across multiple other nodes in the cluster rather than overwhelming a single neighbor.

3. Comparison Summary

| Feature | Naive Modulo | Consistent Hashing |

| :--- | :--- | :--- |

| **Scaling Cost** | **Extreme** (Remap all data) | **Minimal** (Remap $1/N$ data) |

| **Complexity** | Low | High (Ring management) |

| **Data Balance** | Uniform (Fixed) | Uniform (via vNodes) |

| **Industry Standard**| Legacy / Small scale | **Cassandra, DynamoDB, Riak** |

See Also

* [Distributed Systems Hub](DistributedSystemsHub) — Scaling foundations.

* [Majority Quorum](MajorityQuorum) — Managing consistency across shards.

* [Leader and Followers](LeaderAndFollowers) — Using replication within a shard.