Database Sharding and Consistent Hashing

As datasets exceed the storage and throughput limits of a single machine, systems must employ Sharding—the horizontal partitioning of data across a cluster of nodes. The primary technical challenge of sharding is deciding which piece of data lives on which node while maintaining the ability to scale elastically.

1. Sharding Strategies

Range-Based Sharding

Data is divided into continuous ranges based on a Shard Key (e.g., Users A–M on Node 1, N–Z on Node 2).

Strength: Highly efficient for range queries.
Weakness: Prone to Hotspots. If all new users start with the letter 'Z', Node 2 will be overloaded while Node 1 sits idle.

Directory-Based Sharding

A central "Lookup Service" tracks the mapping of keys to shards.

Strength: Total flexibility; individual rows can be moved between nodes easily.
Weakness: The directory becomes a performance bottleneck and a single point of failure.

Hash-Based Sharding (Naive)

Uses a simple modulo formula: node_id = hash(key) % N, where $N$ is the number of nodes.

The Modulo Problem: If you add or remove a node (changing $N$ ), the mapping for nearly every key in the system changes, triggering a catastrophic cluster-wide data migration.

2. The Consistent Hashing Solution

Consistent Hashing solves the "Modulo Problem" by decoupling keys from the number of physical nodes.

The Hash Ring

Ring Mapping: Both data keys and node IDs are hashed onto a logical circle (the Hash Ring) ranging from $0$ to $2^-1$.
Assignment: To find the location of a key, you hash it to a point on the ring and travel clockwise until you hit the first node. That node "owns" the key.
Elasticity: When a node is added, it only "steals" a small arc of keys from its immediate neighbor. On average, only $1/N$ of the data must be moved.

Virtual Nodes (vNodes)

To prevent uneven data distribution (where one node owns a larger slice of the ring than others), each physical server is hashed multiple times to different locations.

Benefit: If a node fails, its load is balanced across multiple other nodes in the cluster rather than overwhelming a single neighbor.

3. Comparison Summary

Feature	Naive Modulo	Consistent Hashing
Scaling Cost	Extreme (Remap all data)	Minimal (Remap $1/N$ data)
Complexity	Low	High (Ring management)
Data Balance	Uniform (Fixed)	Uniform (via vNodes)
Industry Standard	Legacy / Small scale	Cassandra, DynamoDB, Riak