Hash Table Design
Hash tables are the most-used data structure in software. They give O(1) average lookup, insert, delete. The "average" hides important engineering decisions.
Understanding hash table design helps you use them well, avoid pathological cases, and know when to use alternatives.
The basic structure
- An array of slots (the table)
- A hash function: key → array index
- A collision resolution strategy
Insert: hash key, place in slot.
Lookup: hash key, check slot.
Delete: hash key, remove from slot.
Hash functions
A good hash function:
- Uniform distribution across slot range
- Fast to compute
- Avalanche effect: small input changes flip ~half the output bits
Common functions
- **FNV**: fast, simple, decent
- **MurmurHash**: fast, good distribution, used by Hadoop, Kafka
- **xxHash**: fastest non-cryptographic
- **CityHash, FarmHash**: Google's
- **SipHash**: secure against hash flooding
For application use: language standard library hash. Don't roll your own.
Cryptographic hashes
SHA-256 etc. Overkill for hash tables (slow). Use only when adversarial inputs matter.
Collision resolution
Two main strategies:
Chaining
Each slot is a linked list (or other structure). Collisions append to list.
Pros:
- Simple
- Handles high load factors gracefully
- Deletion is straightforward
Cons:
- Pointer chasing (cache unfriendly)
- Memory overhead (list nodes)
Open addressing
When slot is full, probe to find empty slot.
Probe sequences:
- **Linear**: try i+1, i+2, i+3, ... (cache friendly, but clustering)
- **Quadratic**: try i+1², i+2², ... (less clustering)
- **Double hashing**: use second hash as step (best uniformity)
Pros:
- Cache friendly (no pointer chasing)
- Lower memory overhead
Cons:
- Performance degrades sharply at high load
- Deletion is tricky (tombstones)
- Clustering can hurt linear probing
Robin Hood hashing
Open addressing variant. When inserting, if you encounter a record with smaller probe distance, swap.
Equalizes probe distances. Practical and fast.
Cuckoo hashing
Two hash functions; each key has two possible slots. Insert: place in either; if both occupied, evict and reinsert.
Worst-case O(1) lookup. Insertion can fail (need rehash).
Hopscotch hashing
Combines open addressing with bounded distance. Each key within H slots of its ideal position.
Load factor
Load factor α = filled slots / total slots.
Performance vs load factor depends on strategy:
- Chaining: degrades gradually
- Open addressing: degrades sharply above ~0.7
- Robin Hood: works well to ~0.9
Resizing
When load factor exceeds threshold, resize the table (double capacity, rehash all keys).
Cost
Single resize: O(n).
Amortized over insertions: O(1).
Incremental rehashing
For latency-sensitive applications: rehash a few entries per operation rather than all at once.
Used in Redis.
Modern implementations
Java HashMap
Chaining. Treeifies long chains for worst-case O(log n).
Default load factor 0.75.
C++ std::unordered_map
Chaining. Often slower than alternatives due to spec constraints.
Many use absl::flat_hash_map or robin_hood instead.
Python dict
Open addressing with perturbation-based probing.
Insertion-ordered since Python 3.7.
Go map
Open addressing variant. Uses 8-key buckets for cache locality.
Rust HashMap
SipHash by default (DoS-resistant). Open addressing (Robin Hood-style).
Performance considerations
Cache locality
Open addressing wins for cache. Each probe is contiguous memory.
Chaining: pointer chasing, cache misses.
Branch prediction
Linear probing has predictable access patterns.
Hash quality
Bad hash → many collisions → terrible performance.
Test hash quality with key distribution from production.
Memory layout
Pack data tightly. Group hot fields. Avoid pointer indirection.
absl::flat_hash_map is fast partly because it stores keys+values inline.
Hash flooding attacks
Adversary submits keys all hashing to same slot. Hash table degrades to linked list.
Mitigations:
- Random hash seed per process
- SipHash or other DoS-resistant hash
- Treeify long chains (Java approach)
Many languages randomize hash seeds by default. Some still vulnerable.
When NOT to use hash table
Need ordered iteration
Use a tree (TreeMap, std::map).
Worst-case bounds matter
Hash tables have O(n) worst case. For real-time systems, may not be acceptable.
Range queries
Hash tables can't do range queries. Use trees.
Very small data
For small N (<10-20), linear search is faster.
Frequent updates with unstable hash
If keys change, hash tables break.
Specialized variants
Concurrent hash tables
Multiple threads access simultaneously.
- Java: ConcurrentHashMap (lock striping)
- Java: high-performance: NonBlockingHashMap
- C++: tbb::concurrent_hash_map
Lock-free designs exist but are complex.
Persistent / immutable hash tables
Update returns a new map; old map unchanged.
Used in functional languages (Clojure, Scala).
Hash array mapped tries (HAMT) are a common implementation.
Bloom filters
Probabilistic. Tells you "definitely not" or "maybe yes."
Useful for cache filters: skip expensive lookup if Bloom says no.
Cuckoo filters
Like Bloom but supports deletion.
Count-min sketch
Probabilistic counting. For analytics, frequency estimation.
Memory overhead
Empty hash table: pointer + size + capacity. Tens of bytes.
Per entry:
- Chaining: key + value + pointer + (allocation overhead)
- Open addressing: key + value + (hash if cached)
Open addressing typically lower per-entry overhead.
Common failure patterns
Custom hash that's bad
Random-seeming functions often have bad distribution. Use library hashes.
Mutating keys after insertion
Hash changes; key unfindable.
Insufficient capacity
Hash tables resize, but resizing during a critical path causes latency spikes.
Adversarial inputs
User-supplied keys without DoS-resistant hash.
Wrong load factor
Default usually fine; tune only with measurement.
Iterating during modification
Behavior is implementation-specific. Often UB or exception.
Engineering recommendation
For most application code:
1. Use the language standard library hash table
2. Don't reimplement
3. Benchmark only if performance is suspect
4. Consider absl::flat_hash_map (C++), or hashbrown (Rust default)
When you do need to optimize:
1. Profile to confirm hash table is the bottleneck
2. Test with realistic key distribution
3. Measure load factor in production
4. Consider open-addressing variant if cache-bound
Further Reading
- [BalancedSearchTrees](BalancedSearchTrees) — Ordered alternative
- [HeapAndPriorityQueues](HeapAndPriorityQueues) — Different access pattern
- [Data Structures Hub](DataStructuresHub) — Cluster index