Memory Architectures
Memory in modern computers is hierarchical, not uniform. Understanding the hierarchy explains why some code is fast and some slow even when the algorithms are the same.
This page covers the memory hierarchy and its software implications.
The memory hierarchy
From fastest to slowest:
| Level | Capacity | Latency | Bandwidth |
|-------|----------|---------|-----------|
| Registers | ~256 bytes | <1 ns | ~TB/s |
| L1 cache | 32-64 KB | ~1 ns | hundreds of GB/s |
| L2 cache | 256 KB - 2 MB | ~3-10 ns | hundreds of GB/s |
| L3 cache | 4-128 MB | ~10-30 ns | hundreds of GB/s |
| RAM | GBs | ~50-100 ns | ~25-50 GB/s |
| SSD | TBs | ~10-100 μs | GB/s |
| HDD | TBs | ~10 ms | hundreds of MB/s |
| Network storage | PBs | ms+ | depends |
Each level is roughly 10x slower and 10x larger than the level above.
Why hierarchy
Single fast/large memory would be ideal. Physics and economics conflict:
- Fast memory is small and expensive
- Large memory is slow and cheap
Hierarchy gives an effective speed close to fastest level (when cache hits) and capacity of slowest level.
CPU caches
L1
Per-core. Split into instruction (L1i) and data (L1d).
Smallest, fastest. Hit determines whether instruction runs immediately or waits.
L2
Per-core (usually). Larger, slightly slower.
L3
Shared across cores. Largest CPU cache.
Provides cache coherence reference for inter-core communication.
Cache line
Unit of cache management. Typically 64 bytes.
Memory is read/written in cache lines. Touching one byte loads 64.
Has profound implications for data layout.
Cache behavior
Hit / miss
Cache hit: data in cache, fast access.
Cache miss: must fetch from slower level. Slow.
Prefetching
CPU predicts what memory you'll access; loads it early.
Sequential access patterns prefetch well; random doesn't.
Eviction
Cache full; old data evicted to make room.
Policies: LRU (or approximations), random.
Cache-friendly programming
Sequential access
Iterate through arrays in order. Prefetcher loves this.
Cache-line-aligned data
Avoid splitting hot data across cache lines.
Compact data
Smaller types fit more in cache.
Avoid pointer chasing
Linked lists, trees with pointers cause cache misses.
Arrays are cache-friendly; pointer-based structures often aren't.
Locality of reference
Use data soon after touching nearby data. Stays in cache.
False sharing
Multiple cores writing different fields in same cache line. Each write invalidates other cores' caches.
Symptom: parallel code mysteriously slow.
Solution: pad data to cache line boundaries; separate hot fields.
NUMA (Non-Uniform Memory Access)
In multi-socket systems:
- Each CPU has local memory
- Remote memory (other CPU's local) is slower
NUMA-aware programming:
- Pin threads to cores
- Allocate memory near where it's used
- Avoid cross-socket data sharing
OS schedulers and allocators are NUMA-aware to some extent. Critical workloads need explicit attention.
Virtual memory
OS abstraction: each process sees its own address space.
Page tables
Map virtual addresses to physical. Hardware MMU translates.
TLB (Translation Lookaside Buffer)
Caches recent translations. TLB miss is expensive (full page table walk).
Page faults
Virtual page not in physical memory. OS handles, possibly loading from disk.
Major faults (disk I/O) are very slow (~10ms).
Huge pages
Standard pages: 4KB. Huge: 2MB or 1GB.
Fewer page table entries; better TLB hit rate.
Used for performance-critical apps with large working sets.
RAM
Modern DDR4/DDR5:
- DDR4: ~25 GB/s per channel
- DDR5: ~50 GB/s per channel
- Multi-channel: 2x or more
Latency hasn't improved much over decades. Bandwidth has.
For LLM inference: bandwidth is often the bottleneck.
Memory bandwidth
Sequential reads can hit max bandwidth.
Random reads: latency-bound; effective bandwidth much lower.
Implication
A program reading random 8-byte values from RAM: ~50ns each. For 25 GB/s peak, that's 5 GB/s actually achieved (160M ops/sec). Sequential could hit 25 GB/s of real data.
Persistent storage
SSD
NAND flash. Block-based read/write.
Read latency: tens of microseconds.
Write latency: depends on technology.
Wear: limited write cycles per cell.
NVMe: PCIe-connected SSD. Fast.
HDD
Spinning platters. Mechanical seek time dominates.
Sequential: ~200 MB/s.
Random: limited by seek time (~10ms each).
Persistent memory (Optane was discontinued)
Byte-addressable persistent memory. Slower than DRAM, faster than SSD.
Niche product line.
Memory access patterns
Streaming / sequential
Hardware prefetcher kicks in. Effective bandwidth approaches peak.
Strided
Access pattern with fixed stride. Prefetcher may detect.
If stride > cache line, every access is a miss.
Random
Worst case. Each access pays full latency.
Implications for algorithms
Cache-oblivious algorithms
Designed to work well at all levels of hierarchy without tuning.
Examples: cache-oblivious matrix multiplication.
Blocking
Process data in chunks that fit in cache.
Matrix multiplication: blocking is huge speedup.
Locality-aware data structures
B-trees over binary trees for cache friendliness.
Arrays of structures (AoS) vs structures of arrays (SoA) tradeoffs.
Memory-efficient algorithms
In-place where possible. Reduces working set.
Modern challenges
Memory wall
CPU speeds grew faster than memory speeds. Memory access dominates many workloads.
Solution: more cache, smarter prefetching, GPU-style throughput computing.
Energy
Memory access uses far more energy than computation.
For mobile / edge: reducing memory traffic matters.
Data layout patterns
Array of structures (AoS)
```c
struct Point { float x, y, z; };
Point points[N];
```
Good when accessing whole structs.
Structure of arrays (SoA)
```c
float xs[N], ys[N], zs[N];
```
Good when accessing one field across many elements (vectorization).
Mixed / hybrid
Critical data hot, less-used cold. Lay out to keep hot data dense.
Common failure patterns
Treating memory as flat
Code that ignores locality is slow.
Pointer-heavy structures
Linked lists, deep object graphs. Cache misses dominate.
Ignoring NUMA
In multi-socket systems, ignoring NUMA wastes bandwidth.
False sharing
Subtle multi-threaded performance bug.
Random access at scale
When data exceeds RAM, random access becomes catastrophic.
Page table thrashing
Working set exceeds TLB. Lots of page table walks.
Profiling
Tools:
- perf (Linux): cache misses, branch mispredictions
- Intel VTune: detailed cache analysis
- LIKWID, PAPI: counter access
- AMD uProf: AMD equivalent
Key counters:
- Cache misses per instruction
- TLB misses
- Memory bandwidth
- Cycles stalled on memory
Practical advice
1. Default to arrays over linked structures
2. Profile cache misses on hot paths
3. Pack data, align hot fields
4. Iterate in cache-friendly order
5. For multi-threaded: avoid false sharing
6. For NUMA systems: pin threads and allocate locally
For most application code: standard library defaults are good. Optimization matters when profiling reveals memory-bound code.
Further Reading
- [MemoryManagementFundamentals](MemoryManagementFundamentals) — Allocator design
- [JavaMemoryManagement](JavaMemoryManagement) — JVM specifics
- [QuantumComputing](QuantumComputing) — Different paradigm
- [Computer Science Foundations Hub](ComputerScienceFoundationsHub) — Cluster index