Write-Ahead Log (WAL)

The Write-Ahead Log (WAL) is a foundational pattern for providing durability and atomicity in databases and distributed systems. It resolves the conflict between the need for immediate data persistence and the high latency of random-access disk I/O by enforcing a "Log First, Act Later" protocol.

1. The Core Protocol

The WAL pattern mandates that every intended change to the system state be written to a durable, append-only log file before the change is applied to the main data structures (e.g., B-Trees, SSTables, or in-memory state).

The Write Sequence

Request: A client sends a write request.
Append: The system serializes the request into a Log Record and appends it to the end of the current log file.
Synchronize: The system executes an fsync or force call to ensure the OS buffers are flushed to physical media.
Acknowledge: Once the log is durable, the system acknowledges success to the client.
Apply: The change is applied to the main data store, often asynchronously. Many systems update their in-memory structures right away and write the on-disk data files later, at a checkpoint.
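The sequence above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the class name TinyWAL, the JSON-lines record format, and the in-memory dict standing in for the main data store are all hypothetical, and the apply step runs inline rather than asynchronously to keep the example short.

```python
import json
import os
import tempfile


class TinyWAL:
    """Hypothetical single-file WAL guarding an in-memory dict."""

    def __init__(self, path):
        self.path = path
        self.log = open(path, "ab")   # append-only binary log
        self.state = {}               # stands in for the main data store

    def put(self, key, value):
        # Append: serialize the request as one log record (one JSON line).
        record = json.dumps({"op": "PUT", "key": key, "value": value})
        self.log.write(record.encode("utf-8") + b"\n")
        # Synchronize: flush Python's buffer, then ask the OS to flush its own.
        self.log.flush()
        os.fsync(self.log.fileno())
        # Apply: done inline here for brevity; real systems often defer it.
        self.state[key] = value
        # Acknowledge: only reached after the record has been synced.
        return True

    def close(self):
        self.log.close()


wal_dir = tempfile.mkdtemp()
wal = TinyWAL(os.path.join(wal_dir, "wal.log"))
acked = wal.put("user:1", "alice")
wal.close()
```

The ordering is the point: the return statement that acts as the acknowledgement cannot be reached unless os.fsync has returned without raising.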

2. Implementation Mechanics

Sequential vs. Random I/O

The primary performance advantage of WAL is that it converts non-contiguous updates to the data store (which require slow random seeks) into sequential appends. Sequential I/O is much faster on spinning disks (HDDs), where it avoids seek and rotational delays, and is generally still faster on flash storage (SSDs), although the gap there is smaller.

Group Commits

To maximize throughput, high-performance systems use Group Commits. Instead of calling fsync for every transaction, the system batches multiple pending log records into a single disk sync operation, amortizing the latency across many clients.
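A rough sketch of the batching idea follows. It is single-threaded and deliberately simplified: the GroupCommitLog class and its submit and commit_batch methods are hypothetical names, and a real implementation would typically have writer threads wait on a shared flush instead of calling commit_batch by hand.

```python
import os
import tempfile


class GroupCommitLog:
    """Hypothetical log that syncs once per batch rather than once per record."""

    def __init__(self, path):
        self.log = open(path, "ab")
        self.pending = []      # records waiting for the next sync
        self.sync_count = 0    # number of fsync calls issued so far

    def submit(self, record: bytes):
        # Queue the record; the caller is not acknowledged yet.
        self.pending.append(record)

    def commit_batch(self):
        # Write every queued record, then pay for a single fsync.
        batch, self.pending = self.pending, []
        if not batch:
            return 0
        for record in batch:
            self.log.write(record + b"\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.sync_count += 1
        return len(batch)      # all of these writers can now be acknowledged


log_dir = tempfile.mkdtemp()
glog = GroupCommitLog(os.path.join(log_dir, "wal.log"))
for i in range(100):
    glog.submit(f"txn-{i}".encode("utf-8"))
acked_count = glog.commit_batch()   # 100 records share one fsync
glog.log.close()
```

The trade-off is latency for throughput: an individual writer may wait slightly longer for its batch to fill, but the cost of each sync is shared by every record in the batch.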

Log Record Structure

A robust WAL entry typically includes:

LSN (Log Sequence Number): A monotonically increasing identifier used for ordering and recovery.
Checksum (CRC): Detects corruption when the record is read back, including a partially written (torn) record at the tail of the log.
Operation Type: (e.g., INSERT, UPDATE, DELETE).
Redo Data: The "new" value required to reconstruct the state during recovery.
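One possible binary encoding of such a record is sketched below. The field widths, the byte order, the numeric operation codes, and the function names encode_record and decode_record are assumptions made for illustration; real systems each define their own layout.

```python
import struct
import zlib

# Hypothetical layout: LSN (u64), op code (u8), payload length (u32),
# then the redo payload, then a CRC32 (u32) over everything before it.
HEADER = struct.Struct("<QBI")
OPS = {"INSERT": 1, "UPDATE": 2, "DELETE": 3}
OP_NAMES = {code: name for name, code in OPS.items()}


def encode_record(lsn, op, redo):
    """Serialize one log record and append its checksum."""
    body = HEADER.pack(lsn, OPS[op], len(redo)) + redo
    return body + struct.pack("<I", zlib.crc32(body))


def decode_record(buf):
    """Parse one log record, raising ValueError if the checksum does not match."""
    lsn, op_code, length = HEADER.unpack_from(buf, 0)
    end = HEADER.size + length
    body = buf[:end]
    (stored_crc,) = struct.unpack_from("<I", buf, end)
    if zlib.crc32(body) != stored_crc:
        raise ValueError("checksum mismatch: record is corrupt or torn")
    return lsn, OP_NAMES[op_code], body[HEADER.size:]


record = encode_record(7, "UPDATE", b"balance=90")
decoded = decode_record(record)
```

Because the checksum covers the header as well as the payload, a flipped bit in the LSN or the length field is caught in the same way as damage to the redo data.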

3. Recovery and Checkpointing

Because the WAL grows with every write, it cannot be kept indefinitely. Systems manage this via Checkpointing.

Flush State: The system periodically flushes all pending in-memory changes to the main data files.
Record Mark: It writes a "Checkpoint" record to the WAL, noting the last LSN that is now "safe" in the main data files.
Truncate: All log segments older than the checkpoint are deleted or archived.
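The three checkpoint steps can be modeled in memory as follows. This is a toy sketch: CheckpointingStore is a hypothetical class, plain lists and dicts stand in for log segments and data files, and the checkpoint mark is kept in a field instead of being written to the log as a record.

```python
class CheckpointingStore:
    """Hypothetical in-memory model; lists and dicts stand in for files."""

    def __init__(self):
        self.wal = []            # list of (lsn, key, value) records
        self.memtable = {}       # changes not yet flushed to the data files
        self.data_file = {}      # stands in for the main data files
        self.checkpoint_lsn = 0  # last LSN known to be safe in data_file
        self.next_lsn = 1

    def put(self, key, value):
        self.wal.append((self.next_lsn, key, value))
        self.memtable[key] = value
        self.next_lsn += 1

    def checkpoint(self):
        # Flush State: push pending in-memory changes into the data files.
        self.data_file.update(self.memtable)
        self.memtable.clear()
        # Record Mark: remember the last LSN covered by the flush.
        self.checkpoint_lsn = self.next_lsn - 1
        # Truncate: drop records at or below the checkpoint LSN.
        self.wal = [r for r in self.wal if r[0] > self.checkpoint_lsn]


store = CheckpointingStore()
store.put("a", 1)
store.put("b", 2)
store.checkpoint()
store.put("c", 3)   # only this record remains in the log
```

The order of the steps matters: the log is truncated only after the flush has completed, so every change is always recoverable from either the data files or the log.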

Crash Recovery Algorithm

If a system crashes and restarts, it enters the Recovery Phase:

Redo: Starting from the last checkpoint, it replays all subsequent WAL records to restore the data store to the state it was in at the moment of the crash.
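Integrity: Because the log was flushed before the client was acknowledged, every acknowledged write is present in the log and is restored by the replay, provided the storage stack actually honors the sync call.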
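The redo pass can be sketched as a pure function. The recover function, the (lsn, op, key, value) record shape, and the dict used as the data store are all hypothetical; the sketch also assumes every record in the log has already passed its checksum.

```python
def recover(data_file, checkpoint_lsn, wal_records):
    """Replay records newer than the checkpoint onto a copy of the data file state."""
    state = dict(data_file)
    # Replay strictly in LSN order so later writes win over earlier ones.
    for lsn, op, key, value in sorted(wal_records, key=lambda r: r[0]):
        if lsn <= checkpoint_lsn:
            continue  # already reflected in the data files
        if op == "DELETE":
            state.pop(key, None)
        else:
            # INSERT and UPDATE carry the new value as redo data.
            state[key] = value
    return state


data_file = {"a": 1}          # contents as of the last checkpoint
wal_records = [
    (2, "UPDATE", "a", 1),    # at the checkpoint, skipped
    (3, "INSERT", "b", 2),
    (4, "DELETE", "a", None),
    (5, "UPDATE", "b", 5),
]
restored = recover(data_file, checkpoint_lsn=2, wal_records=wal_records)
# restored == {"b": 5}
```

Redo in this form is idempotent: running recover again over the same inputs yields the same state, which matters if the system crashes a second time during recovery.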

4. Distributed WAL: The Basis of Consensus

In distributed clusters (using Paxos or Raft), the WAL is not merely local; it is a Replicated Log.

Quorum Replication: A log entry is only considered "committed" (and thus "durable") once it has been successfully appended and synced on a majority of nodes in the cluster.
State Machine Replication: Once a command is committed in the replicated log, every node applies it to its local state machine in the same order, guaranteeing consistency across the cluster.
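The quorum rule reduces to a small calculation over the highest LSN each node has synced. The commit_index function below is a hypothetical helper, and it is a simplification: protocols such as Raft apply further conditions (for example, checks on the leader's term) before treating an entry as committed.

```python
def commit_index(synced_lsns):
    """Highest LSN that a majority of nodes have appended and synced."""
    majority = len(synced_lsns) // 2 + 1
    # After sorting in descending order, the value at position majority - 1
    # is held by at least `majority` nodes.
    return sorted(synced_lsns, reverse=True)[majority - 1]


# Five nodes: three of them have synced up to LSN 8 or beyond.
five_node_commit = commit_index([10, 8, 8, 5, 3])   # 8
# Three nodes: two of them have synced up to LSN 4 or beyond.
three_node_commit = commit_index([7, 4, 2])         # 4
```

Entries up to the returned LSN survive the loss of any minority of nodes, which is why only those entries are handed to the state machine.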