Write-Ahead Log (WAL)

The **Write-Ahead Log (WAL)** is a foundational pattern for providing durability and atomicity in databases and distributed systems. It resolves the conflict between the need for immediate data persistence and the high latency of random-access disk I/O by enforcing a "Log First, Act Later" protocol.

1. The Core Protocol

The WAL pattern mandates that every intended change to the system state must be written to a durable, append-only log file before the change is applied to the main data structures (e.g., B-Trees, SSTables, or In-Memory state).

The Write Sequence

1. **Request:** A client sends a write request.

2. **Append:** The system serializes the request into a **Log Record** and appends it to the end of the current log file.

3. **Synchronize:** The system executes an `fsync` or `force` call to ensure the OS buffers are flushed to physical media.

4. **Acknowledge:** Once the log is durable, the system acknowledges success to the client.

5. **Apply:** The change is asynchronously applied to the main data store (the "checkpointed" state).

2. Implementation Mechanics

Sequential vs. Random I/O

The primary performance advantage of WAL is that it converts non-contiguous updates to the data store (which require slow random seeks) into **sequential appends**. Sequential I/O is significantly faster on both spinning disks (HDDs) and flash storage (SSDs).

Group Commits

To maximize throughput, high-performance systems use **Group Commits**. Instead of calling `fsync` for every transaction, the system batches multiple pending log records into a single disk sync operation, amortizing the latency across many clients.

Log Record Structure

A robust WAL entry typically includes:

* **LSN (Log Sequence Number):** A monotonically increasing identifier used for ordering and recovery.

* **Checksum (CRC):** To detect data corruption during disk read.

* **Operation Type:** (e.g., `INSERT`, `UPDATE`, `DELETE`).

* **Redo Data:** The "new" value required to reconstruct the state during recovery.

3. Recovery and Checkpointing

Because the WAL grows with every write, it cannot be kept indefinitely. Systems manage this via **Checkpointing**.

1. **Flush State:** The system periodically flushes all pending in-memory changes to the main data files.

2. **Record Mark:** It writes a "Checkpoint" record to the WAL, noting the last LSN that is now "safe" in the main data files.

3. **Truncate:** All log segments older than the checkpoint are deleted or archived.

Crash Recovery Algorithm

If a system crashes and restarts, it enters the **Recovery Phase**:

* **Redo:** Starting from the last checkpoint, it replays all subsequent WAL records to restore the data store to the state it was in at the moment of the crash.

* **Integrity:** Because the log was flushed before the client was acknowledged, no "confirmed" data is ever lost.

4. Distributed WAL: The Basis of Consensus

In distributed clusters (using **Paxos** or **Raft**), the WAL is not merely local; it is a **Replicated Log**.

* **Quorum Replication:** A log entry is only considered "committed" (and thus "durable") once it has been successfully appended and synced on a **majority** of nodes in the cluster.

* **State Machine Replication:** Once a command is durably replicated in the log, every node applies it to their local state machine in the same order, guaranteeing consistency across the cluster.

See Also

* [Distributed Systems Hub](DistributedSystemsHub) — Central index of distributed patterns.

* [Majority Quorum](MajorityQuorum) — How majority overlap ensures consistency.

* [Consistent Core](ConsistentCore) — Using WAL-based clusters to manage metadata.