Data Lakehouse: Level 4 Maturity

The **Data Lakehouse** represents Level 4 of the [Data Maturity Lifecycle](DataMaturityLifecycle). It unifies the scalability of a Data Lake with the transactional reliability (ACID) of a Data Warehouse by layering metadata management over open file formats like Parquet.

1. The Core Technology: Apache Iceberg

While multiple formats exist (Delta Lake, Hudi), **Apache Iceberg** is the industry standard for vendor-neutral Lakehouse implementations. It moves the source of truth from "file listing" (which is slow on S3) to explicit "metadata pointers."

Technical Layer: The Iceberg Metadata Tree

1. **Metadata File (.json):** The root. Tracks the current snapshot ID and table schema.

2. **Manifest List (.avro):** Points to a set of Manifest Files for a specific snapshot. Includes min/max statistics for partition pruning.

3. **Manifest File (.avro):** Tracks individual data files (Parquet) and their lower/upper bounds for every column.

2. Concrete Example: ACID "Upserts" on S3

In a traditional lake (Level 3), updating one row means rewriting a massive Parquet file. In a Lakehouse (Level 4), Iceberg handles this via **Merge-on-Read (MoR)** or **Copy-on-Write (CoW)**.

**SQL Implementation (Iceberg):**

```sql

-- Updating a single customer's status in a 10TB table

MERGE INTO silver.customers t

USING (SELECT 'C123' as id, 'Active' as status) s

ON t.customer_id = s.id

WHEN MATCHED THEN UPDATE SET t.status = s.status;

```

**What happens under the hood:**

- Iceberg writes a small **Delete File** (tracking the old record) and a new **Data File** (with the update).

- Readers merge these files at query time, ensuring they always see the latest committed state.

3. Performance: The "Snapshot" Advantage

Because Iceberg uses immutable snapshots, it enables **Time Travel**:

```sql

-- Query the state of the table as of yesterday

SELECT * FROM gold.revenue FOR TIMESTAMP AS OF '2026-05-19 12:00:00';

```

This is critical for debugging data pipelines and auditing financial records without maintaining manual backups.

4. The Bridge to Level 5

Level 4 provides the technical foundation, but it still assumes a central team manages the Lakehouse. Level 5, the [Data Mesh Architecture](DataMeshArchitecture), decentralizes this technical stack across domain owners.

---

**See Also:**

- [Data Lake Architecture](DataLakeArchitecture) — The foundation for Lakehouses.

- [Change Data Capture](ChangeDataCapture) — Streaming data into the Lakehouse.

- [Data Mesh Architecture](DataMeshArchitecture) — Decentralized Lakehouse ownership.

---