Data Lakehouse: Level 4 Maturity
The **Data Lakehouse** represents Level 4 of the [Data Maturity Lifecycle](DataMaturityLifecycle). It unifies the scalability of a Data Lake with the transactional reliability (ACID) of a Data Warehouse by layering metadata management over open file formats like Parquet.
1. The Core Technology: Apache Iceberg
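While several open table formats exist (Delta Lake, Apache Hudi), **Apache Iceberg** is a widely adopted choice for vendor-neutral Lakehouse implementations. It moves the source of truth from directory listings (which are slow on object stores such as S3) to explicit metadata pointers: a table's contents are defined by the files its metadata references, not by whatever happens to sit under a path.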
Technical Layer: The Iceberg Metadata Tree
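1. **Metadata File (.json):** The root. Tracks the table schema, the partition spec, the list of snapshots, and the current snapshot ID.
2. **Manifest List (.avro):** One per snapshot. Lists the Manifest Files that make up that snapshot, with the range of partition values each one covers, so entire manifests can be skipped during partition pruning.
3. **Manifest File (.avro):** Tracks individual data files (Parquet) along with per-column statistics such as lower/upper bounds, so files that cannot match a filter are skipped.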
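The three layers above can be modeled with a small in-memory sketch. This is an illustration of the pruning idea only, not Iceberg's actual API or file layout: the class names, the `plan_scan` helper, the integer partition values, and the file paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    # Per-column (lower, upper) bounds, as a manifest file would record them.
    bounds: dict[str, tuple[int, int]]


@dataclass
class Manifest:
    path: str
    # (lower, upper) range of partition values covered by this manifest,
    # as summarized in the manifest list.
    partition_range: tuple[int, int]
    data_files: list[DataFile] = field(default_factory=list)


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list[Manifest]


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict[int, Snapshot]


def plan_scan(meta: TableMetadata, partition: int, column: str, value: int) -> list[str]:
    """Return paths of data files that may contain rows where column == value."""
    snapshot = meta.snapshots[meta.current_snapshot_id]
    selected = []
    for manifest in snapshot.manifest_list:
        lo, hi = manifest.partition_range
        if not lo <= partition <= hi:
            continue  # the whole manifest is pruned without being opened
        for data_file in manifest.data_files:
            col_lo, col_hi = data_file.bounds[column]
            if col_lo <= value <= col_hi:
                selected.append(data_file.path)
    return selected


# Hypothetical table: partitions are yyyymm integers, one snapshot, two manifests.
table = TableMetadata(
    current_snapshot_id=2,
    snapshots={
        2: Snapshot(
            snapshot_id=2,
            manifest_list=[
                Manifest("m1.avro", (202401, 202406), [
                    DataFile("a.parquet", {"customer_id": (1, 500)}),
                    DataFile("b.parquet", {"customer_id": (501, 1000)}),
                ]),
                Manifest("m2.avro", (202407, 202412), [
                    DataFile("c.parquet", {"customer_id": (1, 1000)}),
                ]),
            ],
        )
    },
)

files = plan_scan(table, partition=202403, column="customer_id", value=742)
print(files)  # ['b.parquet']
```

The planner never lists a directory: it walks from the root metadata to the manifests to the data files, and discards anything whose recorded bounds rule out a match.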
2. Concrete Example: ACID "Upserts" on S3
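In a traditional lake (Level 3), updating one row means locating and rewriting an entire Parquet file by hand, with no guarantee that concurrent readers see a consistent result. In a Lakehouse (Level 4), Iceberg performs the update as an atomic commit using one of two strategies: **Copy-on-Write (CoW)**, which rewrites the affected data files and swaps them in, or **Merge-on-Read (MoR)**, which leaves existing files in place and records the change in small delete files.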
**SQL Implementation (Iceberg):**
```sql
-- Updating a single customer's status in a 10TB table
MERGE INTO silver.customers t
USING (SELECT 'C123' as id, 'Active' as status) s
ON t.customer_id = s.id
WHEN MATCHED THEN UPDATE SET t.status = s.status;
```
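**What happens under the hood (Merge-on-Read):**
- Iceberg writes a small **Delete File** (marking the old record as removed) and a new **Data File** (containing the updated row). The original data file is not rewritten.
- Readers apply the delete files to the data files at query time, so they always see the latest committed state.
- Under Copy-on-Write, the affected data file is rewritten instead, trading slower writes for faster reads.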
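The read-time merge can be sketched in a few lines. This is a simplified model assuming positional deletes, where each delete entry names a data file and a row position. The function name, file names, and row contents are hypothetical, and a real reader works on Parquet and Avro files and not on in-memory lists.

```python
def read_merge_on_read(data_files, delete_files):
    """Merge data files with positional deletes at query time.

    data_files:   {file_path: [row, ...]}
    delete_files: [(file_path, row_position), ...]
    """
    deleted = set(delete_files)
    rows = []
    for path, file_rows in data_files.items():
        for pos, row in enumerate(file_rows):
            if (path, pos) not in deleted:
                rows.append(row)
    return rows


# State before the MERGE: one large, immutable data file.
data_files = {
    "data-001.parquet": [
        {"customer_id": "C122", "status": "Active"},
        {"customer_id": "C123", "status": "Pending"},
    ],
}
delete_files = []

# The MERGE commits two small files; data-001.parquet is left untouched.
delete_files.append(("data-001.parquet", 1))  # masks the old C123 row
data_files["data-002.parquet"] = [{"customer_id": "C123", "status": "Active"}]

rows = read_merge_on_read(data_files, delete_files)
print(rows)
```

The write touches only the two small new files, while each read pays the cost of applying the deletes. This is why tables written this way are usually compacted periodically.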
3. Performance: The "Snapshot" Advantage
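Because every commit produces an immutable snapshot, and the table metadata keeps a log of those snapshots, Iceberg can query the table as it existed at an earlier point. This is **Time Travel**: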
```sql
-- Query the table as it existed at a past point in time (here, noon on 2026-05-19)
SELECT * FROM gold.revenue FOR TIMESTAMP AS OF '2026-05-19 12:00:00';
```
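This is critical for debugging data pipelines and auditing financial records without maintaining manual backups. Time travel only reaches as far back as snapshots are retained, so the snapshot expiration policy must cover the required audit window.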
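Resolving a timestamp to a snapshot amounts to finding the latest commit at or before that time. The following is a minimal sketch of that lookup, assuming the snapshot log is available as a sorted list of `(commit_time, snapshot_id)` pairs. The `snapshot_as_of` helper and the snapshot IDs are hypothetical and do not come from any Iceberg library.

```python
from bisect import bisect_right
from datetime import datetime


def snapshot_as_of(snapshot_log, ts):
    """Return the ID of the latest snapshot committed at or before ts.

    snapshot_log: list of (commit_time, snapshot_id), sorted by commit_time.
    """
    commit_times = [t for t, _ in snapshot_log]
    idx = bisect_right(commit_times, ts)
    if idx == 0:
        # Either the table did not exist yet or older snapshots were expired.
        raise ValueError(f"no snapshot exists at or before {ts}")
    return snapshot_log[idx - 1][1]


# Hypothetical log: one nightly commit at 02:00.
snapshot_log = [
    (datetime(2026, 5, 18, 2, 0), 101),
    (datetime(2026, 5, 19, 2, 0), 102),
    (datetime(2026, 5, 20, 2, 0), 103),
]

sid = snapshot_as_of(snapshot_log, datetime(2026, 5, 19, 12, 0))
print(sid)  # 102
```

Once the snapshot ID is known, the query is planned against that snapshot's manifest list exactly as it would be for the current one, so no data is copied or restored.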
4. The Bridge to Level 5
Level 4 provides the technical foundation, but it still assumes a central team manages the Lakehouse. Level 5, the [Data Mesh Architecture](DataMeshArchitecture), decentralizes this technical stack across domain owners.
---
**See Also:**
- [Data Lake Architecture](DataLakeArchitecture) — The foundation for Lakehouses.
- [Change Data Capture](ChangeDataCapture) — Streaming data into the Lakehouse.
- [Data Mesh Architecture](DataMeshArchitecture) — Decentralized Lakehouse ownership.
---