ETL vs. ELT

The traditional pattern: Extract, Transform, Load (ETL). Pull data from sources, transform in a separate process, load the cleaned data into the warehouse.

The modern pattern: Extract, Load, Transform (ELT). Pull raw data, load it as-is into the warehouse, transform inside the warehouse using SQL.

The shift is real and recent. ELT is the dominant paradigm in modern data stacks.

Why ETL was the default

Old data warehouses (Teradata, Oracle, on-prem) were:

- Expensive (licensed by capacity)

- Constrained (limited storage and compute)

- Slow at heavy transformation

Loading raw data and transforming inside was expensive. The natural pattern: clean and reduce data before loading; warehouse only sees the final shape.

Tools: Informatica, Talend, custom Python/Java jobs running outside the warehouse.

What changed

Modern cloud data warehouses (Snowflake, BigQuery, Redshift, Databricks) are:

- Storage and compute decoupled

- Storage cheap; compute scales elastically

- SQL-on-anything capable

Loading raw data is cheap. Transformation in SQL is fast. The warehouse can do what dedicated ETL tools used to do, but in SQL that analysts can read and write.

The natural pattern flipped: load raw; transform in warehouse.

ELT in practice

```

Sources → Ingestion (Fivetran, Airbyte, custom) → Raw warehouse layer → dbt → Analytics tables

```

The warehouse holds:

- **Raw**: as it came from the source

- **Staging**: light cleaning, naming

- **Intermediate**: business logic

- **Marts**: shaped for analysis

Each layer is SQL transforms over the previous. dbt is the dominant tool for managing these transforms. See [DbtAndAnalyticsEngineering](DbtAndAnalyticsEngineering).

Pros of ELT

Simplicity

Most transformations are SQL. Analysts can read and modify them. No separate ETL skill required.

Reproducibility

Raw data preserved. Transformations re-run produce the same result. Bug found? Fix the SQL; re-run.

Discoverability

Data is in the warehouse already. Analysts can explore raw and transformed data freely.

Speed

Modern warehouses are fast. Transformations in SQL on Snowflake/BigQuery often beat external ETL.

Backfills are easier

Raw data is there; just re-run transforms. No need to re-extract from sources.

Pros of ETL (still)

Some transformations don't fit SQL

Complex JSON parsing, ML feature engineering, image processing. Sometimes the transform belongs in code, not SQL.

Privacy / compliance

Sometimes you can't load raw data. PII redaction must happen before warehouse.

Source system load

Heavy raw data load can stress source systems. Lightweight extracts may be necessary.

Cost at extreme scale

ELT in cloud warehouses is cheap until it isn't. At extreme scale, dedicated processing (Spark, Flink) can beat warehouse compute.

When ETL still wins

- PII or compliance requires pre-load redaction

- Transformation is non-SQL-shaped (ML, complex parsing)

- Source data is enormous and warehouse loading is impractical

- Specific cost optimization at scale

For most modern data stacks, ELT is right. ETL is for specific cases.

The "ELTL" hybrid

Some modern stacks do:

- Extract → Load (raw) → Transform (in warehouse) → Load (to operational stores)

The transformations happen in the warehouse; the result is loaded to operational stores (search indexes, caches, etc.) for low-latency access.

This is functional ELT plus a "publish" step.

Tooling

Ingestion (the "EL" part)

- **Fivetran, Stitch**: managed connectors; pay for convenience

- **Airbyte**: open-source equivalent

- **Custom**: when no managed connector exists

Warehouse (the "T" target)

- **Snowflake**: dominant; multi-cloud

- **BigQuery**: GCP-native; very fast

- **Redshift**: AWS-native; older

- **Databricks**: lakehouse architecture

Transformation (the "T" engine)

- **dbt**: dominant for SQL transforms

- **SQL**: for simpler cases

- **Spark / PySpark**: for non-SQL transformations

Orchestration

- **Airflow**: heavy; classic

- **Prefect, Dagster**: modern alternatives

- **dbt Cloud**: integrated with dbt

Common failure patterns

- **Treating ELT as "load everything raw"** without thinking about what's actually needed. Massive raw data with no usage is just cost.

- **No transformation testing.** SQL transforms can have bugs; tests are needed.

- **Heavy lift-and-shift of old ETL.** Often the right answer is to redesign around ELT, not port.

- **Privacy data loaded raw.** PII in the warehouse without redaction.

- **dbt without lineage tracking.** Can't tell what depends on what.

Further Reading

- [DataPipelineDesign](DataPipelineDesign) — Pipeline patterns

- [DbtAndAnalyticsEngineering](DbtAndAnalyticsEngineering) — The transformation tool

- [DataModelingFundamentals](DataModelingFundamentals) — What you're modeling

- [CloudDatabases](CloudDatabases) — Warehouse options

- [DataEngineering Hub](DataEngineeringHub) — Cluster index