dbt and Analytics Engineering

dbt (data build tool) is the dominant tool for warehouse-resident transformations. Combined with the rise of cloud data warehouses, dbt enabled a shift in how data work is done — and a new role, "analytics engineering," which sits between data engineering and data analysis.

This page covers how dbt works and what analytics engineering actually means.

What dbt does

dbt is a tool for transforming data inside a warehouse using SQL.

The core idea: write SQL SELECT statements that define transformations; dbt executes them in dependency order; the result is materialized tables/views in the warehouse.

```sql

-- models/staging/stg_orders.sql

SELECT

id AS order_id,

customer_id,

amount,

status,

created_at

FROM {{ source('raw', 'orders') }}

```

The `{{ source(...) }}` is dbt's Jinja templating. dbt resolves it to the actual table name. Models reference each other via `ref()`:

```sql

-- models/marts/customer_lifetime_value.sql

SELECT

customer_id,

SUM(amount) AS lifetime_value

FROM {{ ref('stg_orders') }}

WHERE status = 'completed'

GROUP BY customer_id

```

dbt builds a DAG of dependencies. Run `dbt run` and it executes models in order.

The analytics engineering role

The role that emerged with dbt:

The traditional split

- **Data engineers**: pipelines, infrastructure, performance

- **Data analysts / scientists**: SQL queries, dashboards, insights

The gap

Between raw data and analyst-ready data, transformations are needed. Traditionally, data engineers built these in code (Python, Spark) or analysts did ad-hoc SQL. Neither was great.

Analytics engineering

A new role that:

- Owns the transformation layer (using dbt)

- Bridges data engineering and analysis

- Software engineering practices applied to SQL: testing, version control, code review, documentation

- Result: trustworthy data marts that analysts can use

dbt is the tool; analytics engineering is the practice.

dbt features that matter

`ref()` for dependencies

Models reference each other. dbt builds the DAG; runs in correct order.

Tests

```yaml

models:

- name: stg_orders

columns:

- name: order_id

tests:

- unique

- not_null

```

Run `dbt test` — assertions on the data. Catches regressions, broken assumptions, schema drift.

Documentation

Every model can have description; columns can have descriptions and tests. `dbt docs generate` produces a website.

Macros

Reusable SQL snippets. For common patterns (date casting, type coercion, etc.).

Incremental models

For large tables, only process new rows:

```sql

{{ config(materialized='incremental', unique_key='id') }}

SELECT * FROM {{ ref('source') }}

{% if is_incremental() %}

WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})

{% endif %}

```

Speeds up runs dramatically for append-mostly tables.

Sources

```yaml

sources:

- name: raw

schema: raw_data

tables:

- name: orders

loaded_at_field: _ingested_at

freshness:

warn_after: { count: 12, period: hour }

```

Source freshness checks; lineage from external systems.

Patterns

Three-layer model structure

```

sources → staging → intermediate → marts

```

- **Sources**: raw data

- **Staging** (`stg_*`): one model per source table; cleaning, naming

- **Intermediate** (`int_*`): business logic, joins

- **Marts** (`fct_*`, `dim_*`): final shape for analysts

The standardized structure makes large dbt projects manageable.

Tests as CI

Run `dbt test` in CI. Tests must pass before merge. Catches data issues before they reach production.

Version control

dbt projects live in git like any code. Branch, PR, review. The same engineering rigor as application code.

Documentation as code

Documentation lives next to models. Stays current.

When dbt is the right tool

- Warehouse-resident transformations (BigQuery, Snowflake, Redshift, etc.)

- SQL-friendly transformations

- Team that values software engineering practices for data

When dbt isn't enough

- Streaming transformations (use Flink, Spark Streaming)

- Heavy ML preprocessing (use Spark, Python)

- Source-side transformations (use ingestion tools)

dbt is for "transform inside the warehouse." Other transformations need other tools.

Common failure patterns

- **No tests.** Models break silently when source data changes.

- **No structure.** All models in one folder; nobody can navigate.

- **Heavy logic in marts without intermediate.** Hard to debug; hard to reuse.

- **Macros where SQL would do.** Premature abstraction.

- **No documentation.** Other people can't use the data.

- **Heavy non-incremental models.** Slow runs that could be incremental.

Further Reading

- [EtlVsElt](EtlVsElt) — Why dbt fits the modern stack

- [DataPipelineDesign](DataPipelineDesign) — Where dbt sits in pipelines

- [DataModelingFundamentals](DataModelingFundamentals) — What you're modeling

- [CleanCodePrinciples](CleanCodePrinciples) — Apply to SQL too

- [DataEngineering Hub](DataEngineeringHub) — Cluster index