dbt and Analytics Engineering

dbt (data build tool) is the dominant tool for warehouse-resident transformations. Combined with the rise of cloud data warehouses, dbt enabled a shift in how data work is done — and a new role, "analytics engineering," which sits between data engineering and data analysis.

This page covers how dbt works and what analytics engineering actually means.

What dbt does

dbt is a tool for transforming data inside a warehouse using SQL.

The core idea: write SQL SELECT statements that define transformations; dbt executes them in dependency order; the result is materialized tables/views in the warehouse.

-- models/staging/stg_orders.sql
SELECT
    id AS order_id,
    customer_id,
    amount,
    status,
    created_at
FROM {{ source('raw', 'orders') }}

The {{ source(...) }} is dbt's Jinja templating. dbt resolves it to the actual table name. Models reference each other via ref():

-- models/marts/customer_lifetime_value.sql
SELECT
    customer_id,
    SUM(amount) AS lifetime_value
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY customer_id

dbt builds a DAG of dependencies. Run dbt run and it executes models in order.

The analytics engineering role

The role that emerged with dbt:

The traditional split

Data engineers: pipelines, infrastructure, performance
Data analysts / scientists: SQL queries, dashboards, insights

The gap

Between raw data and analyst-ready data, transformations are needed. Traditionally, data engineers built these in code (Python, Spark) or analysts did ad-hoc SQL. Neither was great.

Analytics engineering

A new role that:

Owns the transformation layer (using dbt)
Bridges data engineering and analysis
Software engineering practices applied to SQL: testing, version control, code review, documentation
Result: trustworthy data marts that analysts can use

dbt is the tool; analytics engineering is the practice.

dbt features that matter

`ref()` for dependencies

Models reference each other. dbt builds the DAG; runs in correct order.

Tests

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

Run dbt test — assertions on the data. Catches regressions, broken assumptions, schema drift.

Documentation

Every model can have description; columns can have descriptions and tests. dbt docs generate produces a website.

Macros

Reusable SQL snippets. For common patterns (date casting, type coercion, etc.).

Incremental models

For large tables, only process new rows:

{{ config(materialized='incremental', unique_key='id') }}

SELECT * FROM {{ ref('source') }}
{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

Speeds up runs dramatically for append-mostly tables.

Sources

sources:
  - name: raw
    schema: raw_data
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 12, period: hour }

Source freshness checks; lineage from external systems.

Patterns

Three-layer model structure

sources → staging → intermediate → marts

Sources: raw data
Staging (stg_*): one model per source table; cleaning, naming
Intermediate (int_*): business logic, joins
Marts (fct_*, dim_*): final shape for analysts

The standardized structure makes large dbt projects manageable.

Tests as CI

Run dbt test in CI. Tests must pass before merge. Catches data issues before they reach production.

Version control

dbt projects live in git like any code. Branch, PR, review. The same engineering rigor as application code.

Documentation as code

Documentation lives next to models. Stays current.

When dbt is the right tool

Warehouse-resident transformations (BigQuery, Snowflake, Redshift, etc.)
SQL-friendly transformations
Team that values software engineering practices for data

When dbt isn't enough

Streaming transformations (use Flink, Spark Streaming)
Heavy ML preprocessing (use Spark, Python)
Source-side transformations (use ingestion tools)

dbt is for "transform inside the warehouse." Other transformations need other tools.

Common failure patterns

No tests. Models break silently when source data changes.
No structure. All models in one folder; nobody can navigate.
Heavy logic in marts without intermediate. Hard to debug; hard to reuse.
Macros where SQL would do. Premature abstraction.
No documentation. Other people can't use the data.
Heavy non-incremental models. Slow runs that could be incremental.