Data Pipeline Design

A data pipeline moves and transforms data: from sources, through stages, to destinations. The patterns that work in production differ from the textbook ETL examples. Real pipelines must handle failures, late-arriving data, schema changes, retries, and observability.

This page covers the patterns that hold up.

The components

Most pipelines have:

Sources: where data comes from (databases, APIs, event streams, files)
Ingestion: pulling data into your system
Transformation: cleaning, joining, aggregating
Storage: data lake, warehouse, operational store
Orchestration: scheduling, dependencies, retries
Observability: monitoring, alerting, lineage

Idempotency

The most important property. Running the same pipeline twice should produce the same result.

Why it matters:

Failures happen; retries are expected
Backfills require re-running
Bugs require fixing data after the fact

Idempotency requires:

Deterministic transformations: same input → same output
Upserts not inserts: re-running doesn't double-insert
Time windows that don't shift: yesterday's job processes a defined window

Anti-patterns:

Auto-incrementing IDs assigned in pipeline (re-runs produce different IDs)
"Just append" patterns (duplicates on retry)
Operations on "yesterday" without explicit dates (retry days later runs different data)

Partitioning

Most pipelines partition by date:

data/
  events/
    date=2026-04-25/
    date=2026-04-26/
    date=2026-04-27/

Each partition is independent. Re-running just 2026-04-26 doesn't affect other dates. Failures isolate to specific partitions.

Partition keys vary:

Date (most common)
Date + hour for high-volume
Date + region or customer
Whatever makes work parallelizable

Late-arriving data

Real-world data arrives late. An event from 2026-04-25 might appear in your source on 2026-04-27.

Strategies:

Reprocessing window: re-run last N days of partitions to catch late data
Watermarking: partition by event time, not arrival time
Allow late merge: pipeline designed to update partitions when late data arrives

The right approach depends on data volume and timeliness requirements. Most pipelines need to handle late data; pretending it doesn't exist produces incorrect analytics.

Schema evolution

Source data structure changes. Field added, type changed, field renamed.

Patterns:

Schema-on-read: store raw; apply schema when reading. Flexible but slow.
Schema-on-write: enforce schema at ingestion. Fast but rigid.
Forward-compatible serialization: Avro, Protobuf, or schema registry that handles evolution

For data lakes (S3 + Parquet), schema-on-read is common. For warehouses, schema-on-write.

Either way, plan for schema changes. They will happen.

Orchestration

The scheduler that runs pipelines, manages dependencies, retries failures.

Apache Airflow

The dominant choice. DAG (Directed Acyclic Graph) of tasks; Python-based.

Pros: powerful; large community; many operators. Cons: heavyweight; UI is dated; running it well requires real ops investment.

Prefect, Dagster, Mage

Modern alternatives. Python-based; emphasize developer ergonomics; cleaner UIs.

dbt

For warehouse-resident transformations. Different role than orchestration; sometimes paired with Airflow. See DbtAndAnalyticsEngineering.

Cloud-native

AWS Step Functions, GCP Cloud Composer (managed Airflow), Azure Data Factory. Less ops; cloud-specific.

For most teams, managed Airflow or one of the modern alternatives is the right choice.

Observability

Pipelines fail in subtle ways. Without observability, failures are silent.

Required:

Job-level metrics: runtime, rows processed, error count
Logging: structured; queryable
Lineage: which datasets depend on which
Data quality checks: row counts, null rates, value distributions

Tools: dbt has tests; Great Expectations and Soda for explicit data quality; OpenLineage for cross-tool lineage.

Without these, "is the pipeline OK?" is unanswerable.

Backfills

Re-running pipelines for past data. Common reasons:

Bug found; re-process to fix
New computation added to historical data
Schema change requires regeneration

Designing for backfill:

Idempotent (covered above)
Partitioned (so you re-run specific partitions)
Resource-throttled (don't crush the warehouse during backfill)
Documented sequence (which partitions in what order)

Backfills that take a week to plan are common. Designing for them up front saves time.

Streaming vs. batch

Two paradigms:

Batch

Pipelines run on a schedule. Daily, hourly, every 15 minutes. Process accumulated data.

Pros: simpler; recovery is easier; tooling is mature. Cons: latency = batch interval.

Streaming

Pipelines run continuously. Each event processed as it arrives.

Pros: low latency. Cons: complex; harder to debug; harder to backfill.

Most pipelines should be batch. Stream when latency genuinely matters (real-time fraud, real-time recommendations).

The "Lambda architecture" — batch + streaming both running the same logic — has fallen out of favor; modern systems use one or the other.

Common failure patterns

Non-idempotent pipelines. Retries cause duplicates.
No partitioning. Failures affect everything.
No observability. Silent failures.
Streaming when batch would do. Complexity without benefit.
No data quality checks. Bad data flows downstream.
No backfill plan. Recovery from bugs takes weeks.
Heavy initialization per task. Slow pipelines.