Data Pipeline Design
A data pipeline moves and transforms data: from sources, through stages, to destinations. The patterns that work in production differ from the textbook ETL examples. Real pipelines must handle failures, late-arriving data, schema changes, retries, and observability.
This page covers the patterns that hold up.
The components
Most pipelines have:
- **Sources**: where data comes from (databases, APIs, event streams, files)
- **Ingestion**: pulling data into your system
- **Transformation**: cleaning, joining, aggregating
- **Storage**: data lake, warehouse, operational store
- **Orchestration**: scheduling, dependencies, retries
- **Observability**: monitoring, alerting, lineage
Idempotency
The most important property. Running the same pipeline twice should produce the same result.
Why it matters:
- Failures happen; retries are expected
- Backfills require re-running
- Bugs require fixing data after the fact
Idempotency requires:
- **Deterministic transformations**: same input → same output
- **Upserts not inserts**: re-running doesn't double-insert
- **Time windows that don't shift**: yesterday's job processes a defined window
Anti-patterns:
- Auto-incrementing IDs assigned in pipeline (re-runs produce different IDs)
- "Just append" patterns (duplicates on retry)
- Operations on "yesterday" without explicit dates (retry days later runs different data)
Partitioning
Most pipelines partition by date:
```
data/
events/
date=2026-04-25/
date=2026-04-26/
date=2026-04-27/
```
Each partition is independent. Re-running just `2026-04-26` doesn't affect other dates. Failures isolate to specific partitions.
Partition keys vary:
- Date (most common)
- Date + hour for high-volume
- Date + region or customer
- Whatever makes work parallelizable
Late-arriving data
Real-world data arrives late. An event from 2026-04-25 might appear in your source on 2026-04-27.
Strategies:
- **Reprocessing window**: re-run last N days of partitions to catch late data
- **Watermarking**: partition by event time, not arrival time
- **Allow late merge**: pipeline designed to update partitions when late data arrives
The right approach depends on data volume and timeliness requirements. Most pipelines need to handle late data; pretending it doesn't exist produces incorrect analytics.
Schema evolution
Source data structure changes. Field added, type changed, field renamed.
Patterns:
- **Schema-on-read**: store raw; apply schema when reading. Flexible but slow.
- **Schema-on-write**: enforce schema at ingestion. Fast but rigid.
- **Forward-compatible serialization**: Avro, Protobuf, or schema registry that handles evolution
For data lakes (S3 + Parquet), schema-on-read is common. For warehouses, schema-on-write.
Either way, plan for schema changes. They will happen.
Orchestration
The scheduler that runs pipelines, manages dependencies, retries failures.
Apache Airflow
The dominant choice. DAG (Directed Acyclic Graph) of tasks; Python-based.
Pros: powerful; large community; many operators.
Cons: heavyweight; UI is dated; running it well requires real ops investment.
Prefect, Dagster, Mage
Modern alternatives. Python-based; emphasize developer ergonomics; cleaner UIs.
dbt
For warehouse-resident transformations. Different role than orchestration; sometimes paired with Airflow. See [DbtAndAnalyticsEngineering](DbtAndAnalyticsEngineering).
Cloud-native
AWS Step Functions, GCP Cloud Composer (managed Airflow), Azure Data Factory. Less ops; cloud-specific.
For most teams, managed Airflow or one of the modern alternatives is the right choice.
Observability
Pipelines fail in subtle ways. Without observability, failures are silent.
Required:
- **Job-level metrics**: runtime, rows processed, error count
- **Logging**: structured; queryable
- **Lineage**: which datasets depend on which
- **Data quality checks**: row counts, null rates, value distributions
Tools: dbt has tests; Great Expectations and Soda for explicit data quality; OpenLineage for cross-tool lineage.
Without these, "is the pipeline OK?" is unanswerable.
Backfills
Re-running pipelines for past data. Common reasons:
- Bug found; re-process to fix
- New computation added to historical data
- Schema change requires regeneration
Designing for backfill:
- Idempotent (covered above)
- Partitioned (so you re-run specific partitions)
- Resource-throttled (don't crush the warehouse during backfill)
- Documented sequence (which partitions in what order)
Backfills that take a week to plan are common. Designing for them up front saves time.
Streaming vs. batch
Two paradigms:
Batch
Pipelines run on a schedule. Daily, hourly, every 15 minutes. Process accumulated data.
Pros: simpler; recovery is easier; tooling is mature.
Cons: latency = batch interval.
Streaming
Pipelines run continuously. Each event processed as it arrives.
Pros: low latency.
Cons: complex; harder to debug; harder to backfill.
Most pipelines should be batch. Stream when latency genuinely matters (real-time fraud, real-time recommendations).
The "Lambda architecture" — batch + streaming both running the same logic — has fallen out of favor; modern systems use one or the other.
Common failure patterns
- **Non-idempotent pipelines.** Retries cause duplicates.
- **No partitioning.** Failures affect everything.
- **No observability.** Silent failures.
- **Streaming when batch would do.** Complexity without benefit.
- **No data quality checks.** Bad data flows downstream.
- **No backfill plan.** Recovery from bugs takes weeks.
- **Heavy initialization per task.** Slow pipelines.
Further Reading
- [EtlVsElt](EtlVsElt) — Where transformation happens
- [MapReduceParadigm](MapReduceParadigm) — Foundational batch model
- [DbtAndAnalyticsEngineering](DbtAndAnalyticsEngineering) — Warehouse-resident transform
- [DataModelingFundamentals](DataModelingFundamentals) — What pipelines produce
- [DataEngineering Hub](DataEngineeringHub) — Cluster index