Data Observability

Software observability asks "Is the server up?" Data observability asks "Is the data correct?" In a modern data stack, a pipeline can look fully "healthy" to your orchestrator, with every task succeeding, while still delivering garbage data to your downstream dashboards.

The Five Pillars of Data Observability

1. **Freshness:** Is the data up to date? (e.g., Has the `daily_sales` table been updated in the last 24 hours?)

2. **Distribution:** Are the values within expected ranges? (e.g., Is the `null_rate` for `user_email` suddenly 50%?)

3. **Volume:** Did we get the expected number of rows? (e.g., A 90% drop in ingestion volume usually points to a source-system failure.)

4. **Schema:** Did an upstream producer change a column name or type without notice?

5. **Lineage:** When a metric breaks, which upstream table caused it?
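
Three of the pillars above (freshness, distribution, schema) can be sketched as small standalone checks. This is a minimal illustration, not a production monitor: the function names, the 24-hour threshold, and the sample column names and types are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_updated: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=24)) -> bool:
    """Freshness: True if the table was updated within max_age of now."""
    return now - last_updated <= max_age


def null_rate(values: list) -> float:
    """Distribution: fraction of values that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def schema_diff(expected: dict, actual: dict) -> dict:
    """Schema: report columns that were removed, added, or changed type."""
    return {
        "removed": sorted(expected.keys() - actual.keys()),
        "added": sorted(actual.keys() - expected.keys()),
        "retyped": sorted(
            col for col in expected.keys() & actual.keys()
            if expected[col] != actual[col]
        ),
    }


# Hypothetical sample inputs, chosen only to exercise each check.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(is_fresh(datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc), now))
print(null_rate(["a@example.com", None, "b@example.com", None]))
print(schema_diff({"user_id": "int", "user_email": "text"},
                  {"user_id": "int", "email": "text"}))
```

In practice the inputs would come from warehouse metadata (last-modified timestamps, column statistics, the information schema) rather than in-memory values.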

Implementing Automated Quality Checks

Do not rely on manual audits. Use **dbt tests** or **Great Expectations** to enforce quality at the pipeline level.

Example: dbt Data Test

```yaml
# schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

Drift Detection with SQL

You can implement basic observability using simple SQL checks run by your orchestrator (Airflow/Dagster).

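```sql
-- Volume anomaly check: compare today's row count with the trailing 7-day average.
-- Returns a row (an alert) only when today's volume is below 50% of that average.
WITH stats AS (
    -- Baseline: the 7 complete days before today. Today is excluded so that a
    -- partial or failed load does not drag the average down.
    SELECT count(*) / 7.0 AS avg_count  -- 7.0 avoids integer division
    FROM events
    WHERE event_date >= current_date - interval '7 days'
      AND event_date < current_date
)
SELECT
    count(*) AS today_count,
    (SELECT avg_count FROM stats) AS avg_count
FROM events
WHERE event_date = current_date
HAVING count(*) < (SELECT avg_count FROM stats) * 0.5;
```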
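
On the orchestrator side, the same comparison can be turned into a task that fails loudly. The sketch below assumes the daily counts have already been fetched from the warehouse; `volume_alert`, `run_volume_check`, and the 0.5 threshold are hypothetical names and values for illustration, not part of Airflow or Dagster.

```python
def volume_alert(today_count: int, prior_daily_counts: list[int],
                 threshold: float = 0.5) -> bool:
    """Return True when today's volume falls below threshold * trailing average."""
    if not prior_daily_counts:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = sum(prior_daily_counts) / len(prior_daily_counts)
    return today_count < baseline * threshold


def run_volume_check(today_count: int, prior_daily_counts: list[int]) -> None:
    """Raise on an anomaly so the orchestrator marks the task as failed."""
    if volume_alert(today_count, prior_daily_counts):
        raise ValueError(f"Volume anomaly: only {today_count} rows loaded today")


# Hypothetical counts: a steady 1,000 rows per day, then a sharp drop.
run_volume_check(900, [1000] * 7)      # within range, returns quietly
try:
    run_volume_check(400, [1000] * 7)  # below 50% of the average
except ValueError as err:
    print(err)
```

Raising an exception is a deliberate choice here: most orchestrators treat an unhandled exception as a task failure, which lets existing retry and alerting rules apply.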

Lineage: The Impact Analysis Tool

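Lineage models your data's journey as a directed acyclic graph (DAG): the nodes are datasets and jobs, and the edges are the dependencies between them. Traversing it answers two questions: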

- **Upstream Lineage:** "My dashboard is wrong. Which table fed it?"

- **Downstream Lineage:** "I want to delete this column. Who will I break?"
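
Both questions reduce to graph traversal. The following is a minimal sketch over a toy, table-level graph; the table names and edges are hypothetical, and real lineage tools usually track columns as well as tables.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the tables or dashboards in its list.
DOWNSTREAM = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["daily_sales"],
    "stg_customers": ["daily_sales"],
    "daily_sales": ["revenue_dashboard"],
}


def invert(edges: dict) -> dict:
    """Flip the edge direction so the same traversal can walk upstream."""
    inverted = {}
    for parent, children in edges.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return inverted


def reachable(edges: dict, start: str) -> set:
    """Breadth-first search: every node reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Upstream: which tables could have caused a wrong dashboard?
print(sorted(reachable(invert(DOWNSTREAM), "revenue_dashboard")))
# Downstream: what is affected if stg_orders changes?
print(sorted(reachable(DOWNSTREAM, "stg_orders")))
```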

**Practitioner Tip:** Use **OpenLineage** to capture this metadata automatically from Spark, Airflow, and Flink jobs.

The "Silent Failure" Trap

One of the most dangerous data bugs is the **"Distribution Shift."** If your ML model expects a value between 0 and 1, but a source-system change starts sending values between 0 and 100, your pipeline won't crash, but your model's predictions will be nonsense.

**Fix:** Monitor the **Mean** and **Standard Deviation** of critical columns, and alert when they move outside their historical range.
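
A rough way to act on this is to compare each new batch against a trusted baseline sample. The sketch below flags a batch whose mean sits more than `max_z` baseline standard deviations from the baseline mean. The function name, the threshold of 3.0, and the sample values are assumptions for illustration, and this is a crude heuristic rather than a formal statistical test.

```python
from statistics import mean, stdev


def drift_report(baseline: list[float], current: list[float],
                 max_z: float = 3.0) -> dict:
    """Measure how far the current mean is from the baseline mean, in baseline stds."""
    base_mean, base_std = mean(baseline), stdev(baseline)
    cur_mean = mean(current)
    shift = abs(cur_mean - base_mean)
    if base_std > 0:
        z = shift / base_std
    else:
        # Constant baseline: any change at all counts as an unbounded shift.
        z = 0.0 if shift == 0 else float("inf")
    return {
        "baseline_mean": base_mean,
        "current_mean": cur_mean,
        "z": z,
        "drifted": z > max_z,
    }


# Hypothetical feature that should lie in [0, 1] but starts arriving as 0-100.
baseline = [0.2, 0.4, 0.5, 0.6, 0.8]
print(drift_report(baseline, [0.3, 0.5, 0.55, 0.7])["drifted"])
print(drift_report(baseline, [20.0, 40.0, 50.0, 60.0, 80.0])["drifted"])
```

The same comparison can be applied to the standard deviation itself, or to quantiles, when a shift changes the spread of a column without moving its mean.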

Further Reading

- [DataMeshArchitecture](DataMeshArchitecture) — Decentralized data ownership and observability.

- [DataQualityFrameworks](DataQualityFrameworks) — Comparing Soda, Great Expectations, and Monte Carlo.

- [DistributedTracing](DistributedTracing) — Monitoring the services that generate the data.