Data Observability

Software observability asks "Is the server up?" Data observability asks "Is the data correct?" In a modern data stack, a pipeline can be 100% "healthy" according to your orchestrator while delivering 100% "garbage" data to your downstream dashboards.

The Five Pillars of Data Observability

Freshness: Is the data up to date? (e.g., Has the daily_sales table been updated in the last 24 hours?)
Distribution: Are the values within expected ranges? (e.g., Is the null_rate for user_email suddenly 50%?)
Volume: Did we get the expected number of rows? (e.g., A 90% drop in ingestion volume usually points to a source-system failure or a broken extract.)
Schema: Did an upstream producer change a column name or type without notice?
Lineage: When a metric breaks, which upstream table caused it?
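
As a rough illustration of the Freshness and Schema pillars, the checks can be reduced to two small functions. This is a minimal sketch, not a reference implementation: the function names, the column-to-type dictionaries, and the idea that you already have a table's last-update timestamp are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def is_fresh(last_updated: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True if the table was updated within max_age of now."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age


def schema_drift(expected: Dict[str, str],
                 actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare column -> type mappings and list what changed."""
    shared = expected.keys() & actual.keys()
    return {
        "missing": sorted(expected.keys() - actual.keys()),  # dropped or renamed
        "added": sorted(actual.keys() - expected.keys()),    # new or renamed
        "retyped": sorted(c for c in shared if expected[c] != actual[c]),
    }


# Hypothetical usage: daily_sales should be at most 24 hours old.
last_load = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
check_time = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
stale = not is_fresh(last_load, timedelta(hours=24), now=check_time)
```

In practice the expected schema would come from a stored snapshot or a data contract, and the actual schema from the warehouse's information schema; both sources are outside the scope of this sketch.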

Implementing Automated Quality Checks

Do not rely on manual audits. Use dbt tests or Great Expectations to enforce quality at the pipeline level.

Example: dbt Data Test

# schema.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
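
To make the intent of these three tests concrete, here is a plain-Python sketch of the same rules applied to in-memory rows. It is not how dbt runs tests (dbt compiles them to SQL against the warehouse); the function name `check_orders` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def check_orders(rows: List[Dict[str, Any]],
                 accepted_statuses: Iterable[str]) -> Dict[str, list]:
    """Return the violations for each rule; empty lists mean the check passed."""
    ids = [row.get("order_id") for row in rows]
    id_counts = Counter(i for i in ids if i is not None)
    # NULL statuses are skipped here, mirroring how a SQL NOT IN filter
    # does not match NULL values.
    statuses = {row["status"] for row in rows if row.get("status") is not None}
    return {
        "not_null": [idx for idx, value in enumerate(ids) if value is None],
        "unique": sorted(key for key, n in id_counts.items() if n > 1),
        "accepted_values": sorted(statuses - set(accepted_statuses)),
    }


# Hypothetical sample data with one violation of each rule.
sample = [
    {"order_id": 1, "status": "placed"},
    {"order_id": 1, "status": "shipped"},   # duplicate order_id
    {"order_id": None, "status": "lost"},   # null id, unknown status
]
report = check_orders(sample, ["placed", "shipped", "completed", "returned"])
```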

Drift Detection with SQL

You can implement basic observability with simple SQL checks that your orchestrator (for example, Airflow or Dagster) runs on a schedule.

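-- Check for volume anomaly: today's row count vs. the average of the previous 7 days
WITH stats AS (
    -- Baseline covers the 7 full days before today. Today is excluded so that
    -- a partial or failed load does not drag the average down.
    -- Dividing by 7.0 avoids integer division.
    SELECT count(*) / 7.0 AS avg_count
    FROM events
    WHERE event_date >= current_date - interval '7 days'
      AND event_date < current_date
)
SELECT
    count(*) AS today_count,
    (SELECT avg_count FROM stats) AS avg_count
FROM events
WHERE event_date = current_date
HAVING count(*) < (SELECT avg_count FROM stats) * 0.5; -- Returns a row (alert) only if today is < 50% of average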
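
The SQL check above returns a row only when today's volume is anomalous, so the orchestrator task still has to turn that result into a failure. One way to do that, sketched here under assumptions: the function names and the 0.5 threshold parameter are illustrative, and the counts are presumed to have been fetched already. Raising an exception is used because both Airflow and Dagster treat an uncaught exception as a failed task.

```python
from statistics import mean
from typing import Sequence


def volume_anomaly(today_count: int, prior_daily_counts: Sequence[int],
                   min_ratio: float = 0.5) -> bool:
    """Return True when today's count is below min_ratio of the trailing average."""
    if not prior_daily_counts:
        return False  # no baseline yet, so there is nothing to compare against
    return today_count < mean(prior_daily_counts) * min_ratio


def assert_volume_ok(today_count: int, prior_daily_counts: Sequence[int]) -> None:
    """Raise so the calling task fails and triggers its usual alerting."""
    if volume_anomaly(today_count, prior_daily_counts):
        raise ValueError(
            f"Volume anomaly: {today_count} rows today vs. "
            f"{mean(prior_daily_counts):.0f} average over prior days"
        )
```

A fixed 50% threshold is simple but blind to weekly seasonality; comparing against the same weekday in earlier weeks is a common refinement.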

Lineage: The Impact Analysis Tool

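Lineage is a directed acyclic graph (DAG) of your data's journey: each node is a dataset or job, and each edge records that one feeds another. It answers two questions: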

Upstream Lineage: "My dashboard is wrong. Which table fed it?"
Downstream Lineage: "I want to delete this column. Who will I break?"
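
Both questions are graph traversals over the lineage DAG. The sketch below uses a breadth-first search over a hand-written edge map; the table names and the `EDGES` structure are hypothetical, and a real system would load the edges from a lineage store rather than hard-code them.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: each key feeds the datasets in its list.
EDGES: Dict[str, List[str]] = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["fct_sales"],
    "stg_customers": ["fct_sales"],
    "fct_sales": ["sales_dashboard"],
}


def downstream(node: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Everything that directly or transitively depends on node."""
    seen: Set[str] = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(node: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Everything that node directly or transitively depends on."""
    reversed_edges: Dict[str, List[str]] = {}
    for parent, children in edges.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream(node, reversed_edges)


suspects = upstream("sales_dashboard", EDGES)   # tables to inspect when the dashboard is wrong
blast_radius = downstream("stg_orders", EDGES)  # assets affected by a change to stg_orders
```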

Practitioner Tip: Use OpenLineage to capture this metadata automatically from Spark, Airflow, and Flink jobs.

The "Silent Failure" Trap

The most dangerous data bug is the distribution shift, because nothing fails visibly. If your ML model expects a value between 0 and 1, but a source-system change starts sending values between 0 and 100, your pipeline won't crash, but your model's predictions will be nonsense. Fix: Monitor the mean and standard deviation of critical columns, and alert when either moves sharply away from its historical baseline.
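
A minimal sketch of that monitoring, assuming you keep a baseline sample of the column and compare each new batch against it. The function name and both thresholds are illustrative choices, not established defaults, and measuring the batch mean in units of the baseline standard deviation is a crude heuristic rather than a formal statistical test.

```python
from statistics import mean, stdev
from typing import Sequence


def distribution_shift(baseline: Sequence[float], current: Sequence[float],
                       max_z: float = 3.0, max_std_ratio: float = 3.0) -> bool:
    """Flag a batch whose mean or spread departs sharply from the baseline.

    Both samples need at least two values, since stdev is undefined otherwise.
    """
    base_mean, base_std = mean(baseline), stdev(baseline)
    cur_mean, cur_std = mean(current), stdev(current)
    if base_std == 0:
        # A constant baseline: any change at all counts as a shift.
        return cur_mean != base_mean or cur_std != 0
    mean_shifted = abs(cur_mean - base_mean) / base_std > max_z
    ratio = cur_std / base_std
    std_shifted = ratio > max_std_ratio or ratio < 1 / max_std_ratio
    return mean_shifted or std_shifted


# Hypothetical example: a score that used to be 0-1 now arrives as 0-100.
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rescaled_scores = [value * 100 for value in baseline_scores]
alert = distribution_shift(baseline_scores, rescaled_scores)
```

The mean and standard deviation catch rescaling and unit changes like the one above, but they can miss shifts that preserve both statistics, so treat this as a first line of defense.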