Change Data Capture (CDC)

Change Data Capture (CDC) is a technique for observing and capturing changes made to a database and delivering them as real-time events to downstream systems. Unlike polling-based methods, modern CDC is **log-based**, directly reading the database's internal transaction logs (e.g., PostgreSQL WAL, MySQL Binlog).

Why Log-Based CDC?

1. **Low Latency**: Changes are captured near-instantly after a commit.

2. **Zero Impact on Schema**: No need for `last_modified` columns or triggers that slow down production writes.

3. **Capture Deletes**: Polling cannot detect hard deletes; log-based CDC captures the `DELETE` event from the transaction log.

4. **Consistency**: Captures every state change, ensuring no intermediate updates are missed (critical for financial audit trails).

The Debezium Architecture

Debezium is the industry-standard open-source platform for CDC. It typically runs as a set of connectors within **Kafka Connect**.

- **Source Connector**: Connects to the source DB (e.g., PostgreSQL) using a replication slot. It reads the WAL, parses the binary data, and produces JSON or Avro messages to a Kafka topic.

- **Topic per Table**: By default, Debezium creates one Kafka topic for every table being tracked (e.g., `dbserver1.inventory.orders`).

- **Schema Registry**: Highly recommended to use Avro/Protobuf with a schema registry to handle source schema changes without breaking consumers.

Concrete Example: Debezium PostgreSQL Connector Config

To capture changes from a PostgreSQL database, you must set `wal_level = logical` in `postgresql.conf` and provide a connector configuration.

**JSON Configuration for Kafka Connect**:

```json

{

"name": "inventory-connector",

"config": {

"connector.class": "io.debezium.connector.postgresql.PostgresConnector",

"tasks.max": "1",

"database.hostname": "postgres-db",

"database.port": "5432",

"database.user": "debezium",

"database.password": "dbz",

"database.dbname": "inventory",

"database.server.name": "dbserver1",

"table.include.list": "public.orders,public.customers",

"plugin.name": "pgoutput",

"publication.autocreate.mode": "filtered",

"slot.name": "debezium_fulfillment_slot",

"snapshot.mode": "initial"

}

}

```

The Payload Structure

A Debezium event contains a `before` and `after` block.

```json

{

"op": "u",

"before": { "id": 1, "status": "PENDING" },

"after": { "id": 1, "status": "SHIPPED" },

"source": { "ts_ms": 1716200000000, "snapshot": "false" }

}

```

Advanced Patterns

- **Outbox Pattern**: Instead of capturing changes from a business table, the application writes a specific event record to an `outbox` table in the same transaction. CDC tracks only the `outbox` table, ensuring that the event is only published if the primary transaction succeeds.

- **Dead Letter Queues (DLQ)**: If a consumer cannot parse a CDC event (e.g., due to an unexpected schema change), the record is routed to a DLQ for manual inspection, preventing the entire pipeline from stalling.

- **Materialized Views**: Using CDC to keep a downstream search engine (Elasticsearch) or cache (Redis) in sync with the relational source of truth.

Summary of Technical implementation added

- Explained the mechanics of **Log-Based CDC** vs. polling.

- Detailed the **Debezium + Kafka Connect** architecture.

- Provided a concrete **JSON configuration** for a PostgreSQL connector.

- Illustrated the **before/after payload** structure.

- Introduced the **Outbox Pattern** and **Materialized Views** as advanced use cases.