Batch vs. Stream Processing: Architectural Patterns

The choice between batch and stream processing defines the latency and consistency guarantees of a data platform. While batch processing optimizes for throughput and historical completeness, stream processing optimizes for low-latency insight. Modern architectures attempt to unify these paradigms.

1. Batch Processing: The Throughput Specialist

Batch processing operates on high-latency, large-volume data blocks. It is characterized by bounded datasets where the "end" of the data is known.

Strengths: High efficiency via parallelization, reliable error recovery (re-running a job), and deep historical analysis.
Weaknesses: High latency (minutes to hours).

2. Stream Processing: The Latency Specialist

Stream processing operates on unbounded data streams, processing events as they arrive.

Strengths: Sub-second latency, real-time alerting, and continuous state updates.
Weaknesses: Complexity in handling late-arriving data and state management.

3. Distributed Architectures: Lambda vs. Kappa

3.1 Lambda Architecture

The Lambda architecture attempts to provide both low-latency views and highly accurate historical views by running two separate pipelines:

Batch Layer: The "source of truth." It stores all raw data and periodically computes complex, accurate views (e.g., via Spark/MapReduce).
Speed Layer: Processes recent data in real-time to provide low-latency views (e.g., via Flink/Storm). It compensates for the lag of the batch layer.
Serving Layer: Merges results from both layers to answer queries.

Criticism: The main drawback is logic duplication. Developers must write and maintain the same transformation logic in two different systems (e.g., Java for the speed layer and SQL/Scala for the batch layer).

3.2 Kappa Architecture

Proposed by Jay Kreps, the Kappa architecture simplifies the system by removing the batch layer entirely. Everything is treated as a stream.

Stream Layer: A single pipeline (e.g., Apache Flink or Kafka Streams) handles both real-time processing and historical re-processing.
Re-processing: To "re-run" a batch job, the system simply re-plays historical data from the beginning of the stream (using a high-retention log like Kafka).

Advantage: Single code base for all data processing.

4. The "Exactly-Once" Semantics Challenge

Achieving exactly-once semantics (EOS) in distributed systems is difficult due to network failures and retries.

4.1 At-Least-Once vs. Exactly-Once

At-Least-Once: Events are guaranteed to be processed, but may be processed multiple times due to retries. Requires idempotency at the sink to avoid duplicate data.
Exactly-Once: Events are processed and reflected in the final state exactly once, even if failures occur.

4.2 Mechanisms for Exactly-Once

Checkpointing (Flink): Periodic snapshots of the entire distributed state. On failure, the system rolls back to the last successful checkpoint.
Transactional Writes: Using two-phase commit (2PC) between the stream processor and the sink (e.g., Kafka's transactional producer).
Idempotency: Designing the processing logic such that $f(x) = f(f(x))$ . For example, using UPSERT into a database based on a unique event ID rather than INSERT.

5. Watermarking and Windowing

In streaming, data often arrives out of order.

Windowing: Grouping events by time (Sliding, Tumbling, or Session windows).
Watermarks: A mechanism to signal that no more events with a timestamp earlier than $T$ are expected. It allows the system to close a window and emit results despite the unbounded nature of the stream.

Summary: When to Use Which?

Use Batch: For financial reconciliation, historical reporting, and large-scale model training.
Use Stream: For fraud detection, real-time dashboards, and sensor monitoring.
Use Kappa: When you want to simplify operations and your stream processor can handle the historical volume.