Batch Processing Patterns

Batch processing: handle a large set of items in a controlled, repeatable way. Daily reports, ETL, bulk migrations, monthly billing. The needs differ from request-driven work: throughput matters more than latency; failures need restart; idempotency is critical.

This page covers the patterns.

When batch is the right tool

Periodic data work

Daily aggregations, weekly reports, monthly reconciliations. Run on a schedule.

Bulk migrations

One-time data movement: between databases, between formats, schema upgrades.

Large API operations

Processing millions of records that came in via streaming or upload.

Background analysis

ML training, statistical analysis, anything compute-heavy that's not user-facing.

When batch is wrong

Real-time needs

User wants the result now. Batch is too slow.

Stream processing fits

Continuous data flow. See streaming alternatives.

Tiny scale

Hundreds of items, processable in seconds. Skip batch infrastructure; just loop.

Core patterns

Chunking

Don't process everything at once; process in chunks of (say) 1000 items.

```python

def process_all(items, chunk_size=1000):

for chunk in chunked(items, chunk_size):

process_chunk(chunk)

commit()

```

Why:

- Memory: doesn't hold all items

- Resilience: failure in one chunk doesn't lose others

- Progress: see how far along you are

Parallelism

Process chunks in parallel. Multiple workers; each takes a chunk.

```python

with ThreadPoolExecutor(max_workers=10) as executor:

futures = [executor.submit(process_chunk, c) for c in chunks]

results = [f.result() for f in futures]

```

For CPU-bound work, multiprocessing or distributed workers (Spark, Dask).

Checkpointing

For long-running batches, save progress. On restart, resume from last checkpoint.

```python

checkpoint = load_checkpoint()

for chunk in chunks_after(checkpoint):

process_chunk(chunk)

save_checkpoint(chunk.id)

```

Otherwise: 4-hour job fails 3 hours in; restart from scratch.

Idempotency

Items may be processed twice (retry, restart). The result must be the same.

Use UPSERT instead of INSERT; use deduplication keys; design idempotent operations. See [IdempotencyPatterns](IdempotencyPatterns).

Bounded concurrency

Don't unleash unlimited workers on a database. Cap concurrency:

```python

semaphore = Semaphore(20)

def process_with_limit(item):

with semaphore:

process(item)

```

Or use a worker pool. Saves the database from being crushed.

Failure handling

Per-item retries

If processing one item fails, retry that item; continue with the rest.

Per-chunk retries

If a chunk fails, retry the chunk. Items already processed (in earlier chunks) don't repeat.

Skip failed items

After N retries, skip the item; record the failure; continue.

```python

failed_items = []

for item in items:

try:

process(item)

except Exception as e:

failed_items.append((item, e))

log.warning(f"Failed: {item}")

if failed_items:

write_failure_report(failed_items)

```

Failed items get a separate report; investigate later.

All-or-nothing

For some operations, either all items process or none. Wrap in a transaction; rollback on failure.

Risk: large transactions can be expensive. Often the chunked approach is better, accepting partial completion.

Frameworks

Simple loop

For most batch needs, a Python script with chunking is fine. No framework needed.

Spring Batch (Java)

For Java enterprise batch. Provides chunking, retries, restart, monitoring.

Heavyweight; useful for complex batch jobs.

Apache Spark

For very large data (TB+). Distributed processing; SQL-like API.

Heavyweight; useful for analytics, ETL at scale.

Apache Beam / Flink

Streaming-batch unified. For pipelines that handle both modes.

Cloud-native

AWS Step Functions + Lambda; GCP Workflows; Azure Logic Apps. Coordination and execution managed by cloud.

For most batch jobs, simple scripts work. Frameworks earn their place at scale or for specific operational needs.

Specific patterns

Producer-consumer

Producer reads input; consumer processes. Buffered queue between them.

Decouples reading from processing; throughput limited by slowest part.

Pipeline

Stage 1 → Stage 2 → Stage 3. Each stage processes; passes to next.

For multi-step batch transformations.

Fan-out / fan-in

Distribute work across many workers (fan-out); collect results (fan-in).

For parallel processing where the result needs to be aggregated.

Sliding window

For time-series batch processing. Window of recent data; slides forward.

Operational concerns

Monitoring

Per-batch metrics:

- Items processed

- Items failed

- Duration

- Memory usage

Without monitoring, batches that silently slow or fail are invisible.

Alerting

Batch didn't run. Batch ran but didn't complete. Batch completed but failed many items.

Each is different signal; alarm on each.

Resource isolation

Batch jobs can crush shared resources. Database; rate limits; CPU.

Run in dedicated environment or with rate limiting.

Idempotency vs. transactional integrity

Long-running batches with database commits can leave partial state on failure. Strategies:

- Idempotent processing (re-run produces same final state)

- Resumability (checkpoint; restart from last good state)

- Compensation (track partial state; rollback on failure)

Common failure patterns

- **No chunking.** Memory exhaustion or slow restart.

- **No checkpointing.** Long jobs restart from scratch.

- **Unbounded parallelism.** Database overload.

- **Per-item exceptions stop everything.** Bad item kills batch.

- **No idempotency.** Re-running causes duplicates or wrong totals.

- **No monitoring.** Failures invisible.

Further Reading

- [BackgroundJobProcessing](BackgroundJobProcessing) — Adjacent pattern

- [MapReduceParadigm](MapReduceParadigm) — Foundational batch model

- [DataPipelineDesign](DataPipelineDesign) — Where batch fits in data engineering

- [IdempotencyPatterns](IdempotencyPatterns) — Required for batch