Background Job Processing

Background jobs are tasks that don't fit in a request-response cycle: send email, generate PDF, process upload, run report. The HTTP handler enqueues the job; a worker process executes it later.

This page covers the patterns for reliable background job systems.

When you need background jobs

Long-running tasks

Email sending, file processing, ML inference. Anything that takes more than a couple seconds shouldn't block the HTTP response.

Retry-able failures

Tasks that need retries (third-party API failures, eventual consistency).

Decoupling

The web tier and worker tier scale independently. Heavy job processing doesn't slow web responses.

Scheduled work

Periodic tasks: nightly cleanup, daily reports.

The basic architecture

HTTP request → Enqueue job → 200 OK (immediate)

Background:
Worker → Pull job from queue → Execute → Mark done (or retry)

Components:

Queue: Redis, RabbitMQ, SQS, Kafka
Workers: long-running processes that consume from the queue
Job results: stored separately if needed

Queue technology

Redis-backed (Sidekiq, BullMQ, RQ)

Simple; fast; relatively easy to operate. Limited durability if Redis fails.

Database-backed

The job table is in your application database. Simpler architecture; no extra infrastructure.

Limitations: doesn't scale to high job volume; database load.

Dedicated message queue (RabbitMQ, NATS)

Built for queueing. More features (routing, dead-letter, etc.). More infrastructure to run.

Cloud-managed (SQS, Pub/Sub, Service Bus)

Managed by cloud provider. SQS for AWS; Pub/Sub for GCP; Service Bus for Azure.

For most cloud-native shops, the managed option is right.

Stream-based (Kafka)

For high throughput with replay needs. More complex than queues.

Worker design

Long-running

Worker process pulls job; executes; pulls next. Not per-job process.

Concurrency

Workers process many jobs in parallel. Configurable concurrency per worker.

Multiple workers

Scale horizontally. Many workers consuming the same queue.

Worker isolation

Each worker process is independent. One worker dying doesn't affect others.

Reliability patterns

Idempotency

Jobs may be retried. The same job running twice should produce the same result.

Use idempotency keys, dedup logic, or idempotent operations. See IdempotencyPatterns.

Retries with backoff

Job fails; retry with exponential backoff. Don't retry forever; eventually give up.

Typical: 3-5 retries; backoff 1m, 5m, 15m, 1h, 6h.

Dead letter queues

Jobs that fail all retries go to a DLQ. Investigate; either fix and re-run, or accept failure.

See DeadLetterQueuePatterns.

Visibility timeout

When a worker pulls a job, it has a timeout to complete. If timeout exceeded, job becomes available again — another worker picks it up.

Prevents lost jobs from worker crashes. But: jobs longer than the timeout get processed twice (which is why idempotency matters).

At-least-once vs. exactly-once

Most queues are at-least-once. Jobs may run more than once; consumers handle duplicates.

"Exactly-once" is rare and expensive. Don't promise it; design for at-least-once.

Job design

Small jobs

Each job does one thing. Easier to retry; easier to reason about; easier to debug.

Pass IDs, not data

Job: {type: "process_upload", upload_id: "abc"} not the entire upload data. The worker fetches the data fresh; data doesn't go stale in queue.

Fast jobs

Long jobs are problematic: visibility timeouts; lost progress on crash; harder to retry.

If a job naturally takes hours, decompose into smaller jobs.

Versioning

Jobs in the queue may run with newer or older worker code. Design for compatibility:

Add new fields; don't remove
Default values for missing fields
Reject incompatible jobs explicitly

Specific frameworks

Sidekiq (Ruby)

Dominant for Ruby. Redis-backed.

BullMQ (Node.js)

Modern Node.js. Redis-backed.

RQ, Celery (Python)

Celery is more feature-rich; RQ is simpler. Both Redis-backed by default.

Spring Batch (Java)

For batch processing. See BatchProcessingPatterns.

Quartz (Java)

Java scheduling. Older but mature.

Cloud-native

AWS SQS + Lambda; GCP Pub/Sub + Cloud Functions. Serverless workers.

Common patterns

Job classes vs. raw queue

Frameworks define job classes:

class ProcessUploadJob
  def perform(upload_id)
    # work
  end
end

ProcessUploadJob.perform_async(upload_id)

Cleaner than manually serializing/deserializing.

Priority queues

Some jobs are higher priority. Multiple queues with different priorities; workers pull from high-priority first.

Concurrency limits per type

Don't run 1000 video transcodes simultaneously; not 1000 emails at once. Per-job-type concurrency limits.

Job lifecycle hooks

Before/after job hooks. Logging; metrics; cleanup.

Cron-style schedules

Periodic jobs. Many job frameworks have cron-like scheduling. See ScheduledTaskManagement.

Common failure patterns

Slow jobs without segmentation. A few slow jobs hold all worker capacity.
No idempotency. Retries cause duplicates.
No DLQ. Failed jobs lost forever.
Workers not monitored. Jobs queued; never processed; nobody notices.
Job data in queue not service. Stale data when job runs.
Long-running jobs without checkpointing. Crash means restart from scratch.

A reasonable starter

For new applications:

Pick a queue technology (cloud-managed for most cases)
Pick a job framework for your language
All jobs idempotent
Standard retry policy (3-5 retries, exponential backoff)
DLQ for failed-completely jobs
Monitoring on queue depth, worker count, job latency
Alerting on DLQ growth and stuck jobs