Dead Letter Queue Patterns

A dead letter queue (DLQ) is where messages go when they fail repeatedly. Without a DLQ, repeatedly-failing messages either retry forever (consuming resources) or get dropped silently (data loss).

The DLQ is the safety net. Done well, it surfaces failures for human investigation. Done poorly, it accumulates without anyone looking, hiding the same problems.

How it works

Queue → Worker tries to process → Failure → Retry → ... → 
Max retries exceeded → Message goes to DLQ

The original queue keeps moving; the failed message is preserved separately for investigation.

What goes in a DLQ

Messages with permanent errors

Bug in worker code that always fails on this message. Retry won't help.

Messages that timeout

Worker takes too long; visibility timeout fires; eventually exhausts retries.

Messages with malformed data

Worker can't parse; rejects; retries fail the same way.

Resource-exhaustion failures

Worker runs out of memory; retries hit the same wall.

DLQ implementation

Cloud-managed queues

AWS SQS:

Source queue → max_receives = 5 → DLQ

After 5 failed receives, message moves to DLQ automatically.

Similar in GCP Pub/Sub, Azure Service Bus.

Self-managed

Implement in worker code:

try:
    process(message)
except Exception as e:
    if message.retry_count < MAX_RETRIES:
        requeue(message, delay=backoff(message.retry_count))
    else:
        dlq.send(message, error=str(e))

Less convenient; more flexible.

Database-backed

The "DLQ" is a database table for failed jobs:

INSERT INTO failed_jobs (job_data, error, attempts, failed_at)
VALUES (...);

Queryable; investigatable; retentioned.

What to do with DLQ contents

Investigate

Why did this message fail? Look at:

The message content
The error message
When it happened
How many retries

Common causes:

Bug in worker code
Bad message data
External dependency failure

Categorize

Group DLQ messages by error pattern. Many similar errors usually mean one bug; one fix resolves many messages.

Replay

After fixing the bug, re-process the messages.

for message in dlq.read_all():
    main_queue.send(message)

For some failures, replay won't help (message is fundamentally bad). Discard those after deliberate decision.

Discard

Some messages should never have been processed. Discard explicitly with logging:

log.info(f"Discarding bad message: {message.id}")
dlq.delete(message)

Operational practices

Alert on DLQ growth

A DLQ that's accumulating is signaling a problem. Alert when:

DLQ depth exceeds threshold
DLQ grew significantly in the last hour
DLQ has messages older than X days

Alert noise: alert on patterns, not on every individual message.

Periodic review

Weekly or daily: how many messages in DLQ? Investigate. Fix or discard.

A DLQ left to accumulate becomes useless — too many messages to triage.

Retention policy

DLQ messages have storage cost. Set retention:

Keep for N days
Periodically archive
Eventually delete

Without retention, DLQ grows forever.

Documentation

Each DLQ has documentation: what's its source queue, what does success vs. failure look like, who owns triage?

Specific patterns

Tiered DLQs

Multiple DLQs by failure type:

Transient errors: retry-able later
Permanent errors: needs investigation
Malformed: parsing failed

Different triage process per tier.

DLQ with context

Don't just store the failed message. Store also:

Error message
Stack trace
Worker version
Timestamp
Retry count

Investigation is much easier with context.

Replay after fix

After fixing the bug, replay all messages in DLQ. Be careful: re-running them may have side effects (notifications, billing). Verify before mass replay.

Selective replay

Sometimes only some messages should replay. Filter:

for message in dlq.read_all():
    if message.error_type == "transient_network_error":
        main_queue.send(message)
    else:
        log.info(f"Skipping non-replayable: {message.id}")

Notification on first failure of new type

When a new error pattern appears in DLQ, page the on-call. New errors usually mean new bugs.

Common failure patterns

No DLQ. Failed messages retry forever or get lost.
DLQ but no monitoring. Accumulates silently.
No triage process. DLQ has 100K messages; nobody looks.
Retention too short. Messages deleted before investigation.
Replay without thought. Side effects multiplied.
DLQ for non-DLQ purposes. "Error queue" used as a general dump.

A reasonable approach

For new queue-based systems:

DLQ on every queue (typically max_receives = 5)
Monitoring: depth, growth rate, message age
Alerting on growth
Documented triage process
Retention policy (typically 14-30 days)
Periodic review (weekly minimum)