Dead Letter Queues & Retry Patterns
Handling message failures gracefully: retry with exponential backoff, dead letter queues, poison message handling, and idempotency.
What Happens When Message Processing Fails?
Consumer failures are inevitable. A database might be temporarily unavailable, a downstream API may rate-limit you, or a bug may cause an unhandled exception. If a consumer negatively acknowledges a message or fails to ack before the visibility timeout expires, the broker re-delivers it. Without a strategy, a single bad message can trap a consumer in an infinite loop of retry failures — blocking all other messages in the queue.
Retry with Exponential Backoff
The first line of defense is retry with exponential backoff and jitter. Instead of retrying immediately (which hammers a failing dependency), wait an exponentially increasing interval between attempts. Jitter (random offset) prevents the thundering herd problem where hundreds of consumers all retry simultaneously after a shared dependency recovers.
```python
import time
import random

def process_with_retry(message, max_retries=5):
    for attempt in range(max_retries):
        try:
            process_message(message)
            return  # success
        except TransientError as e:
            if attempt == max_retries - 1:
                raise  # exhausted retries
            base_delay = 2 ** attempt      # 1, 2, 4, 8, 16 seconds
            jitter = random.uniform(0, 1)  # random 0-1 second
            delay = base_delay + jitter
            print(f"Attempt {attempt + 1} failed. Retrying in {delay:.2f}s")
            time.sleep(delay)
        except PermanentError as e:
            # Don't retry permanent errors (e.g., invalid message format)
            send_to_dead_letter_queue(message, error=str(e))
            return
```
Distinguish transient from permanent errors
Not all errors should trigger retries. Transient errors (network timeout, rate limit, database connection refused) are worth retrying. Permanent errors (malformed message, schema validation failure, business rule violation) will never succeed — retry only wastes resources. Send permanent failures directly to the DLQ.
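The `TransientError` and `PermanentError` types used in the retry snippet are not standard library exceptions; a minimal sketch of how they might be defined, with an illustrative mapping from lower-level exceptions (the specific mappings are assumptions, not a universal rule):

```python
class TransientError(Exception):
    """Worth retrying: the same call may succeed later."""

class PermanentError(Exception):
    """Retrying can never succeed; route straight to the DLQ."""

def classify(exc: Exception) -> type:
    """Map a low-level exception to a retry decision (illustrative mapping)."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return TransientError   # network hiccups: retry
    if isinstance(exc, (ValueError, KeyError)):
        return PermanentError   # malformed/unexpected data: DLQ
    # Unknown errors: treat as transient, bounded by the max retry count
    return TransientError
```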
Dead Letter Queues (DLQ)
A Dead Letter Queue is a special queue where messages are routed after they have exceeded the maximum retry count or have been explicitly rejected as unprocessable. The DLQ acts as a quarantine — messages are not lost, but they stop blocking the main queue. An engineer can then inspect DLQ messages, fix the bug or data issue, and replay them.
DLQ in Practice: SQS and RabbitMQ
In Amazon SQS, you configure a DLQ on the main queue with a `maxReceiveCount`. After a message has been received that many times without a successful ack, SQS automatically moves it to the DLQ. In RabbitMQ, you set `x-dead-letter-exchange` on the queue declaration; messages that are rejected or nacked without requeue, exceed `x-message-ttl`, or overflow `x-max-length` are routed to that dead-letter exchange.
| System | DLQ Configuration | Key Setting |
|---|---|---|
| Amazon SQS | Set DLQ ARN on source queue | maxReceiveCount (e.g., 5) |
| RabbitMQ | x-dead-letter-exchange on queue declaration | x-max-length or nack without requeue |
| Kafka | App-level: catch exceptions, produce to DLQ topic | No native DLQ — implement in consumer |
| Azure Service Bus | Built-in dead-letter sub-queue | MaxDeliveryCount (default 10) |
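As a sketch of the SQS row above: the redrive policy is just a JSON-encoded attribute on the source queue. The queue URL and ARN below are placeholders, and actually applying the attribute requires `boto3` and AWS credentials:

```python
import json

def redrive_policy(dlq_arn: str, max_receive_count: int = 5) -> str:
    # SQS expects the RedrivePolicy attribute as a JSON-encoded string
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })

# Applying it with boto3 (illustrative; needs real credentials and ARNs):
# sqs.set_queue_attributes(
#     QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",
#     Attributes={"RedrivePolicy": redrive_policy(
#         "arn:aws:sqs:us-east-1:123456789012:orders-dlq")},
# )
```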
Kafka doesn't have native DLQs
Kafka has no built-in dead-letter mechanism. The application must catch exceptions and explicitly produce failed messages to a separate DLQ topic (e.g., `orders-dlq`). Libraries like Spring Kafka's `DefaultErrorHandler` and Confluent's `DeadLetterPublishingRecoverer` provide this pattern out of the box.
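A hand-rolled version of that pattern might look like the sketch below, with the producer injected as a callable so the routing logic stays testable. The record shape, header names, and `produce` signature are assumptions for illustration, not any particular client library's API:

```python
def handle_record(record, handler, produce, dlq_topic="orders-dlq"):
    """Process one record; on failure, publish it to the DLQ topic
    with error metadata instead of blocking the partition."""
    try:
        handler(record["value"])
        return True
    except Exception as exc:
        produce(
            topic=dlq_topic,
            value=record["value"],
            headers=[
                ("x-original-topic", record["topic"].encode()),
                ("x-error", str(exc).encode()),
            ],
        )
        return False
```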
Poison Message Handling
A poison message (or poison pill) is a message that will always cause a consumer to crash — typically due to a bug or corrupt/unexpected data. Without protection, a poison message causes an infinite crash-restart loop, starving all other messages. Mitigation strategies:
- Set a max receive count / max delivery count — Route to DLQ after N failures
- Schema validation at ingestion — Reject malformed messages before they enter the processing queue
- Catch all exceptions — Never let an unhandled exception cause message re-delivery without intent
- Circuit breaker on consumer — If error rate spikes, pause consumption to prevent cascade
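The last mitigation can be sketched as a consecutive-failure circuit breaker wrapped around the consume loop. The threshold and cooldown values, and the injectable clock, are illustrative choices:

```python
import time

class ConsumerCircuitBreaker:
    """Pause consumption after `threshold` consecutive failures;
    allow a probe attempt again after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def allow_consume(self):
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let one attempt through
        return self.clock() - self.opened_at >= self.cooldown
```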
Idempotency: The Cornerstone of Safe Retry
Since at-least-once delivery means duplicates are possible, consumers must be idempotent — processing the same message multiple times must have the same effect as processing it once. Approaches:
- Idempotency key in the database — Store the `eventId` in a processed-events table; check before processing. If already processed, skip and ack.
- Upsert instead of insert — `INSERT ... ON CONFLICT DO NOTHING` or `MERGE` avoids duplicate rows
- Natural idempotency — Some operations are inherently idempotent: setting a status field to `CONFIRMED` twice has no additional effect
- Conditional writes — Include expected state in the write condition (optimistic locking / conditional puts in DynamoDB)
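The idempotency-key approach can be sketched with SQLite; the table and column names are made up, and in production the key insert and the business write should share one transaction so a crash can't record the event as processed without its side effects:

```python
import sqlite3

def process_once(conn, event_id, handler):
    """Run handler only if event_id hasn't been seen; duplicates are skipped."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
        (event_id,),
    )
    if cur.rowcount == 0:
        return False  # duplicate delivery: already processed, just ack
    handler(event_id)
    conn.commit()
    return True
```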
Complete Retry Architecture
A production-grade retry strategy typically uses multiple retry queues with increasing delay — rather than sleeping inside the consumer, which ties up a thread. The message is sent to a `retry-30s` queue, consumed after a delay, retried, and if still failing, moves to `retry-5m`, then `retry-1h`, then DLQ.
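The escalation itself is just routing logic in the consumer; a sketch with illustrative queue names:

```python
RETRY_QUEUES = ["retry-30s", "retry-5m", "retry-1h"]

def next_queue(current_queue, main_queue="orders", dlq="orders-dlq"):
    """Where a failed message goes next in the escalation chain."""
    if current_queue == main_queue:
        return RETRY_QUEUES[0]
    idx = RETRY_QUEUES.index(current_queue)
    return RETRY_QUEUES[idx + 1] if idx + 1 < len(RETRY_QUEUES) else dlq
```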
Delayed retry with SQS
SQS supports a Delay Queue (0-900 second delay) and a message timer (per-message delay). For progressive backoff, use multiple SQS queues with increasing delays and routing logic in the consumer: failed → retry-queue-30s → retry-queue-5m → dlq.
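Because a message timer caps out at 900 seconds, any per-attempt backoff has to be clamped before re-enqueueing; a sketch (the `boto3` call is shown commented, with illustrative variable names):

```python
def sqs_delay_seconds(attempt: int, base: int = 2) -> int:
    """Exponential backoff clamped to SQS's 900-second DelaySeconds maximum."""
    return min(900, base ** attempt)

# Re-enqueue with a per-message timer (requires boto3 and a real queue URL):
# sqs.send_message(
#     QueueUrl=retry_queue_url,
#     MessageBody=body,
#     DelaySeconds=sqs_delay_seconds(attempt),
# )
```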
Interview Tip
Dead letter queues and retry patterns are 'reliability' questions in disguise — interviewers are testing whether you think about failure scenarios. Always mention: (1) exponential backoff with jitter, (2) DLQ for exhausted retries, (3) idempotency in consumers, (4) alerting on DLQ depth. Bonus points: mention the difference between transient and permanent errors, and that Kafka requires app-level DLQ implementation.