Retry with Exponential Backoff
Smart retry strategies: exponential backoff, jitter, max retries, idempotency requirements, and when NOT to retry.
The Case for Retries
Networks are unreliable. Packets get dropped, connections time out, and services experience transient blips caused by garbage collection pauses, brief overloads, or rolling restarts. A simple retry — trying the same request again after a failure — can transparently recover from the majority of these transient errors without the user ever noticing a problem.
The challenge is doing retries safely: not overwhelming an already struggling service, not retrying non-recoverable errors, and not retrying non-idempotent operations that may have already succeeded.
Exponential Backoff
A naive retry immediately re-sends the request, which can make an overloaded service worse. Exponential backoff increases the wait time between retries geometrically: wait 1 s, then 2 s, then 4 s, then 8 s, up to a cap. This gives the service time to recover and reduces aggregate load during a degraded period.
```typescript
// Minimal sleep helper used below.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: {
    maxRetries?: number;
    baseDelayMs?: number;
    maxDelayMs?: number;
    jitter?: boolean;
    retryOn?: (err: unknown) => boolean;
  } = {}
): Promise<T> {
  const {
    maxRetries = 3,
    baseDelayMs = 500,
    maxDelayMs = 30_000,
    jitter = true,
    retryOn = () => true,
  } = options;

  let attempt = 0;
  while (true) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries || !retryOn(err)) throw err;
      // Exponential backoff: 500 ms, 1000 ms, 2000 ms, ...
      let delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      // Add jitter: ±30% of the delay
      if (jitter) delay *= 0.7 + Math.random() * 0.6;
      await sleep(delay);
      attempt++;
    }
  }
}
```

Adding Jitter
Without jitter, all clients that received an error at roughly the same time will retry at the same exponentially-backed-off moment — creating a thundering herd that re-hammers the recovering service in synchronized waves. Jitter adds random noise to the backoff delay so retries are spread across time. AWS recommends full jitter (random between 0 and the current exponential delay) or decorrelated jitter (delay = random between base and prev_delay × 3, capped at the maximum).
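The two AWS-recommended strategies can be sketched as small delay functions; the names and parameters here are illustrative:

```typescript
// Full jitter: pick uniformly between 0 and the exponential delay.
function fullJitter(baseMs: number, attempt: number, capMs: number): number {
  const exp = Math.min(baseMs * 2 ** attempt, capMs);
  return Math.random() * exp;
}

// Decorrelated jitter: each delay is random between the base and
// 3x the previous delay, capped at the maximum.
function decorrelatedJitter(baseMs: number, prevDelayMs: number, capMs: number): number {
  const next = baseMs + Math.random() * (prevDelayMs * 3 - baseMs);
  return Math.min(next, capMs);
}
```

Full jitter gives the widest spread (and the best herd-breaking behavior in AWS's simulations); decorrelated jitter keeps some memory of the previous delay, so a client that has already backed off far tends to stay backed off.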
When NOT to Retry
Retrying Errors That Cannot Recover Makes Things Worse
Never retry: (1) most 4xx client errors (400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found) — the request itself is malformed or unauthorized, and retrying achieves nothing (429 Too Many Requests is the notable exception: back off and retry); (2) non-idempotent operations (POST to create a payment) unless idempotency keys are used; (3) when a circuit breaker is open — the circuit exists to stop retries during known outages.
| Error Type | Retry? | Rationale |
|---|---|---|
| Network timeout / connection reset | Yes | Transient network failure |
| 503 Service Unavailable | Yes (with backoff) | Service temporarily overloaded |
| 429 Too Many Requests | Yes (respect Retry-After header) | Rate limited — wait and retry |
| 500 Internal Server Error | Yes (cautiously) | May be transient |
| 400 Bad Request | No | Client error — fix the request |
| 401 / 403 | No | Auth error — retrying won't help |
| 404 Not Found | No | Resource does not exist |
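The table above can be encoded as a predicate and passed as the `retryOn` option of `retryWithBackoff`. The `HttpError` shape here is an assumption — adapt it to whatever your HTTP client throws:

```typescript
// Hypothetical error shape carrying an HTTP status code.
interface HttpError {
  status?: number;
}

// Classify retryability per the table: errors without a status
// (timeouts, connection resets) are treated as transient.
function isRetryable(err: HttpError): boolean {
  const status = err.status;
  if (status === undefined) return true; // network-level failure
  if (status === 429) return true;       // rate limited: back off and retry
  if (status >= 500) return true;        // 503 and other 5xx: cautiously retryable
  return false;                          // other 4xx: fix the request instead
}
```

Note that a production predicate would also honor the `Retry-After` header on 429/503 responses rather than relying purely on computed backoff.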
Idempotency Keys
To safely retry non-idempotent operations like payment charges, include an idempotency key — a unique ID generated by the client for each logical operation. The server stores the result keyed by this ID. On a retry, the server returns the stored result rather than re-executing the operation. Stripe, Braintree, and most payment APIs support idempotency keys via request headers like `Idempotency-Key: uuid-v4`.
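The server side of this scheme can be sketched as a lookup-before-execute wrapper. This is a minimal in-memory version for illustration; real implementations persist the store with a TTL and handle concurrent requests for the same key:

```typescript
// In-memory store of results keyed by idempotency key (illustrative only).
const storedResults = new Map<string, unknown>();

// Execute `operation` at most once per idempotency key; on a retry,
// replay the stored result instead of re-executing.
function handleIdempotent<T>(key: string, operation: () => T): T {
  if (storedResults.has(key)) {
    return storedResults.get(key) as T; // retry: return the original result
  }
  const result = operation();
  storedResults.set(key, result);
  return result;
}
```

The client's only obligation is to generate the key once per logical operation (e.g., a UUID created when the user clicks "Pay") and reuse it across all retries of that operation — generating a fresh key per retry defeats the mechanism.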
Retry Budgets
At large scale, even 3 retries per original request can multiply load by 4× during an outage. Google's SRE book recommends retry budgets: a client maintains a token bucket where each retry consumes one token, and the budget limits the ratio of retries to original requests (e.g., retries cannot exceed 10% of traffic). This prevents retries from amplifying failures into cascading overload.
Interview Tip
In interviews, always mention jitter when discussing retries — it signals you understand thundering-herd problems. Pair retries with circuit breakers: retries handle transient errors; circuit breakers stop retries when a service is systemically degraded. The sequence of resilience concerns should flow: retry transient errors → circuit break sustained failures → bulkhead to contain blast radius.