Retry with Exponential Backoff
Smart retry strategies: exponential backoff, jitter, max retries, idempotency requirements, and when NOT to retry.
The Case for Retries
Networks are unreliable. Packets get dropped, connections time out, and services experience transient blips caused by garbage collection pauses, brief overloads, or rolling restarts. A simple retry — trying the same request again after a failure — can transparently recover from the majority of these transient errors without the user ever noticing a problem.
The challenge is doing retries safely: not overwhelming an already struggling service, not retrying non-recoverable errors, and not retrying non-idempotent operations that may have already succeeded.
Exponential Backoff
A naive retry immediately re-sends the request, which can make an overloaded service worse. Exponential backoff increases the wait time between retries geometrically: wait 1 s, then 2 s, then 4 s, then 8 s, up to a cap. This gives the service time to recover and reduces aggregate load during a degraded period.
```typescript
// Minimal sleep helper used below.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: {
    maxRetries?: number;
    baseDelayMs?: number;
    maxDelayMs?: number;
    jitter?: boolean;
    retryOn?: (err: unknown) => boolean;
  } = {}
): Promise<T> {
  const {
    maxRetries = 3,
    baseDelayMs = 500,
    maxDelayMs = 30_000,
    jitter = true,
    retryOn = () => true,
  } = options;

  let attempt = 0;
  while (true) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries || !retryOn(err)) throw err;
      // Exponential backoff: 500 ms, 1000 ms, 2000 ms, ...
      let delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      // Add jitter: ±30% of the delay
      if (jitter) delay *= 0.7 + Math.random() * 0.6;
      await sleep(delay);
      attempt++;
    }
  }
}
```

Adding Jitter
Without jitter, all clients that received an error at roughly the same time will retry at the same exponentially-backed-off moment — creating a thundering herd that re-hammers the recovering service in synchronized waves. Jitter adds random noise to the backoff delay so retries are spread across time. AWS recommends full jitter (random between 0 and the current exponential delay) or decorrelated jitter (delay = random between base and prev_delay × 3, capped at the maximum).
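The two AWS-recommended strategies can be sketched as small delay functions; the names and parameters here are illustrative:

```typescript
// Full jitter: pick uniformly between 0 and the exponential delay.
function fullJitter(baseMs: number, attempt: number, capMs: number): number {
  const exp = Math.min(baseMs * 2 ** attempt, capMs);
  return Math.random() * exp;
}

// Decorrelated jitter: each delay is random between the base and
// 3x the previous delay, capped at the maximum.
function decorrelatedJitter(baseMs: number, prevDelayMs: number, capMs: number): number {
  const next = baseMs + Math.random() * (prevDelayMs * 3 - baseMs);
  return Math.min(next, capMs);
}
```

Full jitter gives the widest spread (and the best herd-breaking behavior in AWS's simulations); decorrelated jitter keeps some memory of the previous delay, so a client that has already backed off far tends to stay backed off.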
When NOT to Retry
Retrying Errors That Cannot Recover Makes Things Worse
Never retry: (1) most 4xx client errors (400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found) — the request itself is malformed or unauthorized, and retrying achieves nothing (429 Too Many Requests is the notable exception: back off and retry); (2) non-idempotent operations (POST to create a payment) unless idempotency keys are used; (3) when a circuit breaker is open — the circuit exists to stop retries during known outages.
| Error Type | Retry? | Rationale |
|---|---|---|
| Network timeout / connection reset | Yes | Transient network failure |
| 503 Service Unavailable | Yes (with backoff) | Service temporarily overloaded |
| 429 Too Many Requests | Yes (respect Retry-After header) | Rate limited — wait and retry |
| 500 Internal Server Error | Yes (cautiously) | May be transient |
| 400 Bad Request | No | Client error — fix the request |
| 401 / 403 | No | Auth error — retrying won't help |
| 404 Not Found | No | Resource does not exist |
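The table above can be encoded as a predicate and passed as the `retryOn` option of `retryWithBackoff`. The `HttpError` shape here is an assumption — adapt it to whatever your HTTP client throws:

```typescript
// Hypothetical error shape carrying an HTTP status code.
interface HttpError {
  status?: number;
}

// Classify retryability per the table: errors without a status
// (timeouts, connection resets) are treated as transient.
function isRetryable(err: HttpError): boolean {
  const status = err.status;
  if (status === undefined) return true; // network-level failure
  if (status === 429) return true;       // rate limited: back off and retry
  if (status >= 500) return true;        // 503 and other 5xx: cautiously retryable
  return false;                          // other 4xx: fix the request instead
}
```

Note that a production predicate would also honor the `Retry-After` header on 429/503 responses rather than relying purely on computed backoff.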
Idempotency Keys
To safely retry non-idempotent operations like payment charges, include an idempotency key — a unique ID generated by the client for each logical operation. The server stores the result keyed by this ID. On a retry, the server returns the stored result rather than re-executing the operation. Stripe, Braintree, and most payment APIs support idempotency keys via request headers like `Idempotency-Key: uuid-v4`.
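The server side of this scheme can be sketched as a lookup-before-execute wrapper. This is a minimal in-memory version for illustration; real implementations persist the store with a TTL and handle concurrent requests for the same key:

```typescript
// In-memory store of results keyed by idempotency key (illustrative only).
const storedResults = new Map<string, unknown>();

// Execute `operation` at most once per idempotency key; on a retry,
// replay the stored result instead of re-executing.
function handleIdempotent<T>(key: string, operation: () => T): T {
  if (storedResults.has(key)) {
    return storedResults.get(key) as T; // retry: return the original result
  }
  const result = operation();
  storedResults.set(key, result);
  return result;
}
```

The client's only obligation is to generate the key once per logical operation (e.g., a UUID created when the user clicks "Pay") and reuse it across all retries of that operation — generating a fresh key per retry defeats the mechanism.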
Retry Budgets
At large scale, even 3 retries per original request can multiply load by 4× during an outage. Google's SRE book recommends retry budgets: a client maintains a token bucket where each retry consumes one token, and the budget limits the ratio of retries to original requests (e.g., retries cannot exceed 10% of traffic). This prevents retries from amplifying failures into cascading overload.
Interview Tip
In interviews, always mention jitter when discussing retries — it signals you understand thundering-herd problems. Pair retries with circuit breakers: retries handle transient errors; circuit breakers stop retries when a service is systemically degraded. The sequence of resilience concerns should flow: retry transient errors → circuit break sustained failures → bulkhead to contain blast radius.