Throttling Pattern
Control resource consumption: token bucket, leaky bucket, fixed window, sliding window algorithms. Client-side vs server-side throttling.
Throttling and Rate Limiting
Throttling controls the rate at which a consumer can send requests to a service — protecting the service from overload, ensuring fair resource sharing among clients, and enforcing business policies (e.g., API tier limits). Rate limiting is the mechanism used to implement throttling. When a client exceeds its limit, the server responds with `429 Too Many Requests` and optionally a `Retry-After` header.
Algorithm 1: Fixed Window Counter
The simplest algorithm: count requests in a fixed time window (e.g., 100 requests per minute). Reset the counter at the start of each window. The weakness is the boundary burst problem: a client can send 100 requests at 12:00:59 and 100 more at 12:01:00, for 200 requests in ~2 seconds — double the intended rate.
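A minimal sketch of a fixed window counter (class and method names are illustrative, not from any particular library); timestamps are passed in explicitly so the boundary behavior is easy to see:

```python
class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.current_window = 0  # index of the window we are counting in
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # New window started: reset the counter
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Note that nothing stops a client from exhausting the limit at the end of one window and again at the start of the next, which is exactly the boundary burst described above.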
Algorithm 2: Sliding Window Log
Store a timestamped log of each request. On each new request, remove entries older than the window and count the remaining. Allows exactly N requests in any rolling window. Accurate but memory-intensive: stores every request timestamp, so it scales poorly for high-traffic APIs.
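A sketch of the log-based approach (names are illustrative); a deque keeps timestamps ordered so eviction is a pop from the front:

```python
import collections

class SlidingWindowLog:
    """Sliding window log: one timestamp per request; allow at most
    `limit` requests in any rolling `window`-second interval."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.log = collections.deque()  # request timestamps, oldest first

    def allow(self, now):
        # Evict timestamps that have aged out of the rolling window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The O(N)-per-client memory cost is visible here: every allowed request leaves a timestamp in `log` until it ages out.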
Algorithm 3: Sliding Window Counter
A practical hybrid: combine the current window's count with the previous window's count weighted by how much of the current window has elapsed. For example: `allowed = prev_count × (1 - elapsed/window) + curr_count`. Smooth, memory-efficient, and used in production systems like Cloudflare's rate limiter.
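The weighting formula above can be sketched as follows (a simplified single-client version; names are illustrative). Only two counters are kept per client, which is where the O(1) memory claim comes from:

```python
class SlidingWindowCounter:
    """Sliding window counter: weight the previous window's count by the
    unelapsed fraction of the current window (an approximation of the
    true rolling count)."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.curr_window = 0
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.curr_window:
            # Roll forward; if more than one full window passed,
            # the previous window's count is zero.
            self.prev_count = self.curr_count if window_id == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window_id
        elapsed = (now % self.window) / self.window
        estimated = self.prev_count * (1 - elapsed) + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```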
Algorithm 4: Token Bucket
A bucket holds up to N tokens. Tokens are added at a constant rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket allows bursting up to the bucket capacity while enforcing an average rate. Used by AWS API Gateway, NGINX, and most cloud throttling systems.
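A sketch of a token bucket with lazy refill (no background timer; tokens are topped up from elapsed time on each check — names are illustrative, not any specific library's API):

```python
class TokenBucket:
    """Token bucket: refill at `rate` tokens/second up to `capacity`;
    each request costs one token, so bursts up to `capacity` are allowed
    while the long-run average stays at `rate`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full: burst allowed immediately
        self.last = 0.0

    def allow(self, now):
        # Lazily refill based on time elapsed since the last check
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```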
Algorithm 5: Leaky Bucket
Requests enter a queue (the 'bucket') and are processed at a fixed rate (they 'leak' out). If the queue is full, new requests are dropped. This smooths bursts into a constant output rate — useful when downstream services cannot handle bursts. The key difference from token bucket: leaky bucket enforces a strict output rate; token bucket allows burst then smooths over time.
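A sketch of the queue form of the leaky bucket (simplified: the "leak" happens lazily on each offer rather than on a timer, and fractional leak time is discarded; names are illustrative):

```python
import collections

class LeakyBucket:
    """Leaky bucket: requests join the queue if there is room; queued
    requests drain at a fixed `leak_rate` per second, so the output
    rate is constant regardless of input bursts."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = collections.deque()
        self.last = 0.0

    def _leak(self, now):
        # Release queued requests at the fixed output rate
        leaked = int((now - self.last) * self.leak_rate)
        if leaked > 0:
            self.last = now
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()  # request proceeds downstream

    def offer(self, request, now):
        self._leak(now)
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True   # queued for processing
        return False      # bucket full: request dropped
```

The contrast with the token bucket is visible in the return value: a token bucket admits a burst immediately, while here a burst merely fills the queue and drains at the fixed rate.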
| Algorithm | Burst Handling | Memory | Accuracy | Use Case |
|---|---|---|---|---|
| Fixed Window | Allows 2× burst at boundary | O(1) | Low | Simple APIs, low precision needed |
| Sliding Window Log | No burst (exact) | O(N) per client | High | Low-traffic, high-precision needs |
| Sliding Window Counter | Smooth approximation | O(1) | Good | Production rate limiters (Cloudflare) |
| Token Bucket | Allows burst up to capacity | O(1) | Good | AWS, NGINX, most cloud systems |
| Leaky Bucket | No burst (smoothed output) | O(queue size) | High (exact output rate) | Smoothing traffic to downstream |
Distributed Rate Limiting
In a multi-node deployment, each node maintaining its own counter means a client can send N requests to each of 10 nodes for 10×N total. Distributed rate limiting requires a shared store: Redis with atomic Lua scripts (or `INCR` + `EXPIRE`) is the most common approach. Redis's single-threaded model guarantees atomicity. Alternatively, use a service mesh sidecar (Envoy) or a dedicated rate-limiting service (e.g., Lyft's Ratelimit, Kong).
```lua
-- Redis Lua script for sliding-window-log rate limiting
-- (it stores one sorted-set entry per request, so this is the log
-- variant, not the counter approximation). Called atomically via EVAL.
local key = KEYS[1]              -- e.g. "ratelimit:user:123"
local now = tonumber(ARGV[1])    -- current timestamp (ms)
local window = tonumber(ARGV[2]) -- window size (ms)
local limit = tonumber(ARGV[3])  -- max requests

-- Remove entries older than the window
redis.call("ZREMRANGEBYSCORE", key, 0, now - window)

-- Count remaining entries
local count = redis.call("ZCARD", key)
if count < limit then
  -- Record the current request; note that using `now` as the member
  -- collapses requests arriving in the same millisecond into one entry,
  -- so production scripts append a unique suffix to the member
  redis.call("ZADD", key, now, now)
  redis.call("PEXPIRE", key, window)
  return 1 -- allowed
else
  return 0 -- rejected
end
```
Client-Side vs Server-Side Throttling
Server-side throttling enforces limits at the API gateway or service, responding with 429. It protects the server but doesn't prevent wasted network calls. Client-side throttling proactively limits outgoing request rate to stay within known limits, improving efficiency by not even sending requests that would be rejected. Google's client libraries implement both: they self-throttle based on observed error rates and respect `Retry-After` headers from 429 responses.
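A sketch of a self-throttling client combining both ideas: a local token bucket paces outgoing calls, and a 429 response's `Retry-After` header sets a back-off deadline. The `send` callable and its `(status, headers)` return shape are hypothetical stand-ins for a real HTTP client:

```python
import time

class SelfThrottlingClient:
    """Client-side throttle: pace outgoing calls with a local token bucket
    and back off when the server answers 429 with Retry-After."""

    def __init__(self, send, rate, capacity):
        self.send = send              # callable(request) -> (status, headers)
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.blocked_until = 0.0      # set from Retry-After on a 429

    def call(self, request):
        now = time.monotonic()
        if now < self.blocked_until:
            return None               # still backing off; skip the wasted call
        # Refill the local bucket and spend a token if one is available
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            return None               # self-throttled: no network round trip
        self.tokens -= 1
        status, headers = self.send(request)
        if status == 429:
            # Respect the server's hint before sending anything else
            self.blocked_until = now + float(headers.get("Retry-After", 1))
        return status
```

Returning `None` instead of sending is the efficiency win: the client never spends network round trips on requests it knows would be rejected.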
Interview Tip
For a 'Design a Rate Limiter' question: start with requirements (per-user vs global, per-endpoint, hard vs soft limits), then choose an algorithm (token bucket is usually the right answer for flexibility), then address distribution (Redis atomic scripts), then discuss edge cases (clock skew in distributed systems, handling 429 with Retry-After on the client side, rate limit headers in responses like X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset).
Practice this pattern
Design a rate limiter for an API gateway