Throttling Pattern
Control resource consumption: token bucket, leaky bucket, fixed window, sliding window algorithms. Client-side vs server-side throttling.
Throttling and Rate Limiting
Throttling controls the rate at which a consumer can send requests to a service — protecting the service from overload, ensuring fair resource sharing among clients, and enforcing business policies (e.g., API tier limits). Rate limiting is the mechanism used to implement throttling. When a client exceeds its limit, the server responds with `429 Too Many Requests` and optionally a `Retry-After` header.
Algorithm 1: Fixed Window Counter
The simplest algorithm: count requests in a fixed time window (e.g., 100 requests per minute). Reset the counter at the start of each window. The weakness is the boundary burst problem: a client can send 100 requests at 12:00:59 and 100 more at 12:01:00, for 200 requests in ~2 seconds — double the intended rate.
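A minimal sketch of a fixed window counter (class and method names are illustrative, not from any particular library); timestamps are passed in explicitly so the boundary behavior is easy to see:

```python
class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.current_window = 0  # index of the window we are counting in
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # New window started: reset the counter
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Note that nothing stops a client from exhausting the limit at the end of one window and again at the start of the next, which is exactly the boundary burst described above.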
Algorithm 2: Sliding Window Log
Store a timestamped log of each request. On each new request, remove entries older than the window and count the remaining. Allows exactly N requests in any rolling window. Accurate but memory-intensive: stores every request timestamp, so it scales poorly for high-traffic APIs.
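A sketch of the log-based approach (names are illustrative); a deque keeps timestamps ordered so eviction is a pop from the front:

```python
import collections

class SlidingWindowLog:
    """Sliding window log: one timestamp per request; allow at most
    `limit` requests in any rolling `window`-second interval."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.log = collections.deque()  # request timestamps, oldest first

    def allow(self, now):
        # Evict timestamps that have aged out of the rolling window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The O(N)-per-client memory cost is visible here: every allowed request leaves a timestamp in `log` until it ages out.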
Algorithm 3: Sliding Window Counter
A practical hybrid: combine the current window's count with the previous window's count weighted by how much of the current window has elapsed. For example: `allowed = prev_count × (1 - elapsed/window) + curr_count`. Smooth, memory-efficient, and used in production systems like Cloudflare's rate limiter.
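The weighting formula above can be sketched as follows (a simplified single-client version; names are illustrative). Only two counters are kept per client, which is where the O(1) memory claim comes from:

```python
class SlidingWindowCounter:
    """Sliding window counter: weight the previous window's count by the
    unelapsed fraction of the current window (an approximation of the
    true rolling count)."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.curr_window = 0
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.curr_window:
            # Roll forward; if more than one full window passed,
            # the previous window's count is zero.
            self.prev_count = self.curr_count if window_id == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window_id
        elapsed = (now % self.window) / self.window
        estimated = self.prev_count * (1 - elapsed) + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```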
Algorithm 4: Token Bucket
A bucket holds up to N tokens. Tokens are added at a constant rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket allows bursting up to the bucket capacity while enforcing an average rate. Used by AWS API Gateway, NGINX, and most cloud throttling systems.
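A sketch of a token bucket with lazy refill (no background timer; tokens are topped up from elapsed time on each check — names are illustrative, not any specific library's API):

```python
class TokenBucket:
    """Token bucket: refill at `rate` tokens/second up to `capacity`;
    each request costs one token, so bursts up to `capacity` are allowed
    while the long-run average stays at `rate`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full: burst allowed immediately
        self.last = 0.0

    def allow(self, now):
        # Lazily refill based on time elapsed since the last check
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```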
Algorithm 5: Leaky Bucket
Requests enter a queue (the 'bucket') and are processed at a fixed rate (they 'leak' out). If the queue is full, new requests are dropped. This smooths bursts into a constant output rate — useful when downstream services cannot handle bursts. The key difference from token bucket: leaky bucket enforces a strict output rate; token bucket allows burst then smooths over time.
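A sketch of the queue form of the leaky bucket (simplified: the "leak" happens lazily on each offer rather than on a timer, and fractional leak time is discarded; names are illustrative):

```python
import collections

class LeakyBucket:
    """Leaky bucket: requests join the queue if there is room; queued
    requests drain at a fixed `leak_rate` per second, so the output
    rate is constant regardless of input bursts."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = collections.deque()
        self.last = 0.0

    def _leak(self, now):
        # Release queued requests at the fixed output rate
        leaked = int((now - self.last) * self.leak_rate)
        if leaked > 0:
            self.last = now
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()  # request proceeds downstream

    def offer(self, request, now):
        self._leak(now)
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True   # queued for processing
        return False      # bucket full: request dropped
```

The contrast with the token bucket is visible in the return value: a token bucket admits a burst immediately, while here a burst merely fills the queue and drains at the fixed rate.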
| Algorithm | Burst Handling | Memory | Accuracy | Use Case |
|---|---|---|---|---|
| Fixed Window | Allows 2× burst at boundary | O(1) | Low | Simple APIs, low precision needed |
| Sliding Window Log | No burst (exact) | O(N) per client | High | Low-traffic, high-precision needs |
| Sliding Window Counter | Smooth approximation | O(1) | Good | Production rate limiters (Cloudflare) |
| Token Bucket | Allows burst up to capacity | O(1) | Good | AWS, NGINX, most cloud systems |
| Leaky Bucket | No burst (smoothed output) | O(queue size) | High (exact output rate) | Smoothing traffic to downstream |
Distributed Rate Limiting
In a multi-node deployment, each node maintaining its own counter means a client can send N requests to each of 10 nodes for 10×N total. Distributed rate limiting requires a shared store: Redis with atomic Lua scripts (or `INCR` + `EXPIRE`) is the most common approach. Redis's single-threaded model guarantees atomicity. Alternatively, use a service mesh sidecar (Envoy) or a dedicated rate-limiting service (e.g., Lyft's Ratelimit, Kong).
```lua
-- Redis Lua script for sliding-window-log rate limiting
-- (it stores one sorted-set entry per request, so this is the log
-- variant, not the counter approximation). Called atomically via EVAL.
local key = KEYS[1]              -- e.g. "ratelimit:user:123"
local now = tonumber(ARGV[1])    -- current timestamp (ms)
local window = tonumber(ARGV[2]) -- window size (ms)
local limit = tonumber(ARGV[3])  -- max requests

-- Remove entries older than the window
redis.call("ZREMRANGEBYSCORE", key, 0, now - window)

-- Count remaining entries
local count = redis.call("ZCARD", key)
if count < limit then
  -- Record the current request; note that using `now` as the member
  -- collapses requests arriving in the same millisecond into one entry,
  -- so production scripts append a unique suffix to the member
  redis.call("ZADD", key, now, now)
  redis.call("PEXPIRE", key, window)
  return 1 -- allowed
else
  return 0 -- rejected
end
```
Client-Side vs Server-Side Throttling
Server-side throttling enforces limits at the API gateway or service, responding with 429. It protects the server but doesn't prevent wasted network calls. Client-side throttling proactively limits outgoing request rate to stay within known limits, improving efficiency by not even sending requests that would be rejected. Google's client libraries implement both: they self-throttle based on observed error rates and respect `Retry-After` headers from 429 responses.
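A sketch of a self-throttling client combining both ideas: a local token bucket paces outgoing calls, and a 429 response's `Retry-After` header sets a back-off deadline. The `send` callable and its `(status, headers)` return shape are hypothetical stand-ins for a real HTTP client:

```python
import time

class SelfThrottlingClient:
    """Client-side throttle: pace outgoing calls with a local token bucket
    and back off when the server answers 429 with Retry-After."""

    def __init__(self, send, rate, capacity):
        self.send = send              # callable(request) -> (status, headers)
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.blocked_until = 0.0      # set from Retry-After on a 429

    def call(self, request):
        now = time.monotonic()
        if now < self.blocked_until:
            return None               # still backing off; skip the wasted call
        # Refill the local bucket and spend a token if one is available
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            return None               # self-throttled: no network round trip
        self.tokens -= 1
        status, headers = self.send(request)
        if status == 429:
            # Respect the server's hint before sending anything else
            self.blocked_until = now + float(headers.get("Retry-After", 1))
        return status
```

Returning `None` instead of sending is the efficiency win: the client never spends network round trips on requests it knows would be rejected.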
Interview Tip
For a 'Design a Rate Limiter' question: start with requirements (per-user vs global, per-endpoint, hard vs soft limits), then choose an algorithm (token bucket is usually the right answer for flexibility), then address distribution (Redis atomic scripts), then discuss edge cases (clock skew in distributed systems, handling 429 with Retry-After on the client side, rate limit headers in responses like X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset).
Practice this pattern
Design a rate limiter for an API gateway