Design a Notification System
Multi-channel notifications (push, email, SMS, in-app): template management, user preferences, delivery tracking, rate limiting, and priority handling.
Problem Statement
A notification system delivers messages to users across multiple channels — push notifications (mobile/desktop), email, SMS, and in-app — based on events from other services. The challenges are: routing to the right channel, respecting user preferences, handling third-party delivery failures gracefully, and not spamming users (rate limiting). At scale, a system like Facebook's processes billions of notifications per day.
Requirements
| Functional | Non-Functional |
|---|---|
| Send push, email, SMS, and in-app notifications | 1 B notifications/day |
| User preference management (opt-in/out per channel and type) | Critical alerts delivered in < 5 seconds |
| Template-based messages with variable substitution | At-least-once delivery guarantee |
| Priority levels: critical, high, normal, low | 99.9% delivery success rate |
| Delivery status tracking and retry on failure | No duplicate notifications to users |
| Scheduled notifications | Rate limiting: max 10 push/hour per user |
High-Level Architecture
Message Flow
Channel Types and Third-Party Providers
| Channel | Provider Examples | Latency | Cost | Reliability |
|---|---|---|---|---|
| Push (iOS) | Apple APNs | < 1 sec | Free | High (~99%) |
| Push (Android) | Firebase FCM | < 1 sec | Free | High (~99%) |
| Email | Amazon SES, SendGrid | 1-30 sec | Low ($0.0001/email) | Medium (~98%) |
| SMS | Twilio, AWS SNS | 1-10 sec | High ($0.0075/SMS) | Medium (97%) |
| In-App | Custom (Redis + WebSocket) | < 100 ms | Infrastructure cost | High (controlled) |
User Preferences
Every notification must respect user preferences. A preference matrix maps `(userId, notificationType, channel)` → `enabled`. This is stored in MySQL for durability and cached in Redis for low-latency lookup. Example schema:
CREATE TABLE notification_preferences (
user_id BIGINT NOT NULL,
notif_type VARCHAR(64) NOT NULL, -- e.g., 'order_update', 'marketing', 'chat'
channel VARCHAR(16) NOT NULL, -- 'push', 'email', 'sms', 'in_app'
enabled BOOLEAN DEFAULT TRUE,
updated_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (user_id, notif_type, channel)
);
-- Cache key pattern: prefs:{userId}:{notif_type} → JSON blob of channel flags
-- TTL: 5 minutes (user changes are eventually consistent)
Priority Queues and Rate Limiting
Not all notifications are equal. Use separate Kafka topics by priority: `notif.critical`, `notif.high`, `notif.normal`, `notif.low`. Workers are assigned to topic groups with critical having the most consumers. Rate limiting prevents notification fatigue:
- Per-user rate limit: max 10 push notifications per hour per user (token bucket in Redis).
- Marketing vs. transactional: transactional notifications (order confirmed, payment failed) bypass rate limits; marketing messages are rate-limited.
- Quiet hours: respect user time zones — suppress non-critical notifications during user-defined quiet hours (e.g., 11 PM – 7 AM).
- Deduplication: assign each notification a `notifId` (UUID); check Redis for duplicate before sending. TTL: 24 hours.
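The per-user token bucket from the first bullet can be sketched as follows. In production the bucket state would live in Redis (so all workers share it); the in-process version below just illustrates the refill arithmetic, and the class and method names are illustrative:

```python
import time

class TokenBucket:
    """Per-user limiter: capacity 10 tokens, refilled at 10 tokens/hour."""

    def __init__(self, capacity=10, refill_per_sec=10 / 3600):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket()
results = [bucket.allow() for _ in range(12)]
# The first 10 sends pass; the 11th and 12th are throttled.
```

The bucket allows short bursts up to capacity while enforcing the hourly average, which is why it is preferred over a fixed-window counter for notification traffic.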
Retry and Dead Letter Queue
Third-party providers (FCM, Twilio) can fail temporarily. Implement exponential backoff retries with a cap (e.g., 3 attempts: immediately, 1 min, 5 min). After all retries are exhausted, send to a Dead Letter Queue (DLQ) for monitoring and manual inspection. Alert on DLQ depth exceeding a threshold.
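The retry schedule above can be sketched as follows. Function names are illustrative, and a real worker would re-enqueue the message with a delay rather than retrying inline:

```python
RETRY_DELAYS = [0, 60, 300]  # immediate, then 1 min, then 5 min (seconds)

def deliver_with_retry(send, payload, dlq):
    """Attempt delivery per the backoff schedule; dead-letter on exhaustion."""
    for delay in RETRY_DELAYS:
        # A real worker re-enqueues with `delay` seconds of backoff
        # instead of blocking here.
        try:
            send(payload)
            return True
        except ConnectionError:
            continue
    dlq.append(payload)  # DLQ: monitored and inspected manually
    return False

# Simulate a provider that fails twice, then succeeds.
calls = {"n": 0}
def flaky_send(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("provider unavailable")

dlq = []
ok = deliver_with_retry(flaky_send, {"notifId": "abc"}, dlq)
# Succeeds on the third attempt; nothing lands in the DLQ.
```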
Idempotency Is Critical
With at-least-once delivery semantics (Kafka), a notification worker may process the same event twice. Without deduplication, users receive duplicate notifications. Always check a short-lived Redis key (`notif:{notifId}:sent`) before dispatching. Use `SET notif:{notifId}:sent 1 NX EX 86400` to check and set atomically in one command.
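A runnable sketch of the dedup check, with an in-memory stand-in for Redis so it runs without a server (with redis-py the equivalent call would be `r.set(key, "1", nx=True, ex=86400)`; `FakeRedis` and `dispatch_once` are illustrative names):

```python
class FakeRedis:
    """In-memory stand-in for Redis SET ... NX EX (expiry omitted for brevity)."""

    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None  # key already exists: notification was already sent
        self.store[key] = value
        return True

def dispatch_once(r, notif_id, send):
    # Atomic check-and-set: only the first worker to claim the key dispatches.
    if r.set(f"notif:{notif_id}:sent", "1", nx=True, ex=86400):
        send(notif_id)
        return True
    return False

sent = []
r = FakeRedis()
first = dispatch_once(r, "abc-123", sent.append)   # dispatches
second = dispatch_once(r, "abc-123", sent.append)  # suppressed duplicate
```

The key point is that the existence check and the write happen in one operation; a separate `GET` followed by `SET` would leave a race window between two workers.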
Template System
Notification content is driven by templates with variable substitution. Templates are stored in a database (MySQL) and cached in Redis. Example template for an order confirmation:
Template ID: order_confirmed_push
Channel: push
Subject: "Order Confirmed!"
Body: "Hi {{firstName}}, your order #{{orderId}} has been confirmed.
Estimated delivery: {{deliveryDate}}."
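This placeholder syntax can be rendered with a one-pass regex substitution, as in the minimal sketch below. A production system would use a hardened template engine (e.g., Jinja2) with escaping and missing-variable handling; the `render` helper here is illustrative:

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value from `variables`."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(variables[m.group(1)]),
                  template)

body = ("Hi {{firstName}}, your order #{{orderId}} has been confirmed. "
        "Estimated delivery: {{deliveryDate}}.")
rendered = render(body, {"firstName": "Alice", "orderId": "ORD-12345",
                         "deliveryDate": "Feb 22"})
```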
Runtime rendering:
variables = { firstName: "Alice", orderId: "ORD-12345", deliveryDate: "Feb 22" }
rendered = "Hi Alice, your order #ORD-12345 has been confirmed.
Estimated delivery: Feb 22."
Scaling Considerations
- Kafka partitioning: partition `notif.normal` by `userId` to maintain per-user ordering and avoid multiple workers handling the same user simultaneously (prevents duplicate sends).
- Worker auto-scaling: scale workers based on Kafka consumer lag. Lag > 10K messages → add workers. Critical topic always has minimum N workers.
- Connection pooling to third-party APIs: FCM and SES have connection limits. Use an HTTP connection pool per worker to maximize throughput.
- In-app notification storage: store in Cassandra (`(userId, timestamp)` partition) for inbox functionality. Unread count in Redis counter.
- Delivery analytics: track open rates, click rates, unsubscribes — essential for tuning rate limits and detecting deliverability issues.
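The per-user partitioning in the first bullet reduces to a stable hash of the partition key. Kafka's default partitioner uses murmur2 on the key bytes; the md5-based sketch below (with an illustrative partition count) shows the same idea:

```python
import hashlib

NUM_PARTITIONS = 32  # illustrative partition count for notif.normal

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the key: all of a user's events map to one partition."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the mapping is deterministic, every event for a given user lands on the same partition and is consumed by a single worker in order, which is what prevents two workers from racing to send the same user's notification.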
Interview Tip
This problem is deceptively simple — don't underestimate it. The three concepts interviewers want to see are: (1) decoupled event-driven architecture (services publish events, notification workers consume), (2) user preference lookup with caching, and (3) idempotency/deduplication to prevent duplicate sends. Bonus points for mentioning quiet hours and the DLQ pattern.
Practice this pattern
Design a notification system for millions of users