Design a Notification System: Message Queues, Pub/Sub & System Design

Async messaging: the producer sends to the queue and continues immediately (fire-and-forget). The queue buffers the message. The consumer processes it later. The producer is never blocked.

Q2 A consumer should send an ACK to the broker:

A. Before starting to process the message
B. Only after the message has been fully processed and side effects committed
C. Immediately upon receiving the message
D. ACK is not needed in message queues

✓ CORRECT: B

ACK only after processing is complete and side effects are committed. If you ACK before processing and then crash, the message is lost — the broker already removed it.
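
The ordering above can be sketched with an in-memory stand-in for a broker. The `InMemoryBroker` class and its `receive`, `ack`, and `consumer_crashed` methods are hypothetical names for illustration, not a real client API.

```python
from collections import deque


class InMemoryBroker:
    """Hypothetical stand-in for a broker: holds messages until they are ACKed."""

    def __init__(self, messages):
        self.pending = deque(messages)  # (msg_id, body) pairs waiting for delivery
        self.in_flight = {}             # delivered but not yet ACKed

    def receive(self):
        msg_id, body = self.pending.popleft()
        self.in_flight[msg_id] = body
        return msg_id, body

    def ack(self, msg_id):
        # Only now does the broker forget the message.
        del self.in_flight[msg_id]

    def consumer_crashed(self):
        # Anything still un-ACKed goes back on the queue for redelivery.
        for msg_id, body in self.in_flight.items():
            self.pending.append((msg_id, body))
        self.in_flight.clear()


def consume_one(broker, handler):
    msg_id, body = broker.receive()
    handler(body)        # 1. do the work and commit side effects
    broker.ack(msg_id)   # 2. ACK only after step 1 succeeded
```

If `handler` raises, `ack` is never reached, so the message stays in flight and can be redelivered instead of being lost.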

Q3 If a consumer crashes before sending an ACK, the message:

A. Is permanently lost
B. Becomes visible again after the visibility timeout and is redelivered
C. Is automatically sent to the DLQ
D. Is deleted from the queue

✓ CORRECT: B

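In SQS, the visibility timeout expires and the message becomes visible again; in RabbitMQ, an unacknowledged message is requeued when the consumer's channel or connection closes. Either way, the message is redelivered, possibly to another consumer. This is the at-least-once guarantee in action.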

Q4 The competing consumers pattern distributes work by:

A. Sending each message to all consumers
B. Having multiple consumers pull from the same queue, each getting different messages
C. Having the producer decide which consumer gets each message
D. Running all consumers on the same server

✓ CORRECT: B

Competing consumers: multiple consumers pull from the same queue. The broker delivers each message to exactly one consumer. Work is distributed automatically (round-robin or least-busy).

Q5 When queue depth is consistently growing, you should:

A. Increase the message TTL
B. Add more consumer instances to increase processing throughput
C. Reduce the number of producers
D. Delete messages from the queue

✓ CORRECT: B

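Growing queue depth means consumers cannot keep up. Add more consumer instances: more workers means more parallel processing, so the queue drains faster. This is the primary scaling mechanism (in Kafka it works only up to the partition count, since extra consumers in a group sit idle).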

Q6 The primary advantage of async messaging over sync HTTP is:

A. Faster response time for the end user
B. Decoupling: producer and consumer operate independently, queue buffers spikes
C. Simpler infrastructure
D. Stronger consistency guarantees

✓ CORRECT: B

Async messaging decouples producer from consumer: they operate independently, the queue buffers traffic spikes, and if the consumer is down, messages wait instead of causing producer failures.

Q7 A visibility timeout in SQS ensures:

A. Messages are encrypted during transit
B. A message being processed is hidden from other consumers until ACKed or timeout expires
C. Messages expire after a fixed time
D. Only one producer can send at a time

✓ CORRECT: B

Visibility timeout: while a consumer is processing a message, it is hidden from other consumers. In SQS the ACK is a DeleteMessage call; if the consumer does not delete the message within the timeout, it becomes visible again for redelivery.
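
A small simulation may make the timeout behaviour concrete. This is a sketch only: `VisibilityQueue` is a hypothetical in-memory class, and the clock is passed in explicitly as `now` (seconds) so the behaviour is easy to follow.

```python
class VisibilityQueue:
    """Hypothetical in-memory queue with an SQS-style visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # msg_id -> (body, invisible_until)

    def send(self, msg_id, body):
        self.messages[msg_id] = (body, 0.0)

    def receive(self, now):
        # Return the first currently visible message and hide it for the timeout.
        for msg_id, (body, invisible_until) in self.messages.items():
            if now >= invisible_until:
                self.messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        # The consumer's "ACK": the message is removed for good.
        self.messages.pop(msg_id, None)
```

A message received at `now=0` with a 30-second timeout is hidden from a second `receive` at `now=10`, and is handed out again at `now=30` unless it was deleted in between.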

Q8 Which scenario is best suited for a message queue?

A. User login authentication
B. Real-time search query
C. Sending order confirmation emails after purchase
D. Fetching a user's profile page

✓ CORRECT: C

Sending emails after purchase is async: the user does not wait for email delivery. Login, search, and profile fetch all need immediate responses — these are synchronous operations.

Section B · 7 Questions

Pub/Sub & Kafka

Q9 In Kafka, a consumer group receives:

A. Only one message from the entire topic
B. All messages in the topic, with partitions distributed among group members
C. A random subset of messages
D. Messages only from one partition

✓ CORRECT: B

A consumer group receives ALL messages in the topic. Partitions are distributed among group members — each member handles a subset of partitions but the group as a whole gets everything.

Q10 If a Kafka topic has 6 partitions and a consumer group has 3 consumers, each consumer handles:

A. 6 partitions each
B. 2 partitions each
C. 1 partition each
D. All partitions share all consumers

✓ CORRECT: B

6 partitions / 3 consumers = 2 partitions per consumer. Kafka distributes partitions as evenly as it can within a consumer group. Adding a 4th consumer spreads 6 partitions across 4 consumers: two consumers get 2 partitions and two get 1.
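
The arithmetic can be checked with a short sketch. The `assign_partitions` function is a hypothetical round-robin illustration; Kafka's real assignors (range, round-robin, sticky) may place specific partitions differently, but the per-consumer counts come out the same.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition numbers over consumers round-robin (illustrative only)."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


three = assign_partitions(6, ["c1", "c2", "c3"])        # 2 partitions each
four = assign_partitions(6, ["c1", "c2", "c3", "c4"])   # two get 2, two get 1
```

With more consumers than partitions, some lists stay empty: those consumers are idle.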

Q11 Kafka guarantees message ordering:

A. Across all partitions in a topic
B. Within a single partition only
C. Only if there is one consumer
D. Never: Kafka does not guarantee ordering

✓ CORRECT: B

Ordering is guaranteed within a single partition (messages appended in order, consumed in order). Across partitions, there is no ordering guarantee. Use partition keys for per-entity ordering.

Q12 To ensure all events for order #42 are processed in order, you should:

A. Use a single partition for the entire topic
B. Use order_id as the partition key so all events for order #42 go to the same partition
C. Sort events in the consumer before processing
D. Use multiple consumer groups

✓ CORRECT: B

Using order_id as partition key ensures all events for order #42 hash to the same partition. Within that partition, events are ordered (created → paid → shipped → delivered).
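
A minimal sketch of key-based partitioning follows. It assumes a hash-mod-partitions scheme; Kafka's default Java partitioner uses murmur2, and CRC32 is swapped in here only to keep the example within the standard library. `partition_for` is a hypothetical helper name.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a partition key to a partition number deterministically.

    CRC32 is a stand-in hash for this demo, not Kafka's actual murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Every event for the same order lands on the same partition.
events = ["created", "paid", "shipped", "delivered"]
partitions = {partition_for("order-42", 6) for _ in events}
```

Because the mapping depends on the partition count, changing the number of partitions can move a key to a different partition, which is worth mentioning when discussing ordering.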

Q13 Multiple consumer groups subscribing to the same Kafka topic implements:

A. Point-to-point messaging
B. Publish-subscribe (each group gets all messages independently)
C. Request-reply pattern
D. Load balancing within one service

✓ CORRECT: B

Multiple consumer groups = pub/sub. Each group independently receives all messages. Within each group, it is a work queue. This is how Kafka combines both patterns in one system.

Q14 Kafka retains messages after consumption because:

A. It is a bug in Kafka
B. It enables replay: consumers can reset their offset to reprocess historical events
C. Messages cannot be deleted in Kafka
D. Consumers never actually read the messages

✓ CORRECT: B

Kafka retains messages (default 7 days) to enable replay. Consumers can reset their offset to reprocess historical events — invaluable for debugging, backfilling data, and rebuilding state.

Q15 Adding a new consumer group to an existing Kafka topic requires:

A. Modifying the producer code and redeploying
B. Zero changes to the producer; the new group independently consumes from the topic
C. Deleting and recreating the topic
D. Stopping all existing consumers

✓ CORRECT: B

New consumer groups subscribe independently. The producer has no knowledge of its consumers. Zero code changes to the producer. This is the core decoupling benefit of event-driven architecture.

Section C · 8 Questions

Retries, DLQ & Delivery

Q16 Exponential backoff with jitter retries at intervals of approximately:

A. 0s, 0s, 0s, 0s (immediate)
B. 5s, 5s, 5s, 5s (fixed)
C. ~1s, ~2s, ~4s, ~8s (doubling + random offset)
D. 60s, 60s, 60s, 60s

✓ CORRECT: C

Exponential backoff doubles the wait: ~1s, ~2s, ~4s, ~8s. Jitter adds a random offset to prevent synchronized retries. This is a widely used retry strategy; the AWS SDKs, gRPC, and Stripe's client libraries all apply a form of it.

Q17 Random jitter is added to exponential backoff to prevent:

A. Messages from expiring
B. Synchronized retry storms where many consumers retry at the exact same time
C. The queue from growing
D. Duplicate messages

✓ CORRECT: B

Without jitter, 1000 consumers all retry at exactly 1s, 2s, 4s — creating synchronized storms. Jitter spreads retries over time, smoothing the load on the recovering service.
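
One way to compute such delays is sketched below. The function name `backoff_delay`, the 60-second cap, and the choice of adding up to 50% jitter on top of the exponential delay are assumptions for illustration; other jitter schemes (such as picking uniformly between 0 and the exponential delay) are also common.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Delay in seconds before retry number `attempt` (0-based).

    Doubles each attempt, is capped at `cap`, then gets a random offset of
    up to 50% so that many clients do not retry at the same instant.
    """
    exponential = min(cap, base * (2 ** attempt))
    return exponential + rng.uniform(0, exponential * 0.5)


# Roughly 1s, 2s, 4s, 8s, each with a different random offset.
delays = [backoff_delay(attempt) for attempt in range(4)]
```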

Q18 A Dead Letter Queue (DLQ) stores:

A. All messages in the system
B. Messages that failed processing after all retry attempts are exhausted
C. Messages waiting to be delivered
D. Encrypted messages only

✓ CORRECT: B

DLQ stores messages that exhausted all retries. These need manual investigation: the consumer has a bug, the data is invalid, or a dependency is permanently down.

Q19 When a message has an invalid JSON schema (permanent error), the consumer should:

A. Retry with exponential backoff
B. Send it to the DLQ immediately, since retrying will never fix a schema error
C. Ignore the message silently
D. Restart the consumer

✓ CORRECT: B

Invalid JSON is a permanent error — retrying will never fix it. Send to DLQ immediately to avoid wasting retry resources. Only retry transient errors (timeout, 429, 503).

Q20 At-least-once delivery means:

A. Messages may be lost but never duplicated
B. Every message is delivered at least once; duplicates are possible
C. Every message is delivered exactly once
D. Messages are never delivered

✓ CORRECT: B

At-least-once: the broker retries until ACKed. If the consumer processes but crashes before ACKing, the message is redelivered (duplicate). Consumers must be idempotent.

Q21 To handle at-least-once delivery, consumers should be:

A. Stateless
B. Idempotent: processing the same message twice produces the same result
C. Single-threaded
D. Connected to multiple queues

✓ CORRECT: B

Idempotent consumers produce the same result when processing the same message twice. Use a deduplication check (message_id in Redis SET) to detect and skip duplicates.
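
The deduplication check can be sketched as a wrapper around the message handler. A plain Python set stands in for the Redis SET here, and `make_idempotent` is a hypothetical helper name.

```python
def make_idempotent(handler, seen_ids=None):
    """Wrap `handler` so that a repeated message_id is skipped.

    `seen_ids` is an in-memory stand-in for a shared store such as Redis.
    """
    seen = set() if seen_ids is None else seen_ids

    def wrapper(message_id, payload):
        if message_id in seen:
            return False          # duplicate delivery: skip
        handler(payload)          # do the work first
        seen.add(message_id)      # then remember that it was done
        return True

    return wrapper


sent = []
send_once = make_idempotent(sent.append)
first = send_once("msg-1", "welcome email")
second = send_once("msg-1", "welcome email")  # redelivered duplicate
```

Note that a crash between `handler(payload)` and `seen.add(message_id)` still allows one repeat, so the handler's own side effects should tolerate being applied twice where possible.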

Q22 A poison message is a message that:

A. Contains sensitive data
B. Crashes the consumer every time it is processed, causing an infinite retry loop
C. Has been in the queue too long
D. Was sent by an unauthorized producer

✓ CORRECT: B

A poison message crashes the consumer on every attempt, creating an infinite retry loop. Solution: track per-message retry count. After N failures, send to DLQ and continue.
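
Retry limits and error classification can be combined in one loop, sketched below. `TransientError`, `PermanentError`, and `process_with_retries` are hypothetical names, the DLQ is a plain list, and the backoff sleep between attempts is omitted to keep the example short.

```python
class TransientError(Exception):
    """Worth retrying: timeout, HTTP 429, HTTP 503."""


class PermanentError(Exception):
    """Never fixed by retrying: malformed payload, schema violation."""


def process_with_retries(message, handler, dlq, max_attempts=3):
    """Return True on success; otherwise park the message in `dlq` and return False."""
    for _attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except PermanentError:
            break       # retrying cannot help: go straight to the DLQ
        except Exception:
            continue    # treated as transient; a real worker would back off here
    dlq.append(message)
    return False
```

A poison message that fails on every attempt ends up in `dlq` after `max_attempts` tries, and the consumer moves on to the next message instead of looping forever.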

Q23 The best DLQ monitoring practice is:

A. Check the DLQ once a month
B. Alert immediately when any message arrives in the DLQ (it means something is broken)
C. Ignore the DLQ unless users complain
D. Automatically delete DLQ messages after 1 hour

✓ CORRECT: B

DLQ messages mean something is broken. Alert immediately (PagerDuty/Slack). An unmonitored DLQ is a data graveyard. Review regularly, fix bugs, and replay.

Section D · 7 Questions

Event-Driven Architecture

Q24 In event-driven architecture, services communicate by:

A. Calling each other's HTTP APIs directly
B. Publishing events to a central bus (Kafka); consumers react independently
C. Sharing a single database
D. Sending files to each other

✓ CORRECT: B

Event-driven: services publish events to a central bus (Kafka). Consumers react independently. No direct service-to-service calls. Loose coupling, independent scaling.

Q25 The key benefit of event-driven over request-driven architecture is:

A. Faster individual request latency
B. Loose coupling: adding a new consumer requires zero changes to the publisher
C. Simpler debugging
D. Stronger consistency

✓ CORRECT: B

Key benefit: loose coupling. Adding a new consumer (e.g., Fraud Detection) requires zero changes to any existing service. The new service just subscribes to relevant topics.

Q26 The Saga pattern is used for:

A. Caching data across services
B. Distributed transactions across microservices using compensating actions
C. Load balancing between servers
D. Database indexing

✓ CORRECT: B

Saga pattern: distributed transactions as a sequence of local transactions + compensating actions. If step 3 fails, compensating events undo steps 1 and 2 (eventual rollback).

Q27 If step 3 of a 5-step Saga fails, the system:

A. Retries step 3 forever
B. Executes compensating actions to undo steps 1 and 2, then cancels
C. Ignores the failure and continues with step 4
D. Rolls back the database automatically (ACID)

✓ CORRECT: B

Saga compensation: when step 3 fails, compensating actions fire in reverse order to undo steps 2 and 1 (e.g., refund the payment, then cancel the order). The end state is consistent, but only eventually, not immediately.
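
The compensation flow can be sketched as a loop over (action, compensation) pairs. This is a single-process illustration with a hypothetical `run_saga` helper; in a real system each step would be a local transaction in a separate service, coordinated through events or an orchestrator.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action fails, run the compensations of the completed steps in
    reverse order and return False. Return True if every action succeeds.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def reserve_stock():
    raise RuntimeError("out of stock")  # step 3 fails


ok = run_saga([
    (lambda: log.append("create_order"), lambda: log.append("cancel_order")),
    (lambda: log.append("charge_payment"), lambda: log.append("refund_payment")),
    (reserve_stock, lambda: log.append("release_stock")),
])
```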

Q28 An event should be named as a past-tense fact because:

A. It is a naming convention with no practical benefit
B. It describes what happened (fact), not what should happen (command), enabling independent consumer decisions
C. Past tense is easier to type
D. Kafka requires past-tense event names

✓ CORRECT: B

Events are facts ('order.created' = this happened). Each consumer independently decides how to react. Commands ('send_email') couple the publisher to specific consumer behavior.

Q29 The main trade-off of event-driven vs request-driven architecture is:

A. Event-driven is slower
B. Event-driven has eventual consistency and harder debugging, but better decoupling and scalability
C. Request-driven scales better
D. There is no trade-off

✓ CORRECT: B

Event-driven trade-offs: eventual consistency (not immediate), harder debugging (trace events across services). Benefits: loose coupling, independent scaling, failure isolation.

Q30 For a system design interview, the default message queue choice should be:

A. Redis Pub/Sub
B. Apache Kafka (supports both work queue and pub/sub, replay, high throughput)
C. A custom-built queue
D. Email as a message queue

✓ CORRECT: B

Kafka is the default for system design interviews. It supports both work queue (single consumer group) and pub/sub (multiple groups), message replay, high throughput, and is industry standard.

Part 2

Design a Notification System

Notification systems are one of the most common system design interview questions because they require every concept from this class: message queues for decoupling, pub/sub for fan-out, retries with backoff for reliability, dead letter queues for error handling, and event-driven architecture for extensibility. This exercise walks through a production-grade design that handles 10 billion notifications per day across push, email, SMS, and in-app channels.


Figure 1: Notification system overview — 6 trigger event types, 4 delivery channels, 500M users, 10B notifications/day

SCALE

Scale Context

500M users × 20 notifications/day average = 10B notifications/day. Peak: 200K notifications/second (during events like New Year, flash sales). Each notification may be delivered on multiple channels (push + in-app = 2 deliveries per notification). Total deliveries: ~15B/day. Latency target: <1 second from event to device.
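
The numbers above can be checked with a few lines of arithmetic. All inputs come from the paragraph; the variable names are just labels for this sketch.

```python
USERS = 500_000_000
NOTIFICATIONS_PER_USER_PER_DAY = 20
SECONDS_PER_DAY = 86_400
PEAK_PER_SECOND = 200_000
DELIVERIES_PER_DAY = 15_000_000_000  # the ~15B/day figure from the text

notifications_per_day = USERS * NOTIFICATIONS_PER_USER_PER_DAY
average_per_second = notifications_per_day / SECONDS_PER_DAY
peak_to_average = PEAK_PER_SECOND / average_per_second
deliveries_per_notification = DELIVERIES_PER_DAY / notifications_per_day
```

So the 10B/day figure corresponds to an average of roughly 116K notifications per second, the stated peak is about 1.7 times that average, and ~15B deliveries implies about 1.5 channel deliveries per notification.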

STEP 1

Architecture

Figure 2: Complete architecture — trigger services → Kafka → Notification Service → per-channel queues → providers

The architecture follows the event-driven pattern. Any service in the system can trigger a notification by publishing an event to Kafka (topic: notification.events). The Notification Service consumes events, checks user preferences, applies rate limiting and deduplication, renders the message using templates, and enqueues to the appropriate channel queue(s).

Each channel has its own dedicated queue and worker fleet that handles the specific provider API (APNS for iOS push, FCM for Android, SendGrid for email, Twilio for SMS, WebSocket for in-app).

Why Per-Channel Queues?

Each delivery channel has different characteristics. Push is fast but rate-limited by Apple/Google. Email is slow but high-volume. SMS is expensive and very limited. In-app is instant for online users but requires store-and-forward for offline users.

Separate queues allow independent scaling (more push workers during a viral event), independent retry strategies (email retries over hours, push retries over minutes), and independent DLQs — an email bounce does not affect push delivery.

STEP 2

Event Schema & Routing

Figure 3: Notification event schema — event_id, type, channels, priority, and idempotency_key for deduplication

Every notification event contains: event_id (unique identifier for tracking), type (determines the template, e.g., social.new_follower), user_id (recipient), data (dynamic content like follower_name), channels (which delivery channels to use), priority (normal/high/critical), and idempotency_key (prevents duplicate notifications).

The Notification Service uses the type to select a message template, the channels field (intersected with user preferences) to determine delivery targets, and the idempotency_key to skip already-sent notifications.
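
A possible shape for the event and the channel intersection is sketched below. The field names follow the schema described above; the class name `NotificationEvent`, the function `resolve_channels`, and the sample values other than `social.new_follower`, `follower_name`, and `follow_99_42` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationEvent:
    event_id: str          # unique identifier for tracking
    type: str              # selects the template
    user_id: str           # recipient
    data: dict             # dynamic template content
    channels: tuple        # requested delivery channels
    priority: str          # "normal", "high", or "critical"
    idempotency_key: str   # used for deduplication


def resolve_channels(event, enabled_channels):
    """Keep only the requested channels that the user has enabled."""
    return [c for c in event.channels if c in enabled_channels]


event = NotificationEvent(
    event_id="evt_1",
    type="social.new_follower",
    user_id="42",
    data={"follower_name": "Alex"},
    channels=("push", "email", "in_app"),
    priority="normal",
    idempotency_key="follow_99_42",
)
targets = resolve_channels(event, enabled_channels={"push", "in_app"})
```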

STEP 3

User Preferences & Quiet Hours

Figure 4: User preference matrix — per event type, per channel. Plus quiet hours for push/SMS delay.

Users control which notifications they receive on which channels. This preference matrix is stored in Redis for sub-millisecond lookups (key: user_prefs:{user_id}, value: JSON). The Notification Service intersects the event's requested channels with the user's enabled channels.

Quiet Hours: Users can set a Do Not Disturb window (e.g., 10 PM – 8 AM). During quiet hours, push notifications and SMS are delayed and batched for morning delivery. Email is unaffected (read at user's convenience). Critical notifications (security alerts, OTP codes) bypass quiet hours completely.
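
The quiet-hours rule can be expressed as a small predicate. This sketch assumes `now` is already in the user's local time and uses the 10 PM to 8 AM example window as the default; `in_quiet_hours` and `should_delay` are hypothetical helper names.

```python
from datetime import time

QUIET_CHANNELS = {"push", "sms"}  # email and in-app are not delayed


def in_quiet_hours(now, start=time(22, 0), end=time(8, 0)):
    """True if `now` falls inside the window, including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_delay(channel, priority, now, start=time(22, 0), end=time(8, 0)):
    """Decide whether a notification is held back until the window ends."""
    if priority == "critical" or channel not in QUIET_CHANNELS:
        return False  # security alerts and OTP codes always go out
    return in_quiet_hours(now, start, end)
```

For example, a normal-priority push at 23:30 is delayed, while a critical push at the same time is sent immediately.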

STEP 4

Rate Limiting & Deduplication

Figure 5: Rate limits per channel, idempotency-based deduplication, and notification batching/aggregation

Rate Limiting: Each channel has per-user rate limits — push 10/hour, email 5/day, SMS 3/day, in-app 50/hour. Implemented with Redis counters: INCR rate:{user_id}:{channel}:{window} with TTL matching the window. When the limit is exceeded, the notification is dropped (low priority) or queued for the next window (high priority).

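Deduplication: Every event has an idempotency_key. Before sending, the Notification Service checks Redis: SISMEMBER sent_notifs:{user_id} "follow_99_42". If the member exists, the notification was already sent (skip). If not, send and SADD it. Note that Redis expires whole keys, not individual set members, so a true 24-hour TTL per notification needs one key per idempotency key, for example SET dedup:{idempotency_key} 1 NX EX 86400 (key name illustrative), which also makes the check and the write a single atomic step. This prevents duplicates from Kafka retries or duplicate events from upstream services.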
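
Both mechanisms can be sketched against a tiny in-memory stand-in for Redis. `FakeRedis` is hypothetical and implements only the behaviour needed here, with the clock passed in as `now` (seconds). Deduplication uses one key per idempotency key (the equivalent of SET with NX and EX), because Redis expires whole keys rather than individual set members. The `dedup:` key prefix and the function names are assumptions.

```python
class FakeRedis:
    """In-memory stand-in for the few Redis behaviours used below (not a real client)."""

    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def _get(self, key, now):
        value, expires_at = self.store.get(key, (None, None))
        if expires_at is not None and now >= expires_at:
            self.store.pop(key, None)
            return None
        return value

    def incr_with_ttl(self, key, ttl, now):
        # Mirrors INCR, with EXPIRE set when the counter is first created.
        current = self._get(key, now)
        if current is None:
            self.store[key] = (1, now + ttl)
            return 1
        self.store[key] = (current + 1, self.store[key][1])
        return current + 1

    def set_nx_ex(self, key, ttl, now):
        # Mirrors SET key 1 NX EX ttl: True only if the key did not exist.
        if self._get(key, now) is not None:
            return False
        self.store[key] = (1, now + ttl)
        return True


# channel -> (limit, window in seconds), taken from the limits in the text
LIMITS = {"push": (10, 3600), "email": (5, 86400), "sms": (3, 86400), "in_app": (50, 3600)}


def allow(redis, user_id, channel, now):
    """Fixed-window counter: True while the user is within the channel's limit."""
    limit, window = LIMITS[channel]
    key = f"rate:{user_id}:{channel}:{int(now // window)}"
    return redis.incr_with_ttl(key, window, now) <= limit


def first_time(redis, idempotency_key, now):
    """True the first time a key is seen within 24 hours, False for duplicates."""
    return redis.set_nx_ex(f"dedup:{idempotency_key}", 86400, now)
```

A fixed window is the simplest variant and allows short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of more state.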

Interview Tip: Always Mention Rate Limiting + Dedup

These two features separate a good notification design from a great one: "I implement per-user per-channel rate limits in Redis to prevent notification spam. I use idempotency keys for deduplication to prevent duplicates from Kafka retries. I batch high-frequency events into aggregated notifications ('48 people liked your photo') using a 5-minute buffer window." Interviewers love this level of detail.
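
The batching idea in the tip can be sketched as grouping events into 5-minute windows per user and event type. This is a simplified illustration with hypothetical names (`batch_events`, `render_like_summary`), and it assumes fixed windows aligned to the clock rather than a window that starts at the first event.

```python
from collections import defaultdict


def batch_events(events, window_seconds=300):
    """Group (timestamp, user_id, event_type, actor) tuples by user, type, and window."""
    buckets = defaultdict(list)
    for timestamp, user_id, event_type, actor in events:
        window = int(timestamp // window_seconds)
        buckets[(user_id, event_type, window)].append(actor)
    return dict(buckets)


def render_like_summary(actors):
    """Turn a batch of likes into a single notification text."""
    if len(actors) == 1:
        return f"{actors[0]} liked your photo"
    return f"{len(actors)} people liked your photo"


events = [
    (0, "u1", "like", "ana"),
    (10, "u1", "like", "ben"),
    (299, "u1", "like", "cy"),
    (300, "u1", "like", "dee"),  # falls into the next window
]
batches = batch_events(events)
summary = render_like_summary(batches[("u1", "like", 0)])
```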

STEP 5

Push Notification Delivery

Figure 6: Push delivery flow — Notification Service → Device Registry → Platform Router → APNS/FCM → User's phone

Push notification delivery requires a device token registry. When a user installs the app, the device registers its push token (APNS token for iOS, FCM token for Android) with the backend. These tokens are stored in Redis: device_tokens:{user_id} = set of tokens.

Users with multiple devices (phone + tablet) have multiple tokens. The push worker looks up all tokens for the user, routes to the correct platform (iOS → Apple APNS, Android → Google FCM), and sends. If APNS/FCM returns an "invalid token" error, the token is removed from the registry (user uninstalled the app or the token was refreshed).
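
The lookup, routing, and cleanup steps can be sketched as follows. The registry is a plain dict standing in for the Redis set, and the sender callables stand in for the APNS and FCM clients; their names, the `(platform, token)` tuple layout, and the status strings are assumptions, not real provider APIs.

```python
def send_push(user_id, registry, senders, payload):
    """Send `payload` to every registered device of `user_id`.

    registry: dict of user_id -> set of (platform, token)
    senders:  dict of platform -> callable(token, payload) returning a status string
    Returns the number of devices that accepted the notification.
    """
    delivered = 0
    # sorted() makes a copy, so tokens can be removed while looping.
    for platform, token in sorted(registry.get(user_id, set())):
        status = senders[platform](token, payload)
        if status == "ok":
            delivered += 1
        elif status == "invalid_token":
            # App uninstalled or token refreshed: drop it from the registry.
            registry[user_id].discard((platform, token))
    return delivered
```

Other failure statuses are left alone here; in the full design they would go through the push channel's retry policy.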

STEP 6

Per-Channel Retry & DLQ Strategy

Figure 7: Each channel has different retry strategies based on its provider's error types and characteristics

Each delivery channel has its own retry strategy because failure modes differ drastically. Push retries are fast because APNS/FCM responses are immediate. Email retries are slow because email delivery is inherently delayed. In-app uses store-and-forward: if the user is offline, the notification is stored in a Redis list and delivered when the user's WebSocket reconnects.

Channel	Max Retries	Backoff	Permanent Errors → DLQ	Provider
Push	3	1s, 5s, 15s	Invalid token (remove + DLQ)	APNS (iOS), FCM (Android)
Email	5	30s, 60s, 5m, 30m, 1h	Hard bounce (mark invalid + DLQ)	SendGrid, AWS SES
SMS	3	5s, 30s, 5m	Invalid number (DLQ)	Twilio, AWS SNS
In-App	2	Store + forward	N/A (persisted in Redis)	WebSocket, SSE
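
The retry schedules in the table can be expressed as data plus a lookup, sketched below. Delays are in seconds and copied from the table; in-app is left out because it uses store-and-forward instead of timed retries. `RETRY_SCHEDULES` and `next_delay` are hypothetical names.

```python
RETRY_SCHEDULES = {
    "push": [1, 5, 15],
    "email": [30, 60, 300, 1800, 3600],
    "sms": [5, 30, 300],
}


def next_delay(channel, failed_attempts):
    """Seconds to wait before the next retry, or None when retries are exhausted.

    `failed_attempts` counts failures so far (1 after the first failure).
    None means the message should go to that channel's DLQ.
    """
    schedule = RETRY_SCHEDULES[channel]
    if failed_attempts < 1 or failed_attempts > len(schedule):
        return None
    return schedule[failed_attempts - 1]
```

Permanent errors such as an invalid token or a hard bounce would skip this lookup and go to the DLQ directly.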

STEP 7

Analytics & Monitoring

Figure 8: Notification funnel — track created → filtered → sent → delivered → opened/clicked at each stage

Every notification passes through a tracking funnel: Created (event received) → Filtered (rate-limited or preference-blocked) → Sent (enqueued to provider) → Delivered (provider confirms) → Opened (user tapped) → Clicked (user interacted). Each stage publishes a tracking event to a Kafka analytics topic, flowing to ClickHouse for real-time dashboards.

Metric	Healthy	Warning	Action
Delivery rate	95%+	<90%	Check provider errors, DLQ
Open rate (push)	5–15%	<3%	Review notification content/timing
Bounce rate (email)	<2%	2–5%	Clean email list, check sender reputation
DLQ depth	0	1–10	Fix consumer bug, replay from DLQ
Latency (event to delivery)	<1 second	1–5 seconds	Scale workers, check provider latency

STEP 8

Design Checklist

Figure 9: 12-point checklist covering every aspect of the notification system design

Aspect	Design Decision	Why
Event Bus	Kafka: notification.events topic	Decouple trigger services from notification logic
Notification Service	Consumes events, routes to channel queues	Central orchestration: prefs, rate limit, dedup, template
Channel Queues	4 separate queues: push, email, SMS, in-app	Independent scaling, retry, and DLQ per channel
User Preferences	Redis: per-user per-event-type channel matrix	Sub-ms lookup. Users control their notification experience.
Rate Limiting	Redis counters: push 10/hr, email 5/day, SMS 3/day	Prevent notification fatigue and provider rate limits
Deduplication	Idempotency key in Redis SET (24h TTL)	Prevent duplicates from retries and duplicate events
Batching	5-min window: '48 people liked your photo'	90% volume reduction for high-frequency events
Quiet Hours	Delay push/SMS during DND. Email unaffected.	Respect user attention. OTP/security bypass DND.
Retries	Exponential backoff per channel. Smart error classification.	Transient errors retry. Permanent errors go to DLQ immediately.
DLQ	Per-channel DLQ. Alert on any message. Replay tooling.	Safety net. No notification permanently lost.
Device Registry	Redis SET: device_tokens per user. Invalidate stale.	Multi-device support. Clean up uninstalled apps.
Analytics	Kafka → ClickHouse: created/sent/delivered/opened funnel	Real-time dashboards. Measure notification effectiveness.

This Template Applies to Any Notification System

This design can serve as a template for systems such as WhatsApp notifications, Slack alerts, Uber ride updates, DoorDash order tracking, and other multi-channel notification platforms. The same patterns recur: Kafka event bus, central routing service, per-channel queues, user preferences, rate limiting, deduplication, batching, per-channel retry/DLQ, and delivery analytics. Master this design and adapt it to any notification interview question.

