Part 1

Complete Queues Quiz — 30 Questions

30 questions covering all Class 7 messaging & queues topics — target 24+ correct (80%) to be interview-ready.
Q1 In asynchronous messaging, the producer:
  • A. Waits for the consumer to finish processing before continuing
  • B. Sends a message to a queue and continues immediately without waiting
  • C. Calls the consumer directly via HTTP
  • D. Stores the message in its own database
✓ CORRECT: B

Async messaging: the producer sends to the queue and continues immediately (fire-and-forget). The queue buffers the message. The consumer processes it later. The producer is never blocked.

Q2 A consumer should send an ACK to the broker:
  • A. Before starting to process the message
  • B. Only after the message has been fully processed and side effects committed
  • C. Immediately upon receiving the message
  • D. ACK is not needed in message queues
✓ CORRECT: B

ACK only after processing is complete and side effects are committed. If you ACK before processing and then crash, the message is lost — the broker already removed it.

Q3 If a consumer crashes before sending an ACK, the message:
  • A. Is permanently lost
  • B. Becomes visible again after the visibility timeout and is redelivered
  • C. Is automatically sent to the DLQ
  • D. Is deleted from the queue
✓ CORRECT: B

The visibility timeout (SQS) or unacked message TTL (RabbitMQ) expires. The message becomes visible again and is redelivered to another consumer. This is the at-least-once guarantee in action.

Q4 The competing consumers pattern distributes work by:
  • A. Sending each message to all consumers
  • B. Having multiple consumers pull from the same queue, each getting different messages
  • C. Having the producer decide which consumer gets each message
  • D. Running all consumers on the same server
✓ CORRECT: B

Competing consumers: multiple consumers pull from the same queue. The broker delivers each message to exactly one consumer. Work is distributed automatically (round-robin or least-busy).

Q5 When queue depth is consistently growing, you should:
  • A. Increase the message TTL
  • B. Add more consumer instances to increase processing throughput
  • C. Reduce the number of producers
  • D. Delete messages from the queue
✓ CORRECT: B

Growing queue depth means consumers cannot keep up. Add more consumer instances. More workers = more parallel processing = queue drains faster. This is the primary scaling mechanism.

Q6 The primary advantage of async messaging over sync HTTP is:
  • A. Faster response time for the end user
  • B. Decoupling: producer and consumer operate independently, queue buffers spikes
  • C. Simpler infrastructure
  • D. Stronger consistency guarantees
✓ CORRECT: B

Async messaging decouples producer from consumer: they operate independently, the queue buffers traffic spikes, and if the consumer is down, messages wait instead of causing producer failures.

Q7 A visibility timeout in SQS ensures:
  • A. Messages are encrypted during transit
  • B. A message being processed is hidden from other consumers until ACKed or timeout expires
  • C. Messages expire after a fixed time
  • D. Only one producer can send at a time
✓ CORRECT: B

Visibility timeout: while a consumer is processing a message, it is hidden from other consumers. If the consumer does not ACK within the timeout, the message becomes visible again for redelivery.

Q8 Which scenario is best suited for a message queue?
  • A. User login authentication
  • B. Real-time search query
  • C. Sending order confirmation emails after purchase
  • D. Fetching a user's profile page
✓ CORRECT: C

Sending emails after purchase is async: the user does not wait for email delivery. Login, search, and profile fetch all need immediate responses — these are synchronous operations.

Q9 In Kafka, a consumer group receives:
  • A. Only one message from the entire topic
  • B. All messages in the topic, with partitions distributed among group members
  • C. A random subset of messages
  • D. Messages only from one partition
✓ CORRECT: B

A consumer group receives ALL messages in the topic. Partitions are distributed among group members — each member handles a subset of partitions but the group as a whole gets everything.

Q10 If a Kafka topic has 6 partitions and a consumer group has 3 consumers, each consumer handles:
  • A. 6 partitions each
  • B. 2 partitions each
  • C. 1 partition each
  • D. All partitions share all consumers
✓ CORRECT: B

6 partitions / 3 consumers = 2 partitions per consumer. Kafka distributes partitions evenly within a consumer group. Adding a 4th consumer would split 6 partitions among 4: two consumers get 2 partitions, two get 1.
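The arithmetic above can be sketched as a simple round-robin assignment. This is a simplification for illustration — Kafka's real assignors (range, round-robin, sticky) differ in detail, but the even split is the same:

```python
def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Round-robin partition assignment within one consumer group.

    A sketch: each partition goes to the next consumer in turn, so
    partitions are spread as evenly as possible."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 6 partitions / 3 consumers -> 2 partitions each
print(assign_partitions(6, ["c1", "c2", "c3"]))
# -> {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# 6 partitions / 4 consumers -> two consumers get 2, two get 1
print(assign_partitions(6, ["c1", "c2", "c3", "c4"]))
```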

Q11 Kafka guarantees message ordering:
  • A. Across all partitions in a topic
  • B. Within a single partition only
  • C. Only if there is one consumer
  • D. Never — Kafka does not guarantee ordering
✓ CORRECT: B

Ordering is guaranteed within a single partition (messages appended in order, consumed in order). Across partitions, there is no ordering guarantee. Use partition keys for per-entity ordering.

Q12 To ensure all events for order #42 are processed in order, you should:
  • A. Use a single partition for the entire topic
  • B. Use order_id as the partition key so all events for order #42 go to the same partition
  • C. Sort events in the consumer before processing
  • D. Use multiple consumer groups
✓ CORRECT: B

Using order_id as partition key ensures all events for order #42 hash to the same partition. Within that partition, events are ordered (created → paid → shipped → delivered).
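A minimal sketch of key-based partitioning. The md5 hash here is for illustration only — Kafka's default partitioner actually uses murmur2 — but the hash-then-modulo idea is the same:

```python
import hashlib

NUM_PARTITIONS = 6

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a partition key to a partition number."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

# All events for order #42 carry the same key, so they land on the
# same partition and are consumed in publish order.
events = ["order.created", "order.paid", "order.shipped", "order.delivered"]
target_partitions = {partition_for("order-42") for _ in events}
assert len(target_partitions) == 1  # one partition => ordering preserved
```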

Q13 Multiple consumer groups subscribing to the same Kafka topic implements:
  • A. Point-to-point messaging
  • B. Publish-subscribe (each group gets all messages independently)
  • C. Request-reply pattern
  • D. Load balancing within one service
✓ CORRECT: B

Multiple consumer groups = pub/sub. Each group independently receives all messages. Within each group, it is a work queue. This is how Kafka combines both patterns in one system.

Q14 Kafka retains messages after consumption because:
  • A. It is a bug in Kafka
  • B. It enables replay: consumers can reset their offset to reprocess historical events
  • C. Messages cannot be deleted in Kafka
  • D. Consumers never actually read the messages
✓ CORRECT: B

Kafka retains messages (default 7 days) to enable replay. Consumers can reset their offset to reprocess historical events — invaluable for debugging, backfilling data, and rebuilding state.

Q15 Adding a new consumer group to an existing Kafka topic requires:
  • A. Modifying the producer code and redeploying
  • B. Zero changes to the producer — the new group independently consumes from the topic
  • C. Deleting and recreating the topic
  • D. Stopping all existing consumers
✓ CORRECT: B

New consumer groups subscribe independently. The producer has no knowledge of its consumers. Zero code changes to the producer. This is the core decoupling benefit of event-driven architecture.

Q16 Exponential backoff with jitter retries at intervals of approximately:
  • A. 0s, 0s, 0s, 0s (immediate)
  • B. 5s, 5s, 5s, 5s (fixed)
  • C. ~1s, ~2s, ~4s, ~8s (doubling + random offset)
  • D. 60s, 60s, 60s, 60s
✓ CORRECT: C

Exponential backoff doubles the wait: ~1s, ~2s, ~4s, ~8s. Jitter adds random offset to prevent synchronized retries. This is the gold standard used by AWS SDK, gRPC, and Stripe.

Q17 Random jitter is added to exponential backoff to prevent:
  • A. Messages from expiring
  • B. Synchronized retry storms where many consumers retry at the exact same time
  • C. The queue from growing
  • D. Duplicate messages
✓ CORRECT: B

Without jitter, 1000 consumers all retry at exactly 1s, 2s, 4s — creating synchronized storms. Jitter spreads retries over time, smoothing the load on the recovering service.
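A sketch of exponential backoff with full jitter (the variant AWS popularized); the base and cap values are illustrative:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based).

    Full jitter: pick a random delay in [0, min(cap, base * 2**attempt)].
    The random spread means 1000 consumers retrying the same failure
    fan out over time instead of hitting the recovering service in
    lockstep at exactly 1s, 2s, 4s, ..."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# The window doubles each attempt: ~1s, ~2s, ~4s, ~8s.
for attempt in range(4):
    print(f"attempt {attempt}: sleep up to {min(60.0, 2.0 ** attempt)}s")
```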

Q18 A Dead Letter Queue (DLQ) stores:
  • A. All messages in the system
  • B. Messages that failed processing after all retry attempts are exhausted
  • C. Messages waiting to be delivered
  • D. Encrypted messages only
✓ CORRECT: B

DLQ stores messages that exhausted all retries. These need manual investigation: the consumer has a bug, the data is invalid, or a dependency is permanently down.

Q19 When a message has an invalid JSON schema (permanent error), the consumer should:
  • A. Retry with exponential backoff
  • B. Send it to the DLQ immediately — retrying will never fix a schema error
  • C. Ignore the message silently
  • D. Restart the consumer
✓ CORRECT: B

Invalid JSON is a permanent error — retrying will never fix it. Send to DLQ immediately to avoid wasting retry resources. Only retry transient errors (timeout, 429, 503).

Q20 At-least-once delivery means:
  • A. Messages may be lost but never duplicated
  • B. Every message is delivered at least once; duplicates are possible
  • C. Every message is delivered exactly once
  • D. Messages are never delivered
✓ CORRECT: B

At-least-once: the broker retries until ACKed. If the consumer processes but crashes before ACKing, the message is redelivered (duplicate). Consumers must be idempotent.

Q21 To handle at-least-once delivery, consumers should be:
  • A. Stateless
  • B. Idempotent — processing the same message twice produces the same result
  • C. Single-threaded
  • D. Connected to multiple queues
✓ CORRECT: B

Idempotent consumers produce the same result when processing the same message twice. Use a deduplication check (message_id in Redis SET) to detect and skip duplicates.
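A minimal in-memory sketch of the deduplication check; a Python set stands in for the Redis SET (which would also carry a TTL):

```python
processed_ids = set()  # stand-in for a Redis SET with a TTL

def handle_message(message_id: str, payload, side_effect) -> bool:
    """Process a message at most once; return True if work was done."""
    if message_id in processed_ids:
        return False                  # duplicate redelivery: skip safely
    side_effect(payload)              # e.g. charge the card, send the email
    processed_ids.add(message_id)     # mark done only AFTER side effects
    return True

emails_sent = []
handle_message("msg-1", "order #42 confirmed", emails_sent.append)
handle_message("msg-1", "order #42 confirmed", emails_sent.append)  # redelivery
assert emails_sent == ["order #42 confirmed"]  # processed exactly once
```

Note the ordering: the id is recorded only after the side effect commits, so a crash mid-processing leads to a retry (at-least-once), never a silent loss.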

Q22 A poison message is a message that:
  • A. Contains sensitive data
  • B. Crashes the consumer every time it is processed, causing an infinite retry loop
  • C. Has been in the queue too long
  • D. Was sent by an unauthorized producer
✓ CORRECT: B

A poison message crashes the consumer on every attempt, creating an infinite retry loop. Solution: track per-message retry count. After N failures, send to DLQ and continue.
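A sketch of per-message attempt tracking. In production the count usually comes from the broker (e.g. SQS's ApproximateReceiveCount attribute or a RabbitMQ death header); the dict here is an in-memory stand-in:

```python
MAX_ATTEMPTS = 3
attempt_count = {}        # message_id -> deliveries seen so far
dead_letter_queue = []

def consume(message_id: str, payload, process) -> str:
    """Process one delivery; dead-letter the message after N failures."""
    attempt_count[message_id] = attempt_count.get(message_id, 0) + 1
    try:
        process(payload)
        return "acked"
    except Exception:
        if attempt_count[message_id] >= MAX_ATTEMPTS:
            dead_letter_queue.append((message_id, payload))
            return "dead-lettered"   # breaks the infinite retry loop
        return "requeued"            # NACK: broker redelivers later

def poison(payload):
    raise ValueError("always crashes")

outcomes = [consume("msg-7", {"bad": True}, poison) for _ in range(3)]
assert outcomes == ["requeued", "requeued", "dead-lettered"]
assert len(dead_letter_queue) == 1
```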

Q23 The best DLQ monitoring practice is:
  • A. Check the DLQ once a month
  • B. Alert immediately when any message arrives in the DLQ (it means something is broken)
  • C. Ignore the DLQ unless users complain
  • D. Automatically delete DLQ messages after 1 hour
✓ CORRECT: B

DLQ messages mean something is broken. Alert immediately (PagerDuty/Slack). An unmonitored DLQ is a data graveyard. Review regularly, fix bugs, and replay.

Q24 In event-driven architecture, services communicate by:
  • A. Calling each other's HTTP APIs directly
  • B. Publishing events to a central bus (Kafka); consumers react independently
  • C. Sharing a single database
  • D. Sending files to each other
✓ CORRECT: B

Event-driven: services publish events to a central bus (Kafka). Consumers react independently. No direct service-to-service calls. Loose coupling, independent scaling.

Q25 The key benefit of event-driven over request-driven architecture is:
  • A. Faster individual request latency
  • B. Loose coupling: adding a new consumer requires zero changes to the publisher
  • C. Simpler debugging
  • D. Stronger consistency
✓ CORRECT: B

Key benefit: loose coupling. Adding a new consumer (e.g., Fraud Detection) requires zero changes to any existing service. The new service just subscribes to relevant topics.

Q26 The Saga pattern is used for:
  • A. Caching data across services
  • B. Distributed transactions across microservices using compensating actions
  • C. Load balancing between servers
  • D. Database indexing
✓ CORRECT: B

Saga pattern: distributed transactions as a sequence of local transactions + compensating actions. If step 3 fails, compensating events undo steps 1 and 2 (eventual rollback).

Q27 If step 3 of a 5-step Saga fails, the system:
  • A. Retries step 3 forever
  • B. Executes compensating actions to undo steps 1 and 2, then cancels
  • C. Ignores the failure and continues with step 4
  • D. Rolls back the database automatically (ACID)
✓ CORRECT: B

Saga compensation: when step 3 fails, compensating actions fire to undo steps 1 and 2 (e.g., refund payment, cancel order). The end state is consistent, but eventually rather than immediately.

Q28 An event should be named as a past-tense fact because:
  • A. It is a naming convention with no practical benefit
  • B. It describes what happened (fact), not what should happen (command), enabling independent consumer decisions
  • C. Past tense is easier to type
  • D. Kafka requires past-tense event names
✓ CORRECT: B

Events are facts ('order.created' = this happened). Each consumer independently decides how to react. Commands ('send_email') couple the publisher to specific consumer behavior.

Q29 The main trade-off of event-driven vs request-driven architecture is:
  • A. Event-driven is slower
  • B. Event-driven has eventual consistency and harder debugging, but better decoupling and scalability
  • C. Request-driven scales better
  • D. There is no trade-off
✓ CORRECT: B

Event-driven trade-offs: eventual consistency (not immediate), harder debugging (trace events across services). Benefits: loose coupling, independent scaling, failure isolation.

Q30 For a system design interview, the default message queue choice should be:
  • A. Redis Pub/Sub
  • B. Apache Kafka (supports both work queue and pub/sub, replay, high throughput)
  • C. A custom-built queue
  • D. Email as a message queue
✓ CORRECT: B

Kafka is the default for system design interviews. It supports both work queue (single consumer group) and pub/sub (multiple groups), message replay, high throughput, and is industry standard.

Part 2

Design a Notification System

Notification systems are one of the most common system design interview questions because they require every concept from this class: message queues for decoupling, pub/sub for fan-out, retries with backoff for reliability, dead letter queues for error handling, and event-driven architecture for extensibility. This exercise walks through a production-grade design that handles 10 billion notifications per day across push, email, SMS, and in-app channels.

Figure 1: Notification system overview — 6 trigger event types, 4 delivery channels, 500M users, 10B notifications/day
SCALE

Scale Context

500M users × 20 notifications/day average = 10B notifications/day. Peak: 200K notifications/second (during events like New Year, flash sales). Each notification may be delivered on multiple channels (push + in-app = 2 deliveries per notification). Total deliveries: ~15B/day. Latency target: <1 second from event to device.

STEP 1

Architecture

Figure 2: Complete architecture — trigger services → Kafka → Notification Service → per-channel queues → providers

The architecture follows the event-driven pattern. Any service in the system can trigger a notification by publishing an event to Kafka (topic: notification.events). The Notification Service consumes events, checks user preferences, applies rate limiting and deduplication, renders the message using templates, and enqueues to the appropriate channel queue(s).

Each channel has its own dedicated queue and worker fleet that handles the specific provider API (APNS for iOS push, FCM for Android, SendGrid for email, Twilio for SMS, WebSocket for in-app).

Why Per-Channel Queues?

Each delivery channel has different characteristics: push is fast but rate-limited by Apple/Google, email is slow but high-volume, SMS is expensive and very limited, in-app is instant for online users but requires store-and-forward for offline users.

Separate queues allow independent scaling (more push workers during a viral event), independent retry strategies (email retries over hours, push retries over minutes), and independent DLQs — an email bounce does not affect push delivery.

STEP 2

Event Schema & Routing

Figure 3: Notification event schema — event_id, type, channels, priority, and idempotency_key for deduplication

Every notification event contains: event_id (unique identifier for tracking), type (determines the template, e.g., social.new_follower), user_id (recipient), data (dynamic content like follower_name), channels (which delivery channels to use), priority (normal/high/critical), and idempotency_key (prevents duplicate notifications).

The Notification Service uses the type to select a message template, the channels field (intersected with user preferences) to determine delivery targets, and the idempotency_key to skip already-sent notifications.
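The routing step can be sketched as a channel intersection. Field values below are illustrative, following the schema described above:

```python
# An example event shaped like the schema in Figure 3 (values made up).
event = {
    "event_id": "evt_123",
    "type": "social.new_follower",
    "user_id": 42,
    "data": {"follower_name": "Alice"},
    "channels": ["push", "in_app"],
    "priority": "normal",
    "idempotency_key": "follow_99_42",
}

def resolve_channels(event: dict, user_prefs: dict) -> list:
    """Intersect the event's requested channels with the channels the
    user has enabled for this event type, preserving request order."""
    enabled = set(user_prefs.get(event["type"], []))
    return [c for c in event["channels"] if c in enabled]

prefs = {"social.new_follower": ["push"]}   # user turned off in_app
assert resolve_channels(event, prefs) == ["push"]
```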

STEP 3

User Preferences & Quiet Hours

Figure 4: User preference matrix — per event type, per channel. Plus quiet hours for push/SMS delay.

Users control which notifications they receive on which channels. This preference matrix is stored in Redis for sub-millisecond lookups (key: user_prefs:{user_id}, value: JSON). The Notification Service intersects the event's requested channels with the user's enabled channels.

Quiet Hours: Users can set a Do Not Disturb window (e.g., 10 PM – 8 AM). During quiet hours, push notifications and SMS are delayed and batched for morning delivery. Email is unaffected (read at user's convenience). Critical notifications (security alerts, OTP codes) bypass quiet hours completely.
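A sketch of the quiet-hours gate, using the window and rules described above; note that the window wraps past midnight, so the check is an OR, not an AND:

```python
from datetime import time

QUIET_START, QUIET_END = time(22, 0), time(8, 0)   # 10 PM - 8 AM

def in_quiet_hours(now: time) -> bool:
    # The DND window crosses midnight: late evening OR early morning.
    return now >= QUIET_START or now < QUIET_END

def deliver_now(channel: str, priority: str, now: time) -> bool:
    """Decide whether to deliver immediately or hold for morning batch."""
    if priority == "critical":                 # security alerts, OTP codes
        return True                            # bypass quiet hours entirely
    if channel in ("push", "sms") and in_quiet_hours(now):
        return False                           # delay and batch for morning
    return True                                # email etc. unaffected

assert deliver_now("push", "normal", time(23, 30)) is False
assert deliver_now("email", "normal", time(23, 30)) is True
assert deliver_now("sms", "critical", time(3, 0)) is True
```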

STEP 4

Rate Limiting & Deduplication

Figure 5: Rate limits per channel, idempotency-based deduplication, and notification batching/aggregation

Rate Limiting: Each channel has per-user rate limits — push 10/hour, email 5/day, SMS 3/day, in-app 50/hour. Implemented with Redis counters: INCR rate:{user_id}:{channel}:{window} with TTL matching the window. When the limit is exceeded, the notification is dropped (low priority) or queued for the next window (high priority).

Deduplication: Every event has an idempotency_key. Before sending, the Notification Service checks Redis: SISMEMBER sent_notifs:{user_id} "follow_99_42". If the key exists, the notification was already sent (skip). If not, send and SADD with a 24-hour TTL. This prevents duplicates from Kafka retries or duplicate events from upstream services.
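The fixed-window rate limit can be sketched in a few lines. A dict stands in for Redis here; in production you would use INCR with a TTL exactly as described above:

```python
import time
from collections import defaultdict

LIMITS  = {"push": 10, "email": 5, "sms": 3}            # max per window
WINDOWS = {"push": 3600, "email": 86400, "sms": 86400}  # window length (s)

counters = defaultdict(int)   # stand-in for Redis INCR + EXPIRE

def allow(user_id: int, channel: str, now: float = None) -> bool:
    """Fixed-window counter, keyed like rate:{user}:{channel}:{window}."""
    now = time.time() if now is None else now
    window_id = int(now // WINDOWS[channel])
    key = (user_id, channel, window_id)
    counters[key] += 1
    return counters[key] <= LIMITS[channel]

# The 4th SMS inside one day-window is rejected...
assert [allow(1, "sms", now=0) for _ in range(4)] == [True, True, True, False]
# ...but the next window starts with a fresh counter.
assert allow(1, "sms", now=90000) is True
```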

Interview Tip: Always Mention Rate Limiting + Dedup

These two features separate a good notification design from a great one: "I implement per-user per-channel rate limits in Redis to prevent notification spam. I use idempotency keys for deduplication to prevent duplicates from Kafka retries. I batch high-frequency events into aggregated notifications ('48 people liked your photo') using a 5-minute buffer window." Interviewers love this level of detail.

STEP 5

Push Notification Delivery

Figure 6: Push delivery flow — Notification Service → Device Registry → Platform Router → APNS/FCM → User's phone

Push notification delivery requires a device token registry. When a user installs the app, the device registers its push token (APNS token for iOS, FCM token for Android) with the backend. These tokens are stored in Redis: device_tokens:{user_id} = set of tokens.

Users with multiple devices (phone + tablet) have multiple tokens. The push worker looks up all tokens for the user, routes to the correct platform (iOS → Apple APNS, Android → Google FCM), and sends. If APNS/FCM returns an "invalid token" error, the token is removed from the registry (user uninstalled the app or the token was refreshed).
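The fan-out-and-cleanup loop can be sketched as below. `provider_send` is a placeholder for the real APNS/FCM client call, and the token values are made up:

```python
# Stand-in for the Redis registry: device_tokens:{user_id} -> set of tokens.
device_tokens = {42: {("apns", "tok_iphone"), ("fcm", "tok_pixel")}}

def send_push(user_id: int, payload: dict, provider_send) -> None:
    """Fan out to every registered device; prune tokens the provider
    rejects as invalid (app uninstalled or token rotated)."""
    for platform, token in list(device_tokens.get(user_id, ())):
        ok = provider_send(platform, token, payload)   # True on success
        if not ok:
            device_tokens[user_id].discard((platform, token))

def fake_provider(platform, token, payload):
    return token != "tok_iphone"    # simulate an "invalid token" response

send_push(42, {"title": "New follower"}, fake_provider)
assert ("apns", "tok_iphone") not in device_tokens[42]   # stale token pruned
assert ("fcm", "tok_pixel") in device_tokens[42]         # healthy token kept
```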

STEP 6

Per-Channel Retry & DLQ Strategy

Figure 7: Each channel has different retry strategies based on its provider's error types and characteristics

Each delivery channel has its own retry strategy because failure modes differ drastically. Push retries are fast because APNS/FCM responses are immediate. Email retries are slow because email delivery is inherently delayed. In-app uses store-and-forward: if the user is offline, the notification is stored in a Redis list and delivered when the user's WebSocket reconnects.

Channel | Max Retries | Backoff                | Permanent Errors → DLQ           | Provider
Push    | 3           | 1s, 5s, 15s            | Invalid token (remove + DLQ)     | APNS (iOS), FCM (Android)
Email   | 5           | 30s, 60s, 5m, 30m, 1h  | Hard bounce (mark invalid + DLQ) | SendGrid, AWS SES
SMS     | 3           | 5s, 30s, 5m            | Invalid number (DLQ)             | Twilio, AWS SNS
In-App  | 2           | Store + forward        | N/A (persisted in Redis)         | WebSocket, SSE
STEP 7

Analytics & Monitoring

Figure 8: Notification funnel — track created → filtered → sent → delivered → opened/clicked at each stage

Every notification passes through a tracking funnel: Created (event received) → Filtered (rate-limited or preference-blocked) → Sent (enqueued to provider) → Delivered (provider confirms) → Opened (user tapped) → Clicked (user interacted). Each stage publishes a tracking event to a Kafka analytics topic, flowing to ClickHouse for real-time dashboards.

Metric                      | Healthy   | Warning     | Action
Delivery rate               | 95%+      | <90%        | Check provider errors, DLQ
Open rate (push)            | 5–15%     | <3%         | Review notification content/timing
Bounce rate (email)         | <2%       | 2–5%        | Clean email list, check sender reputation
DLQ depth                   | 0         | 1–10        | Fix consumer bug, replay from DLQ
Latency (event to delivery) | <1 second | 1–5 seconds | Scale workers, check provider latency
STEP 8

Design Checklist

Figure 9: 12-point checklist covering every aspect of the notification system design
Aspect               | Design Decision                                              | Why
Event Bus            | Kafka: notification.events topic                             | Decouple trigger services from notification logic
Notification Service | Consumes events, routes to channel queues                    | Central orchestration: prefs, rate limit, dedup, template
Channel Queues       | 4 separate queues: push, email, SMS, in-app                  | Independent scaling, retry, and DLQ per channel
User Preferences     | Redis: per-user per-event-type channel matrix                | Sub-ms lookup. Users control their notification experience.
Rate Limiting        | Redis counters: push 10/hr, email 5/day, SMS 3/day           | Prevent notification fatigue and provider rate limits
Deduplication        | Idempotency key in Redis SET (24h TTL)                       | Prevent duplicates from retries and duplicate events
Batching             | 5-min window: '48 people liked your photo'                   | 90% volume reduction for high-frequency events
Quiet Hours          | Delay push/SMS during DND. Email unaffected.                 | Respect user attention. OTP/security bypass DND.
Retries              | Exponential backoff per channel. Smart error classification. | Transient errors retry. Permanent errors go to DLQ immediately.
DLQ                  | Per-channel DLQ. Alert on any message. Replay tooling.       | Safety net. No notification permanently lost.
Device Registry      | Redis SET: device_tokens per user. Invalidate stale.         | Multi-device support. Clean up uninstalled apps.
Analytics            | Kafka → ClickHouse: created/sent/delivered/opened funnel     | Real-time dashboards. Measure notification effectiveness.
This Template Applies to Any Notification System

This design is the template for WhatsApp notifications, Slack alerts, Uber ride updates, DoorDash order tracking, and any multi-channel notification platform. The patterns are identical: Kafka event bus, central routing service, per-channel queues, user preferences, rate limiting, deduplication, batching, per-channel retry/DLQ, and delivery analytics. Master this design and apply it to any notification interview question.
