Topic 1

Message Queues: Deep Dive

01

Consumer Lifecycle: ACK, NACK, and Redelivery

The consumer lifecycle determines how messages flow through the system and what happens when things go wrong. Understanding this cycle is critical because it governs data safety: a message must not be lost and must not be processed twice unless the consumer is idempotent. The ACK mechanism is what makes this possible.

Consumer lifecycle — ACK, NACK, visibility timeout and redelivery
Figure 1: Happy path — consumer processes and ACKs, message removed. Failure path — NACK or timeout returns message to queue for redelivery.

The ACK Contract: When a consumer receives a message, it enters an "in-flight" state. The message is invisible to other consumers during a visibility timeout (SQS: default 30s, RabbitMQ: until ACK/NACK, Kafka: until offset committed). Processing then ends in one of three outcomes:

  • ACK (success) — broker removes the message permanently.
  • NACK (failure) — broker requeues the message for redelivery to another consumer.
  • Timeout (crash/no response) — visibility timeout expires, message becomes visible again automatically.
Dangerous Pattern: ACK Before Processing

Never ACK a message before processing is complete. If you ACK and then crash during processing, the message is permanently lost — the broker has already removed it.

The correct pattern: receive → process → commit side effects → ACK. In Kafka, this means committing the offset AFTER processing, not before. When to NACK: when processing fails but the message is valid (transient error like timeout or rate limit). If the failure is permanent (invalid data), send to the Dead Letter Queue instead of NACKing endlessly.
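The receive → process → ACK loop can be sketched against an in-memory stand-in for the broker. `FakeQueue`, `consume`, and the message shape here are illustrative, not a real client API:

```python
from collections import deque

class FakeQueue:
    """In-memory stand-in for a broker queue with visibility semantics."""
    def __init__(self, messages):
        self.pending = deque(messages)   # visible to consumers
        self.in_flight = {}              # received but not yet ACKed

    def receive(self):
        if not self.pending:
            return None
        msg = self.pending.popleft()
        self.in_flight[msg["id"]] = msg  # invisible while in flight
        return msg

    def ack(self, msg_id):
        del self.in_flight[msg_id]       # success: remove permanently

    def nack(self, msg_id):
        self.pending.append(self.in_flight.pop(msg_id))  # requeue for redelivery

def consume(queue, handler):
    """Receive → process → commit side effects → ACK. Never ACK first."""
    processed = []
    while (msg := queue.receive()) is not None:
        try:
            handler(msg)                  # 1. process
            processed.append(msg["id"])   # 2. commit side effects
            queue.ack(msg["id"])          # 3. ACK only after success
        except Exception:
            queue.nack(msg["id"])         # failure: requeue, stop this worker
            break
    return processed
```

In Kafka, the same rule means committing the offset only after the record's side effects are durable.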

02

Competing Consumers: Scaling with Workers

Competing consumers — multiple workers sharing a queue for parallel processing
Figure 2: Competing consumers — 3 workers share the queue, each getting different messages. Queue auto-load-balances. Add workers to scale.

The competing consumers pattern distributes work across multiple worker instances. Each message is delivered to exactly one worker. The queue automatically load-balances: if Worker 1 is slow processing a heavy task, Workers 2 and 3 pick up the remaining messages. This is the fundamental scaling mechanism for message queues.

Auto-Scaling Based on Queue Depth: The queue depth (number of messages waiting) is the key metric for scaling consumers. In AWS, configure SQS queue depth to trigger auto-scaling: queue depth > 1,000 → add 2 workers; queue depth < 100 → remove 1 worker. This creates an elastic processing pipeline that scales with demand.
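A minimal sketch of that scaling rule; the thresholds (1,000 / 100) and step sizes (+2 / −1) come from the text, while `desired_workers` and its bounds are illustrative:

```python
def desired_workers(current, queue_depth, min_workers=1, max_workers=20):
    """Map queue depth to a target worker count (elastic pipeline rule)."""
    if queue_depth > 1_000:
        return min(current + 2, max_workers)  # backlog growing: add 2 workers
    if queue_depth < 100:
        return max(current - 1, min_workers)  # mostly drained: remove 1 worker
    return current                            # healthy band: hold steady
```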

Metric | Healthy | Warning | Critical
Queue Depth | < 100 | 100–1K | 1K+ (behind)
Consumer Lag | < 100 offsets | 100–1K | 1K+ (too slow)
Process Time | < 1s / msg | 1–10s | 10s+ (bottleneck)
DLQ Depth | 0 | 1–10 msgs | 10+ (systematic)

Interview Tip: Mention Queue Monitoring

'I set up CloudWatch alarms on SQS queue depth. If messages exceed 1,000, auto-scaling adds workers. I also monitor the DLQ — any messages there trigger a PagerDuty alert because it means our consumer has a bug or a dependency is permanently down.' This shows production readiness that most candidates miss.

Topic 2

Pub/Sub: Fan-Out in Practice

03

Kafka Consumer Groups: Pub/Sub + Work Queue in One

Kafka's consumer group model is elegantly powerful: it combines pub/sub (multiple consumer groups each get all messages) with work queue (within a group, partitions are distributed across consumers). A single Kafka topic can serve both patterns simultaneously — which is why Kafka has become the default messaging system for system design interviews.

Kafka consumer groups — three independent groups consuming the same orders topic
Figure 3: Three consumer groups (Email, Analytics, Inventory) independently consume all messages from the same "orders" topic. Within each group, partitions are distributed across consumers.

How It Works: The Order Service publishes an order.created event to the "orders" topic (3 partitions). Three consumer groups are subscribed: Email Service (2 consumers), Analytics Service (2 consumers), and Inventory Service (3 consumers). Each group independently receives ALL messages. Within the Email Service group, partitions are split across consumers for parallel processing.

Adding a New Consumer: Zero Producer Changes. When the team builds a Fraud Detection Service, they simply create a new consumer group and subscribe to the "orders" topic. The Order Service does not change at all — it has no knowledge of its consumers. The new service independently processes all past and future events from its own offset. This is the core benefit of event-driven architecture: extend the system without modifying existing services.
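The group semantics can be modeled with a few lines of pure Python — no Kafka client involved. `deliver` and `assign_partitions` are illustrative names, and the round-robin assignment stands in for Kafka's real assignment strategies:

```python
def assign_partitions(partitions, consumers):
    """Round-robin partition assignment within one consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

def deliver(messages, groups):
    """messages: list of (partition, payload); groups: {name: [consumers]}."""
    partitions = sorted({p for p, _ in messages})
    result = {}
    for group, consumers in groups.items():
        assignment = assign_partitions(partitions, consumers)
        result[group] = {
            c: [m for p, m in messages if p in parts]  # each consumer owns its partitions
            for c, parts in assignment.items()
        }
    return result
```

Every group receives every message; within a group, each partition's messages go to exactly one consumer.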

Uber's Event-Driven Platform

Uber's event bus handles 1 trillion+ events per day across thousands of Kafka topics. When a ride is completed, the trip.completed event is consumed by: billing (charge rider), driver payment (pay driver), analytics (metrics dashboard), ML models (demand prediction), surge pricing (update multiplier), and customer support (populate ride history).

One event, six independent consumers. Adding the seventh requires zero changes to the trip service.

04

Pub/Sub Anti-Patterns to Avoid

Anti-Pattern | Problem | Solution
Oversized messages | Large payloads (>1MB) slow the broker and consumers | Store data in S3/DB; put reference (URL/ID) in message
Chatty publishers | Publishing on every minor state change floods the topic | Batch events or debounce (publish once per second per entity)
No schema evolution | Changing message format breaks all consumers | Use schema registry (Avro/Protobuf); backward-compatible changes only
Ignoring consumer lag | Consumers fall behind without anyone noticing | Alert on consumer lag > 1,000 offsets per partition
Unbounded topics | Topic grows forever, disk fills up | Set retention period (e.g., 7 days) or compaction policy

Topic 3

Backoff & Retry Strategies

05

Exponential Backoff with Jitter: The Gold Standard

When a message fails to process, the consumer must decide how long to wait before retrying. Retrying immediately is dangerous: if the downstream service is overloaded, immediate retries add more load, making the problem worse. Exponential backoff solves this by doubling the wait time with each retry, giving the failing service time to recover. Adding random jitter prevents synchronized retry storms when many consumers retry at the same exponential intervals.

Exponential backoff timeline — wait doubles each retry with random jitter
Figure 4: Exponential backoff — wait time doubles each retry (1s → 2s → 4s → 8s) with random jitter added to prevent synchronized retry storms
Timeline: Retry 1 at ~1.3s → Retry 2 at ~2.7s → Retry 3 at ~4.4s → Retry 4 at ~8.9s → Retry 5 → DLQ

The Formula: wait_time = min(base_delay × 2^attempt + random(0, jitter_max), max_delay)

With base_delay=1s, jitter_max=500ms, max_delay=60s: Retry 1 waits ~1.3s, Retry 2 waits ~2.7s, Retry 3 waits ~4.4s, Retry 4 waits ~8.9s. After max_retries (typically 3–5), the message goes to the DLQ. The cap at max_delay (60s) prevents absurdly long waits.
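The formula translates directly to code; here `attempt` is assumed to be zero-based, so the first retry waits roughly `base_delay` plus jitter, matching the example numbers above:

```python
import random

def backoff(attempt, base=1.0, jitter_max=0.5, max_delay=60.0):
    """wait = min(base * 2^attempt + random(0, jitter_max), max_delay)"""
    return min(base * 2 ** attempt + random.uniform(0, jitter_max), max_delay)
```

For attempts 0–3 this yields waits in the ~1s, ~2s, ~4s, and ~8s bands; beyond that the 60s cap takes over.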

06

Four Retry Strategies Compared

Four retry strategies — immediate, fixed interval, exponential backoff, backoff with jitter
Figure 5: Four retry strategies ranked — from immediate retry (dangerous for overloaded services) to exponential backoff + jitter (production standard)
Strategy | Wait Pattern | Retry Storm Risk | When to Use
Immediate retry | 0, 0, 0, 0 | Very high | Only for idempotent, instant-recovery errors
Fixed interval | 5s, 5s, 5s, 5s | Medium | Simple systems with low concurrency
Exponential backoff | 1s, 2s, 4s, 8s | Low | Any distributed system
Backoff + jitter | 1.3s, 2.7s, 4.1s, 8.9s | Very low | Production standard (AWS SDK default)

Implementing Retries in Kafka

Kafka does not have built-in retry/delay mechanisms like SQS. The common pattern is retry topics: on failure, the consumer publishes the message to a retry topic (orders.retry.1, orders.retry.2) with increasing delays. A separate consumer reads from retry topics and re-publishes to the main topic after the delay. After max retries, the message goes to orders.dlq.

This pattern is used by Uber, LinkedIn, and Netflix in production.
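One sketch of the routing decision, using the topic names from the text. `next_topic` and the retry cap are illustrative; a real implementation would also carry the delay and attempt count in message headers:

```python
def next_topic(current_topic, base="orders", max_retries=2):
    """Decide where a failed message goes next: retry topic or DLQ."""
    if current_topic == base:
        return f"{base}.retry.1"                 # first failure
    if current_topic.startswith(f"{base}.retry."):
        n = int(current_topic.rsplit(".", 1)[1])
        if n < max_retries:
            return f"{base}.retry.{n + 1}"       # escalate with longer delay
    return f"{base}.dlq"                         # retries exhausted
```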

AWS SQS Retry Implementation

SQS has built-in retry support via the visibility timeout. When a consumer does not ACK within the timeout, SQS automatically makes the message visible again. The ApproximateReceiveCount attribute tracks retries. Configure a redrive policy: after N receives (e.g., 5), SQS automatically moves the message to the Dead Letter Queue. No application-level retry logic needed.
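The redrive policy itself is a small JSON document attached to the main queue. A sketch — the ARN below is a placeholder, and the exact serialization of the count (string vs. integer) varies by tool:

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders.dlq",
  "maxReceiveCount": 5
}
```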

Interview Tip: Specify Retry Parameters

'I implement exponential backoff with jitter. Base delay 1 second, multiplier 2, max delay 60 seconds, max retries 5. After 5 failures, the message goes to the DLQ. I classify errors: transient (timeout, 429, 503) trigger retries. Permanent (400, schema error) skip retries and go directly to DLQ to avoid wasting resources.' This level of specificity impresses interviewers.

Topic 4

Dead Letter Queue Deep Dive

07

The Complete DLQ Pipeline

A Dead Letter Queue is the safety net for your messaging system. It catches messages that cannot be processed after all retry attempts are exhausted. Without a DLQ, failed messages are either lost (at-most-once) or retried infinitely — poisoning the main queue. The DLQ provides a holding area where failed messages can be inspected, root cause identified, bug fixed, and messages replayed.

Complete DLQ pipeline — main queue, consumer, retry with backoff, DLQ, alert, replay
Figure 6: Complete DLQ pipeline — main queue → consumer → retry with exponential backoff → max retries exceeded → DLQ → alert + dashboard + fix + replay

DLQ Best Practices:

  • Naming convention: append .dlq to the main queue name (orders → orders.dlq). Easy to trace which queue a failed message came from.
  • Preserve metadata: when moving to DLQ, attach original queue name, error message, stack trace, retry count, and timestamp of last attempt. This context is essential for debugging.
  • Alert immediately: any message in the DLQ means something is broken. Set up a PagerDuty/Slack alert when DLQ depth > 0. Do not let messages rot unnoticed.
  • Build a DLQ dashboard: showing message count, age of oldest message, error distribution, and trend over time.
  • Build replay tooling: the recovery path is fix code → deploy → replay DLQ messages → verify.
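The metadata envelope from the second bullet can be sketched like this; the field names are assumptions, not a standard schema:

```python
import time
import traceback

def to_dlq_envelope(message, source_queue, error, retry_count):
    """Wrap a failed message with the context needed to debug and replay it."""
    return {
        "original_message": message,
        "source_queue": source_queue,    # e.g. "orders" for orders.dlq
        "error": repr(error),
        "stack_trace": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
        "retry_count": retry_count,
        "failed_at": time.time(),        # timestamp of the last attempt
    }
```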
An Unmonitored DLQ Is a Graveyard

A growing DLQ is a sign that something in the system is broken. Set up alerts: if the DLQ receives more than N messages per hour, page the on-call engineer. Regularly review DLQ messages — they reveal bugs, data quality issues, and integration problems that you would never otherwise discover.

08

Smart Retry: Classify Before You Retry

Retriable vs non-retriable errors — transient errors retry, permanent errors go to DLQ immediately
Figure 7: Error classification — transient errors (timeout, 429, 503) are worth retrying. Permanent errors (400, invalid schema) should go directly to DLQ.

Not all errors are worth retrying. A network timeout is transient — the next attempt will likely succeed. An invalid JSON payload is permanent — retrying 5 times just wastes resources and delays other messages. Smart consumers classify errors before deciding whether to retry or DLQ.

Error Type | Examples | Action | Rationale
Transient (retriable) | Timeout, 429 (rate limited), 503 (unavailable), connection refused | Retry with backoff | Service is temporarily unhealthy; will recover
Permanent (non-retriable) | 400 (bad request), invalid JSON, schema mismatch, business rule violation | DLQ immediately | Retrying will never succeed; fix the message or consumer
Unknown | Unexpected 500, unhandled exception | Retry 1–2 times, then DLQ | Might be transient; if not, DLQ quickly

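The classification reduces to a small lookup; the error keys here are examples and would be extended per dependency:

```python
TRANSIENT = {"timeout", "connection_refused", 429, 503}   # retry with backoff
PERMANENT = {"invalid_json", "schema_mismatch", 400}      # will never succeed

def classify(error):
    """Decide the action for a failed message: retry, DLQ, or limited retry."""
    if error in TRANSIENT:
        return "retry"
    if error in PERMANENT:
        return "dlq"
    return "retry_then_dlq"   # unknown errors: retry 1-2 times, then DLQ
```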
The Poison Message Problem

A 'poison message' is a message that crashes the consumer every time it is processed. Without smart retry logic, the consumer receives it → crashes → message is redelivered → consumer crashes again — an infinite loop. The consumer never processes other messages in the queue.

Solution: track per-message retry count. After N crashes on the same message, move it to DLQ and continue processing the queue. Never let one bad message block the entire pipeline.
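The per-message tracking can be sketched as follows (`PoisonGuard` is an illustrative name; brokers like SQS expose the same idea as a receive count):

```python
from collections import defaultdict

class PoisonGuard:
    """After max_retries failures of the SAME message, divert it to the DLQ."""
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.failures = defaultdict(int)   # msg_id -> failure count

    def on_failure(self, msg_id):
        """Return 'requeue' or 'dlq' for a message that just failed."""
        self.failures[msg_id] += 1
        if self.failures[msg_id] >= self.max_retries:
            return "dlq"       # poison message: stop blocking the pipeline
        return "requeue"
```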

Topic 5

Event-Driven Architecture

09

Services Communicate via Events, Not Calls

In an event-driven architecture (EDA), services communicate by publishing and consuming events through a central event bus (typically Kafka). A service publishes events about what happened (order.created, payment.succeeded) without knowing or caring who will consume them. This is fundamentally different from request-driven architecture where Service A directly calls Service B — creating tight coupling, synchronous blocking, and cascading failures.

Event-driven architecture — publishers and consumers communicate through Kafka with zero direct coupling
Figure 8: Event-driven architecture — 4 publishers and 4 consumers communicate through Kafka. No service has direct knowledge of another. Zero coupling.

The Three Rules of EDA:

  1. Events describe facts: "Order #42 was created" is a fact, not a command. It does not tell anyone what to do. Each consumer decides how to react independently.
  2. Publishers are ignorant: the Order Service does not know that Email, Analytics, and Notification services exist. It publishes the event and moves on. This enables independent development and deployment.
  3. Consumers are autonomous: each consumer maintains its own state, processes at its own pace, and can fail without affecting other consumers or the publisher.
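The three rules show up even in a toy in-memory bus: the publisher emits a fact and holds no reference to any consumer. `EventBus` is illustrative, not a Kafka API:

```python
from collections import defaultdict

class EventBus:
    """Minimal pub/sub bus: publishers know only the bus, never the consumers."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # event name -> handlers

    def subscribe(self, event_name, handler):
        self.subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        for handler in self.subscribers[event_name]:
            handler(payload)   # each consumer reacts independently
```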
Event Naming Conventions
Pattern | Example | When to Use
entity.action (past tense) | order.created, payment.failed | Standard for domain events (most common)
domain.entity.action | ecommerce.order.created | When multiple domains share the same bus
entity.action.version | order.created.v2 | When evolving event schemas (breaking changes)

10

The Saga Pattern: Distributed Transactions

In a monolithic system, placing an order is one database transaction: decrement inventory, create order, charge payment, send email — all or nothing (ACID). In microservices, each step is a separate service with its own database. You cannot use a single ACID transaction across services. The Saga pattern solves this by breaking the transaction into a sequence of local transactions, each publishing an event. If a step fails, compensating events undo previous steps.

Saga pattern — happy path and failure path with compensating transactions
Figure 9: Saga pattern — happy path completes all 5 steps. Failure at inventory triggers compensating actions: refund payment, cancel order.

How Saga Compensation Works (Order Example):

  1. Order Service creates the order, publishes order.created. ✅
  2. Payment Service charges the card, publishes payment.succeeded. ✅
  3. Inventory Service tries to reserve stock but FAILS (out of stock). Publishes inventory.failed. ❌
  4. Payment Service receives inventory.failed, issues a refund. Publishes payment.refunded.
  5. Order Service receives payment.refunded, cancels the order. Publishes order.cancelled.

The end state is consistent: no money taken, no inventory reserved, order cancelled. But this consistency is eventual, not immediate — there is a brief window where payment has been charged but inventory has not yet been reserved. This is the trade-off of Saga vs ACID.
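The happy and failure paths above can be simulated in a few lines. Service behavior is stubbed by a single `stock_available` flag, so this is a sketch of the event sequence, not a saga framework:

```python
def run_order_saga(stock_available):
    """Return the event sequence: happy path or compensation on inventory failure."""
    events = ["order.created", "payment.succeeded"]
    if stock_available:
        events.append("inventory.reserved")      # all local transactions succeed
        return events
    # Inventory failed: compensate previous steps in reverse order.
    events += ["inventory.failed", "payment.refunded", "order.cancelled"]
    return events
```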

Aspect | ACID Transaction | Saga Pattern
Scope | Single database | Multiple services / databases
Consistency | Immediate (strong) | Eventual (compensating actions)
Rollback | Automatic (DB ROLLBACK) | Manual (compensating events)
Isolation | Guaranteed (DB locks) | Not guaranteed (intermediate states visible)
Complexity | Low (one transaction) | Higher (orchestrate multiple steps)
Scalability | Limited (single DB) | High (independent services)
Used by | Monoliths, single-service ops | Microservices: Uber, Amazon, Stripe

Interview Tip: Use Saga for Cross-Service Transactions

'Since these are separate microservices with separate databases, I cannot use a single ACID transaction. I implement the Saga pattern: each service performs its local transaction and publishes an event. If any step fails, compensating events undo the previous steps. This gives me eventual consistency across services.' Mention that you track saga state for debugging and idempotency on compensating actions.

Event-driven vs request-driven architecture comparison
Figure 10: Request-driven — Order directly calls 3 services (tight coupling, cascade failures). Event-driven — Order publishes one event, 3 services react independently (loose coupling, isolated failures).
Aspect | Request-Driven (Sync) | Event-Driven (Async)
Coupling | Tight (producer knows consumers) | Loose (producer knows only the bus)
Adding consumers | Modify producer code + deploy | Zero producer changes
Failure isolation | Cascade (one failure breaks chain) | Isolated (failures contained per service)
Latency for caller | Sum of all service latencies | Just queue write (~5ms)
Debugging | Easy (follow HTTP call chain) | Harder (trace event flow across services)
Transactions | ACID possible (single DB) | Saga pattern (eventual consistency)
Scalability | Limited by slowest service | Each service scales independently
Best for | Simple CRUD, low service count | Complex systems, many services, high scale

Class Summary

Five Topics, One Coherent System

Message Queues (Work Queue): Consumers ACK after processing. NACK or timeout returns message to queue. Competing consumers distribute work. Scale based on queue depth. ACK only after processing is complete — never before.

Pub/Sub (Fan-Out): Kafka consumer groups give you both pub/sub (each group gets all messages) and work queue (within a group, partitions are distributed). One topic, multiple independent consumer groups, zero coupling. Adding consumers requires zero producer changes.

Backoff & Retries: Always use exponential backoff with jitter. Never retry immediately. Classify errors: transient (retry) vs permanent (DLQ immediately). Cap retries at 3–5 with max_delay of 60 seconds. AWS SDK, gRPC, and Stripe all use this pattern.

Dead Letter Queue: Safety net for failed messages. Append .dlq to queue name. Preserve error metadata. Alert on any DLQ messages. Build replay tooling. Monitor DLQ depth as a health metric. Prevent poison messages with per-message retry tracking.

Event-Driven Architecture: Services publish facts (events) to Kafka. No service knows about its consumers. Each consumer is autonomous. Saga pattern handles distributed transactions via compensating actions. Trade-off: eventual consistency + harder debugging for loose coupling + independent scaling.
