Message Queues, Pub/Sub & Event-Driven Architecture: Complete System Design Guide

What's Inside

01Consumer ACK / NACK Lifecycle

02Competing Consumers & Scaling

03Kafka Pub/Sub: Fan-Out in Practice

04Pub/Sub Anti-Patterns

05Exponential Backoff with Jitter

06Retry Strategies Compared

07Dead Letter Queue Deep Dive

08Smart Error Classification

09Event-Driven Architecture

10Saga Pattern: Distributed Transactions

Topic 1

Message Queues: Deep Dive

Consumer Lifecycle: ACK, NACK, and Redelivery

The consumer lifecycle determines how messages flow through the system and what happens when things go wrong. Understanding this cycle is critical because it governs data safety: a message must not be lost and must not be processed twice unless the consumer is idempotent. The ACK mechanism is what makes this possible.

Consumer lifecycle — ACK, NACK, visibility timeout and redelivery

Figure 1: Happy path — consumer processes and ACKs, message removed. Failure path — NACK or timeout returns message to queue for redelivery.

The ACK Contract: When a consumer receives a message, it enters an "in-flight" state. The message is invisible to other consumers during a visibility timeout (SQS: default 30s, RabbitMQ: until ACK/NACK, Kafka: until offset committed). The consumer processes the message and sends:

ACK (success) — broker removes the message permanently.
NACK (failure) — broker requeues the message for redelivery to another consumer.
Timeout (crash/no response) — visibility timeout expires, message becomes visible again automatically.

Dangerous Pattern: ACK Before Processing

Never ACK a message before processing is complete. If you ACK and then crash during processing, the message is permanently lost — the broker has already removed it.

The correct pattern: receive → process → commit side effects → ACK. In Kafka, this means committing the offset AFTER processing, not before. When to NACK: when processing fails but the message is valid (transient error like timeout or rate limit). If the failure is permanent (invalid data), send to the Dead Letter Queue instead of NACKing endlessly.

Competing Consumers: Scaling with Workers

Competing consumers — multiple workers sharing a queue for parallel processing

Figure 2: Competing consumers — 3 workers share the queue, each getting different messages. Queue auto-load-balances. Add workers to scale.

The competing consumers pattern distributes work across multiple worker instances. Each message is delivered to exactly one worker. The queue automatically load-balances: if Worker 1 is slow processing a heavy task, Workers 2 and 3 pick up the remaining messages. This is the fundamental scaling mechanism for message queues.

Auto-Scaling Based on Queue Depth: The queue depth (number of messages waiting) is the key metric for scaling consumers. In AWS, configure SQS queue depth to trigger auto-scaling: queue depth > 1,000 → add 2 workers; queue depth < 100 → remove 1 worker. This creates an elastic processing pipeline that scales with demand.

Queue Depth

< 100

100–1K

1K+ (behind)

Consumer Lag

< 100 offsets

100–1K

1K+ (too slow)

Process Time

< 1s / msg

1–10s

10s+ (bottleneck)

DLQ Depth

1–10 msgs

10+ (systematic)

Interview Tip: Mention Queue Monitoring

'I set up CloudWatch alarms on SQS queue depth. If messages exceed 1,000, auto-scaling adds workers. I also monitor the DLQ — any messages there trigger a PagerDuty alert because it means our consumer has a bug or a dependency is permanently down.' This shows production readiness that most candidates miss.

Topic 2

Pub/Sub: Fan-Out in Practice

Kafka Consumer Groups: Pub/Sub + Work Queue in One

Kafka's consumer group model is elegantly powerful: it combines pub/sub (multiple consumer groups each get all messages) with work queue (within a group, partitions are distributed across consumers). A single Kafka topic can serve both patterns simultaneously — which is why Kafka has become the default messaging system for system design interviews.

Kafka consumer groups — three independent groups consuming the same orders topic

Figure 3: Three consumer groups (Email, Analytics, Inventory) independently consume all messages from the same "orders" topic. Within each group, partitions are distributed across consumers.

How It Works: The Order Service publishes an order.created event to the "orders" topic (3 partitions). Three consumer groups are subscribed: Email Service (2 consumers), Analytics Service (2 consumers), and Inventory Service (3 consumers). Each group independently receives ALL messages. Within the Email Service group, partitions are split across consumers for parallel processing.

Adding a New Consumer: Zero Producer Changes. When the team builds a Fraud Detection Service, they simply create a new consumer group and subscribe to the "orders" topic. The Order Service does not change at all — it has no knowledge of its consumers. The new service independently processes all past and future events from its own offset. This is the core benefit of event-driven architecture: extend the system without modifying existing services.

Uber's Event-Driven Platform

Uber's event bus handles 1 trillion+ events per day across thousands of Kafka topics. When a ride is completed, the trip.completed event is consumed by: billing (charge rider), driver payment (pay driver), analytics (metrics dashboard), ML models (demand prediction), surge pricing (update multiplier), and customer support (populate ride history).

One event, six independent consumers. Adding the seventh requires zero changes to the trip service.

Pub/Sub Anti-Patterns to Avoid

Anti-Pattern	Problem	Solution
Oversized messages	Large payloads (>1MB) slow the broker and consumers	Store data in S3/DB; put reference (URL/ID) in message
Chatty publishers	Publishing on every minor state change floods the topic	Batch events or debounce (publish once per second per entity)
No schema evolution	Changing message format breaks all consumers	Use schema registry (Avro/Protobuf). Backward-compatible changes.
Ignoring consumer lag	Consumers fall behind without anyone noticing	Alert on consumer lag > 1,000 offsets per partition
Unbounded topics	Topic grows forever, disk fills up	Set retention period (7 days) or compaction policy

Topic 3

Backoff & Retry Strategies

Exponential Backoff with Jitter: The Gold Standard

When a message fails to process, the consumer must decide how long to wait before retrying. Retrying immediately is dangerous: if the downstream service is overloaded, immediate retries add more load, making the problem worse. Exponential backoff solves this by doubling the wait time with each retry, giving the failing service time to recover. Adding random jitter prevents synchronized retry storms when many consumers retry at the same exponential intervals.

Exponential backoff timeline — wait doubles each retry with random jitter

Figure 4: Exponential backoff — wait time doubles each retry (1s → 2s → 4s → 8s) with random jitter added to prevent synchronized retry storms

~1.3s

Retry 1

~2.7s

Retry 2

~4.4s

Retry 3

~8.9s

Retry 4

DLQ

Retry 5

The Formula: wait_time = min(base_delay × 2^attempt + random(0, jitter_max), max_delay)

With base_delay=1s, jitter_max=500ms, max_delay=60s: Retry 1 waits ~1.3s, Retry 2 waits ~2.7s, Retry 3 waits ~4.4s, Retry 4 waits ~8.9s. After max_retries (typically 3–5), the message goes to the DLQ. The cap at max_delay (60s) prevents absurdly long waits.

Four Retry Strategies Compared

Four retry strategies — immediate, fixed interval, exponential backoff, backoff with jitter

Figure 5: Four retry strategies ranked — from immediate retry (dangerous for overloaded services) to exponential backoff + jitter (production standard)

Strategy	Wait Pattern	Retry Storm Risk	When to Use
Immediate retry	0, 0, 0, 0	Very high	Only for idempotent, instant-recovery errors
Fixed interval	5s, 5s, 5s, 5s	Medium	Simple systems with low concurrency
Exponential backoff	1s, 2s, 4s, 8s	Low	Any distributed system
Backoff + jitter	1.3s, 2.7s, 4.1s, 8.9s	Very low	Production standard (AWS SDK default)

Implementing Retries in Kafka

Kafka does not have built-in retry/delay mechanisms like SQS. The common pattern is retry topics: on failure, the consumer publishes the message to a retry topic (orders.retry.1, orders.retry.2) with increasing delays. A separate consumer reads from retry topics and re-publishes to the main topic after the delay. After max retries, the message goes to orders.dlq.

This pattern is used by Uber, LinkedIn, and Netflix in production.

AWS SQS Retry Implementation

SQS has built-in retry support via the visibility timeout. When a consumer does not ACK within the timeout, SQS automatically makes the message visible again. The ApproximateReceiveCount attribute tracks retries. Configure a redrive policy: after N receives (e.g., 5), SQS automatically moves the message to the Dead Letter Queue. No application-level retry logic needed.

Interview Tip: Specify Retry Parameters

'I implement exponential backoff with jitter. Base delay 1 second, multiplier 2, max delay 60 seconds, max retries 5. After 5 failures, the message goes to the DLQ. I classify errors: transient (timeout, 429, 503) trigger retries. Permanent (400, schema error) skip retries and go directly to DLQ to avoid wasting resources.' This level of specificity impresses interviewers.

Topic 4

Dead Letter Queue Deep Dive

The Complete DLQ Pipeline

A Dead Letter Queue is the safety net for your messaging system. It catches messages that cannot be processed after all retry attempts are exhausted. Without a DLQ, failed messages are either lost (at-most-once) or retried infinitely — poisoning the main queue. The DLQ provides a holding area where failed messages can be inspected, root cause identified, bug fixed, and messages replayed.

Complete DLQ pipeline — main queue, consumer, retry with backoff, DLQ, alert, replay

Figure 6: Complete DLQ pipeline — main queue → consumer → retry with exponential backoff → max retries exceeded → DLQ → alert + dashboard + fix + replay

DLQ Best Practices:

Naming convention: append .dlq to the main queue name. orders → orders.dlq. Easy to trace which queue a failed message came from.
Preserve metadata: when moving to DLQ, attach original queue name, error message, stack trace, retry count, and timestamp of last attempt. This context is essential for debugging.
Alert immediately: any message in the DLQ means something is broken. Set up a PagerDuty/Slack alert when DLQ depth > 0. Do not let messages rot unnoticed.
Build a DLQ dashboard: showing message count, age of oldest message, error distribution, and trend over time.
Build replay tooling: the recovery path is fix code → deploy → replay DLQ messages → verify.

An Unmonitored DLQ Is a Graveyard

A growing DLQ is a sign that something in the system is broken. Set up alerts: if the DLQ receives more than N messages per hour, page the on-call engineer. Regularly review DLQ messages — they reveal bugs, data quality issues, and integration problems that you would never otherwise discover.

Smart Retry: Classify Before You Retry

Retriable vs non-retriable errors — transient errors retry, permanent errors go to DLQ immediately

Figure 7: Error classification — transient errors (timeout, 429, 503) are worth retrying. Permanent errors (400, invalid schema) should go directly to DLQ.

Not all errors are worth retrying. A network timeout is transient — the next attempt will likely succeed. An invalid JSON payload is permanent — retrying 5 times just wastes resources and delays other messages. Smart consumers classify errors before deciding whether to retry or DLQ.

Error Type	Examples	Action	Rationale
Transient (retriable)	Timeout, 429 (rate limited), 503 (unavailable), connection refused	Retry with backoff	Service is temporarily unhealthy; will recover
Permanent (non-retriable)	400 (bad request), invalid JSON, schema mismatch, business rule violation	DLQ immediately	Retrying will never succeed; fix the message or consumer
Unknown	Unexpected 500, unhandled exception	Retry 1–2 times, then DLQ	Might be transient; if not, DLQ quickly

The Poison Message Problem

A 'poison message' is a message that crashes the consumer every time it is processed. Without smart retry logic, the consumer receives it → crashes → message is redelivered → consumer crashes again — an infinite loop. The consumer never processes other messages in the queue.

Solution: track per-message retry count. After N crashes on the same message, move it to DLQ and continue processing the queue. Never let one bad message block the entire pipeline.

Topic 5

Event-Driven Architecture

Services Communicate via Events, Not Calls

In an event-driven architecture (EDA), services communicate by publishing and consuming events through a central event bus (typically Kafka). A service publishes events about what happened (order.created, payment.succeeded) without knowing or caring who will consume them. This is fundamentally different from request-driven architecture where Service A directly calls Service B — creating tight coupling, synchronous blocking, and cascading failures.

Event-driven architecture — publishers and consumers communicate through Kafka with zero direct coupling

Figure 8: Event-driven architecture — 4 publishers and 4 consumers communicate through Kafka. No service has direct knowledge of another. Zero coupling.

The Three Rules of EDA:

Events describe facts: "Order #42 was created" is a fact, not a command. It does not tell anyone what to do. Each consumer decides how to react independently.
Publishers are ignorant: the Order Service does not know that Email, Analytics, and Notification services exist. It publishes the event and moves on. This enables independent development and deployment.
Consumers are autonomous: each consumer maintains its own state, processes at its own pace, and can fail without affecting other consumers or the publisher.

Event Naming Conventions

Pattern	Example	When to Use
`entity.action` (past tense)	order.created, payment.failed	Standard for domain events (most common)
`domain.entity.action`	ecommerce.order.created	When multiple domains share the same bus
`entity.action.version`	order.created.v2	When evolving event schemas (breaking changes)

The Saga Pattern: Distributed Transactions

In a monolithic system, placing an order is one database transaction: decrement inventory, create order, charge payment, send email — all or nothing (ACID). In microservices, each step is a separate service with its own database. You cannot use a single ACID transaction across services. The Saga pattern solves this by breaking the transaction into a sequence of local transactions, each publishing an event. If a step fails, compensating events undo previous steps.

Saga pattern — happy path and failure path with compensating transactions

Figure 9: Saga pattern — happy path completes all 5 steps. Failure at inventory triggers compensating actions: refund payment, cancel order.

How Saga Compensation Works (Order Example):

1

Order Service creates the order, publishes order.created. ✅
2

Payment Service charges the card, publishes payment.succeeded. ✅
3

Inventory Service tries to reserve stock but FAILS (out of stock). Publishes inventory.failed. ❌
↩

Payment Service receives inventory.failed, issues a refund. Publishes payment.refunded.
↩

Order Service receives payment.refunded, cancels the order. Publishes order.cancelled.

The end state is consistent: no money taken, no inventory reserved, order cancelled. But this consistency is eventual, not immediate — there is a brief window where payment has been charged but inventory has not yet been reserved. This is the trade-off of Saga vs ACID.

Aspect	ACID Transaction	Saga Pattern
Scope	Single database	Multiple services / databases
Consistency	Immediate (strong)	Eventual (compensating actions)
Rollback	Automatic (DB ROLLBACK)	Manual (compensating events)
Isolation	Guaranteed (DB locks)	Not guaranteed (intermediate states visible)
Complexity	Low (one transaction)	Higher (orchestrate multiple steps)
Scalability	Limited (single DB)	High (independent services)
Used by	Monoliths, single-service ops	Microservices: Uber, Amazon, Stripe

Interview Tip: Use Saga for Cross-Service Transactions

'Since these are separate microservices with separate databases, I cannot use a single ACID transaction. I implement the Saga pattern: each service performs its local transaction and publishes an event. If any step fails, compensating events undo the previous steps. This gives me eventual consistency across services.' Mention that you track saga state for debugging and idempotency on compensating actions.

Event-driven vs request-driven architecture comparison

Figure 10: Request-driven — Order directly calls 3 services (tight coupling, cascade failures). Event-driven — Order publishes one event, 3 services react independently (loose coupling, isolated failures).

Aspect	Request-Driven (Sync)	Event-Driven (Async)
Coupling	Tight (producer knows consumers)	Loose (producer knows only the bus)
Adding consumers	Modify producer code + deploy	Zero producer changes
Failure isolation	Cascade (one failure breaks chain)	Isolated (failures contained per service)
Latency for caller	Sum of all service latencies	Just queue write (~5ms)
Debugging	Easy (follow HTTP call chain)	Harder (trace event flow across services)
Transactions	ACID possible (single DB)	Saga pattern (eventual consistency)
Scalability	Limited by slowest service	Each service scales independently
Best for	Simple CRUD, low service count	Complex systems, many services, high scale

Class Summary

Five Topics, One Coherent System

Message Queues (Work Queue): Consumers ACK after processing. NACK or timeout returns message to queue. Competing consumers distribute work. Scale based on queue depth. ACK only after processing is complete — never before.

Pub/Sub (Fan-Out): Kafka consumer groups give you both pub/sub (each group gets all messages) and work queue (within a group, partitions are distributed). One topic, multiple independent consumer groups, zero coupling. Adding consumers requires zero producer changes.

Backoff & Retries: Always use exponential backoff with jitter. Never retry immediately. Classify errors: transient (retry) vs permanent (DLQ immediately). Cap retries at 3–5 with max_delay of 60 seconds. AWS SDK, gRPC, and Stripe all use this pattern.

Dead Letter Queue: Safety net for failed messages. Append .dlq to queue name. Preserve error metadata. Alert on any DLQ messages. Build replay tooling. Monitor DLQ depth as a health metric. Prevent poison messages with per-message retry tracking.

Event-Driven Architecture: Services publish facts (events) to Kafka. No service knows about its consumers. Each consumer is autonomous. Saga pattern handles distributed transactions via compensating actions. Trade-off: eventual consistency + harder debugging for loose coupling + independent scaling.

Want to Land at Google, Microsoft or Apple?

Watch Pranjal Jain's free 30-min training — the exact GROW Strategy that helped 1,572+ engineers go from TCS/Infosys to top product companies with a 3–5X salary hike.

DSA + System Design roadmap 1:1 mentorship from ex-Microsoft 1,572+ placed · 4.9★ rated

Watch Free Training →

Kafka & Async Messaging Post-Class

Message Queues, Pub/Sub & Event-Driven Architecture
Complete System Design Guide