OVERVIEW

The Four Hardest Problems in Chat Design

Designing a chat system at WhatsApp's scale is one of the most common system design interview questions, and one where candidates often lose points. Requirements and estimation were covered in Class 11 Pre-Class. Today we go deep into the four problems that interviewers specifically probe: Real-Time Messaging Architecture, Message Ordering, Offline Sync, and Multi-Region Replication.

  • Concurrent Users: 200M (~4,000 WebSocket gateway servers needed)
  • Message Rate: 1.2M/s (Cassandra writes at LOCAL_QUORUM)
  • Connections/Server: 50K (each WebSocket gateway server)
  • Cross-Region Latency: ~300ms (India → US end-to-end message delivery)
Deep Dive 1

Real-Time Messaging Architecture

TRANSPORT

WebSocket vs Long Polling: Why WebSocket Wins

Real-time messaging requires the server to push messages to the client the instant they arrive; the client cannot wait for its next poll. Three approaches exist: short polling (wasteful), long polling (acceptable), and WebSocket (optimal). Slack, Discord, WhatsApp Web, and many other major chat systems use WebSocket because it provides true bidirectional, full-duplex communication over a single persistent TCP connection.

Figure 1: Long Polling — HTTP request held open, reconnect after each response — vs WebSocket — persistent bidirectional TCP connection
Why Not SSE (Server-Sent Events)?

Server-Sent Events are server-to-client only. Chat requires client-to-server messaging too (sending messages). SSE would require a separate HTTP POST channel for sends, creating two connections per user. WebSocket handles both directions on one connection. Never suggest SSE for a chat system in an interview.

GATEWAY

WebSocket Gateway: Internal Architecture

The WebSocket Gateway is the most critical component in a chat system. Each gateway server manages approximately 50,000 concurrent WebSocket connections. With 200 million concurrent users, you need approximately 4,000 gateway servers. Each server runs four internal components working in concert.
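The server count follows directly from the two estimates above. A minimal sketch of that arithmetic; the `headroom` parameter is an assumption added for illustration and is not part of the original estimate:

```python
import math

CONCURRENT_USERS = 200_000_000     # from the estimate above
CONNECTIONS_PER_SERVER = 50_000    # from the estimate above


def gateways_needed(users: int, per_server: int, headroom: float = 0.0) -> int:
    """Servers required to hold `users` connections.

    `headroom` is a hypothetical spare-capacity fraction (0.25 = 25% extra).
    """
    return math.ceil(users * (1.0 + headroom) / per_server)


print(gateways_needed(CONCURRENT_USERS, CONNECTIONS_PER_SERVER))        # 4000
print(gateways_needed(CONCURRENT_USERS, CONNECTIONS_PER_SERVER, 0.25))  # 5000
```

In an interview, stating the division out loud (200M / 50K = 4,000) is usually enough; the headroom variant only shows how the number moves if you plan for failures or deploys.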

Figure 2: WebSocket Gateway internals — Connection Manager, Heartbeat Monitor (25s ping), Session Registry updater, Message Router
| Component | Responsibility | Key Detail |
|---|---|---|
| Connection Manager | Tracks all active WebSocket connections in memory | In-memory map: user_id → socket_fd. ~50K entries per server. |
| Heartbeat Monitor | Detects disconnected clients | Sends ping frame every 25 seconds. No pong = client dropped = clean up session. |
| Session Registry Updater | Writes user → server mapping to Redis | On connect: SET session:{user_id} gateway-42 EX 300. On disconnect: DEL. |
| Message Router | Delivers messages locally or forwards cross-server | If recipient on same server: direct delivery. Else: gRPC to correct gateway. |
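
The interplay between the Heartbeat Monitor and the Session Registry Updater can be sketched with an in-memory stand-in for Redis. This is an illustration only: `SessionRegistry` and its method names are hypothetical, and a real gateway would issue the Redis commands shown in the comments instead of touching a dict.

```python
import time
from typing import Callable, Dict, Optional, Tuple

SESSION_TTL_S = 300    # 5-minute TTL (EX 300) from the table above
PING_INTERVAL_S = 25   # heartbeat interval from the table above


class SessionRegistry:
    """In-memory stand-in for the Redis session store."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # user_id -> (gateway_id, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def on_connect(self, user_id: str, gateway_id: str) -> None:
        # Mirrors: SET session:{user_id} <gateway_id> EX 300
        self._entries[user_id] = (gateway_id, self._clock() + SESSION_TTL_S)

    def on_pong(self, user_id: str) -> None:
        # A heartbeat reply refreshes the TTL so a live session does not expire.
        entry = self._entries.get(user_id)
        if entry is not None:
            self._entries[user_id] = (entry[0], self._clock() + SESSION_TTL_S)

    def on_disconnect(self, user_id: str) -> None:
        # Mirrors: DEL session:{user_id}
        self._entries.pop(user_id, None)

    def lookup(self, user_id: str) -> Optional[str]:
        entry = self._entries.get(user_id)
        if entry is None or entry[1] <= self._clock():
            self._entries.pop(user_id, None)  # expire lazily, as a TTL would
            return None
        return entry[0]


# A fake clock keeps the example deterministic.
now = [0.0]
registry = SessionRegistry(clock=lambda: now[0])
registry.on_connect("user_b", "gateway-789")
now[0] += PING_INTERVAL_S
registry.on_pong("user_b")
print(registry.lookup("user_b"))   # gateway-789
now[0] += SESSION_TTL_S + 1        # no pongs for longer than the TTL
print(registry.lookup("user_b"))   # None
```

The TTL acts as a safety net: if a gateway crashes before it can run the DEL, the stale mapping still disappears within five minutes.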
Interview Narration: Gateway Architecture

"Each WebSocket gateway manages 50K connections. When a connection arrives, the Connection Manager maps the user_id to the socket file descriptor in memory. Simultaneously, the Session Registry writes user_id → gateway_server_id to Redis with a 5-minute TTL (refreshed by heartbeats). The Heartbeat Monitor pings every 25 seconds — no pong means the connection is dead, we clean up Redis, and the client will reconnect. The Message Router checks if the recipient is local (direct delivery) or remote (gRPC to their gateway)."

ROUTING

Cross-Server Message Routing: The Key Challenge

The central challenge of distributed chat: User A is connected to Gateway Server 42. User B is connected to Gateway Server 789. How does a message from A reach B? The answer is the Redis session store: a mapping of user_id → gateway_server_id. The Chat Service looks up B's gateway, then routes the message directly to Gateway 789 via gRPC (~1ms) or through an internal Kafka topic (~5ms).

Figure 3: Cross-server routing — Redis session store maps user_id → gateway_server_id. Chat Service routes to the correct gateway via gRPC (~1ms)
The Key Insight Interviewers Look For

The Redis session store (user_id → gateway_server_id) is the most important component in a distributed chat system. Without it, you'd need to broadcast messages to all gateways — O(n) instead of O(1). Every strong-hire answer mentions this lookup by name. If an interviewer asks "how does the message get to the right server?", this is the answer.
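A minimal sketch of the routing decision this lookup enables, using a plain dict in place of the Redis session store. The function name `route` and the action labels are assumptions chosen for illustration, not names from any real system.

```python
from typing import Dict, Optional, Tuple


def route(sender_gateway: str,
          recipient_id: str,
          sessions: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Decide what to do with one message.

    `sessions` plays the role of the Redis store: user_id -> gateway_server_id.
    Returns (action, target_gateway).
    """
    target = sessions.get(recipient_id)      # single O(1) lookup
    if target is None:
        # No session: recipient is offline, so the message waits for sync.
        return ("store_for_offline_sync", None)
    if target == sender_gateway:
        return ("deliver_locally", target)
    return ("forward_grpc", target)


sessions = {"user_a": "gateway-42", "user_b": "gateway-789"}
print(route("gateway-42", "user_b", sessions))  # ('forward_grpc', 'gateway-789')
print(route("gateway-42", "user_a", sessions))  # ('deliver_locally', 'gateway-42')
print(route("gateway-42", "user_c", sessions))  # ('store_for_offline_sync', None)
```

Without the session map, the only safe option would be to send every message to every gateway and let each one check its local connections, which is the O(n) broadcast the text warns against.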

Deep Dive 2

Message Ordering — The Problem and Its Solution

PROBLEM

Why Ordering Is Harder Than It Looks

Message ordering seems trivial but is one of the hardest problems in distributed messaging. Due to network jitter, packet retransmission, or load balancer routing, "How are you?" might arrive at the server before "Hello." If the system stores and delivers messages in arrival order, User B sees them out of order.

Figure 4: Network delay causes messages to arrive out of order. Server-assigned TIMEUUID restores correct ordering at storage time.
SOLUTION

Server-Assigned TIMEUUID: The Correct Fix

The server (not the client) assigns a time-ordered identifier to each message as it arrives. In Cassandra, this is a TIMEUUID: a version 1 UUID that encodes the server's timestamp at 100-nanosecond resolution, plus clock-sequence and node bits that break ties. The TIMEUUID is used as the clustering key in Cassandra, which stores messages sorted by this key within each partition. Every recipient then reads messages in that stored order, so all participants see the same sequence regardless of how individual deliveries were delayed.

Why Server-Assigned, Not Client-Assigned?

Client clocks are unreliable: User A's phone might be 5 minutes ahead. User B's phone might be 3 minutes behind. Client-assigned timestamps would produce ordering chaos. Server-assigned TIMEUUIDs use the server's NTP-synchronized clock, providing a consistent time reference within each conversation. This is a non-negotiable design choice.
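To make the identifier concrete, here is a small sketch that builds version 1 UUIDs with Python's standard `uuid` module and sorts them chronologically. The helper names `make_timeuuid` and `chronological_key` are hypothetical, the timestamps are arbitrary, and the tie-break order used here is an illustration rather than Cassandra's exact comparator.

```python
import random
import uuid


def make_timeuuid(ts_100ns: int, clock_seq: int, node: int) -> uuid.UUID:
    """Build a version 1 UUID from an explicit 60-bit timestamp (100 ns units)."""
    fields = (
        ts_100ns & 0xFFFFFFFF,        # time_low
        (ts_100ns >> 32) & 0xFFFF,    # time_mid
        (ts_100ns >> 48) & 0x0FFF,    # time_hi (version bits are set by UUID())
        (clock_seq >> 8) & 0x3F,      # clock_seq_hi (variant bits set by UUID())
        clock_seq & 0xFF,             # clock_seq_low
        node & 0xFFFFFFFFFFFF,        # 48-bit node id
    )
    return uuid.UUID(fields=fields, version=1)


def chronological_key(u: uuid.UUID):
    # Timestamp first; the remaining bits only break ties.
    return (u.time, u.clock_seq, u.node)


base = 139_000_000_000_000_000   # arbitrary timestamp in 100 ns units
stored = [
    ("Hello", make_timeuuid(base, 1, 0xA)),
    ("How are you?", make_timeuuid(base + 10, 1, 0xA)),
    ("Same tick, other node", make_timeuuid(base + 10, 1, 0xB)),
]

shuffled = stored[:]
random.shuffle(shuffled)          # simulate out-of-order delivery
ordered = sorted(shuffled, key=lambda m: chronological_key(m[1]))
print([text for text, _ in ordered])
# ['Hello', 'How are you?', 'Same tick, other node']
```

Note that sorting version 1 UUIDs as plain strings does not give chronological order, because the low 32 bits of the timestamp come first in the text form. A time-aware comparison, like the one a TIMEUUID clustering column provides, is what makes the ordering work.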

Interview Narration: Message Ordering

"When a message arrives at the Chat Service, I assign it a TIMEUUID. This is a Cassandra-native version 1 UUID type that encodes the current server timestamp at 100-nanosecond resolution, plus clock-sequence and node bits that break ties within the same tick. I store this as the clustering key in Cassandra's messages table, partitioned by conversation_id. Cassandra physically sorts rows by clustering key, so all messages in a conversation are stored in chronological order automatically. I never trust client-side timestamps, because client clocks drift and can be manipulated."

SCOPE

Per-Conversation Ordering vs Global Ordering

Figure 5: Per-conversation ordering via Cassandra partition key — all WhatsApp needs. Global ordering would be a bottleneck at 1.2M msgs/sec.

WhatsApp needs only per-conversation ordering — users compare message order within a single chat, never across different conversations. Global ordering would require a single coordination point across all 1.2 million messages/second — a fatal bottleneck. Cassandra's partition key (conversation_id) naturally provides per-conversation ordering with no coordination overhead.

| Approach | Mechanism | Scalability | Use When |
|---|---|---|---|
| Per-conversation | Cassandra partition: conversation_id, clustering: TIMEUUID | Horizontal: each partition independent | Chat (WhatsApp, Slack, Discord) ✓ |
| Global ordering | Single Kafka partition or Zookeeper counter | Single bottleneck: max ~500K/sec | Financial ledgers, audit logs |
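
The per-conversation row can be modeled in a few lines. This is a toy in-memory sketch, not Cassandra: `MessageStore` is a hypothetical class, and integer ids stand in for TIMEUUIDs. It shows that each partition is sorted on its own, with no shared counter across conversations.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple


class MessageStore:
    """Toy model: partition key = conversation_id, clustering key = message_id."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Tuple[int, str]]] = defaultdict(list)

    def append(self, conversation_id: str, message_id: int, body: str) -> None:
        # insort keeps each partition sorted by its clustering key,
        # independently of every other partition.
        bisect.insort(self._partitions[conversation_id], (message_id, body))

    def read(self, conversation_id: str) -> List[str]:
        return [body for _, body in self._partitions[conversation_id]]


store = MessageStore()
store.append("conv-1", 20, "How are you?")   # written first, but newer id
store.append("conv-2", 15, "Lunch?")
store.append("conv-1", 10, "Hello")          # written later, older id

print(store.read("conv-1"))  # ['Hello', 'How are you?']
print(store.read("conv-2"))  # ['Lunch?']
```

Because no operation touches more than one partition, conversations can be spread across nodes freely, which is the horizontal scalability the table refers to.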
Deep Dive 3

Offline Sync — Retrieving Missed Messages on Reconnect

RECONNECT

The Offline Sync Flow: 6 Steps

While offline, messages continue arriving and are stored in Cassandra. When the user reconnects, the system must efficiently synchronize all missed messages without re-sending messages the user has already seen. The key mechanism is per-conversation cursors — the client tracks the last message_id it has seen in each conversation.

Figure 6: Offline sync timeline — user last saw sequence 42, 15 messages arrived while offline, reconnect fetches only the delta
  1. User B opens the app. The client establishes a WebSocket connection to the nearest gateway server.
  2. The client sends a sync request: {user_id, last_sync_timestamp, per_conversation_cursors}. Each cursor is the message_id of the last message B has seen in each conversation.
  3. The server computes the delta: SELECT * FROM messages WHERE conversation_id = ? AND message_id > last_seen_cursor
  4. The server first pushes unread counts per conversation (instant badge numbers), then streams actual messages in batches of 50, most recent first.
  5. The client stores messages in its local SQLite database, updates cursors, and sends delivery acknowledgments.
  6. If the user typed messages while offline (outbox), the client sends them now. The server deduplicates by the client-assigned message UUID, which is separate from the server-assigned message_id used for cursors.
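
Steps 2 and 3 above can be sketched as a pure function over in-memory data. Integer ids stand in for TIMEUUID message ids, and `compute_delta` is a hypothetical name; a real server would run the per-conversation query from step 3 instead of filtering a list.

```python
from typing import Dict, List, Tuple

Message = Tuple[int, str]   # (message_id, body); ints stand in for TIMEUUIDs


def compute_delta(history: Dict[str, List[Message]],
                  cursors: Dict[str, int]) -> Dict[str, List[Message]]:
    """Return, per conversation, the messages with message_id > the client's cursor."""
    delta: Dict[str, List[Message]] = {}
    for conv_id, messages in history.items():
        last_seen = cursors.get(conv_id, -1)   # no cursor: client has nothing yet
        missed = [m for m in messages if m[0] > last_seen]
        if missed:
            delta[conv_id] = missed
    return delta


# The timeline in Figure 6: last seen sequence 42, 15 messages arrived offline.
history = {"conv-1": [(i, f"msg {i}") for i in range(1, 58)]}
cursors = {"conv-1": 42}

delta = compute_delta(history, cursors)
print(len(delta["conv-1"]))     # 15
print(delta["conv-1"][0][0])    # 43
```

The cursor makes the sync idempotent: if the connection drops midway, the client resends its cursors and the server recomputes the remaining delta.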
STATE

Client-Server State Synchronization

Figure 7: Complete state synchronization model — client SQLite + cursors + outbox syncing with server Cassandra + Redis
| State | Location | Details |
|---|---|---|
| Message history | Client: SQLite; Server: Cassandra | Cassandra is source of truth. Client is a cache of recent messages. |
| Per-conversation cursors | Client: local storage | Last seen message_id per conversation. Drives delta query on reconnect. |
| Pending outbox | Client: SQLite queue | Messages typed while offline. Sent on reconnect. Server deduplicates by client UUID. |
| Unread counts | Server: Redis | HINCRBY unread:{user_id} {conv_id} 1 on every message arrival. Pushed first on sync. |
| Session state | Server: Redis | user_id → gateway_server_id. Cleared on disconnect, set on connect. |
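
The outbox row depends on the server treating retries as no-ops. A minimal sketch of that deduplication, assuming the client attaches a UUID to each outgoing message; `ChatServer` and `accept` are hypothetical names, and a dict stands in for whatever durable store a real server would use.

```python
import uuid
from typing import Dict, List


class ChatServer:
    """Accepts each client message at most once, keyed by the client's UUID."""

    def __init__(self) -> None:
        self._seen: Dict[str, int] = {}   # client_msg_id -> server sequence
        self.log: List[str] = []          # stored message bodies, in order

    def accept(self, client_msg_id: str, body: str) -> int:
        # A retry with a known id returns the original sequence and stores nothing.
        if client_msg_id in self._seen:
            return self._seen[client_msg_id]
        self.log.append(body)
        seq = len(self.log)
        self._seen[client_msg_id] = seq
        return seq


server = ChatServer()
outbox = [(str(uuid.uuid4()), "typed on the train"),
          (str(uuid.uuid4()), "typed in the tunnel")]

first_try = [server.accept(msg_id, body) for msg_id, body in outbox]
# The acks were lost, so the client resends the whole outbox.
second_try = [server.accept(msg_id, body) for msg_id, body in outbox]

print(first_try == second_try)  # True
print(len(server.log))          # 2
```

The client can therefore keep a message in its outbox until it sees an acknowledgment, without risking duplicates in the conversation.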
Why Batches of 50 (Most Recent First)?

If a user was offline for 3 days with 2,000 missed messages across 40 conversations, dumping all messages at once would block the UI. Streaming in batches of 50, most-recent-first, means the user sees the newest messages immediately while older messages load in the background. The unread count badge is pushed before any messages, so the UI is responsive from the first render.
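A sketch of the "counts first, then newest-first batches" idea, using integer ids as stand-ins for message ids. The function names are hypothetical, and the sketch treats the missed messages as one flat list; whether batching is per conversation or across conversations is not specified above.

```python
from typing import Dict, Iterator, List

BATCH_SIZE = 50   # batch size from the text


def unread_counts(missed: Dict[str, List[int]]) -> Dict[str, int]:
    """Badge numbers, cheap to compute and sent before any message bodies."""
    return {conv: len(ids) for conv, ids in missed.items()}


def batches_most_recent_first(message_ids: List[int],
                              size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield batches of at most `size` ids, newest ids first."""
    newest_first = sorted(message_ids, reverse=True)
    for start in range(0, len(newest_first), size):
        yield newest_first[start:start + size]


print(unread_counts({"conv-1": [43, 44, 45], "conv-2": [7]}))
# {'conv-1': 3, 'conv-2': 1}

# 2,000 missed messages, as in the example above.
batches = list(batches_most_recent_first(list(range(1, 2001))))
print(len(batches))                        # 40
print(batches[0][0], batches[0][-1])       # 2000 1951
```

The first batch alone is enough to render the bottom of the chat screen, which is why the ordering matters more than the total transfer time.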

Deep Dive 4

Multi-Region Replication — Global Architecture

ARCHITECTURE

Three-Region Deployment

WhatsApp serves more than 2 billion users worldwide. A multi-region deployment places infrastructure in three or more regions. Delivery within a region takes ~100ms; a cross-region message adds ~200ms for the inter-region hop, for ~300ms end to end. Each region contains a complete, independent stack, so normal operations have no single-region dependency.

Figure 8: Three-region architecture — India, US East, Europe. Full stack per region. Async Cassandra replication + Kafka MirrorMaker between regions.
  • Region 1, India (Primary): WebSocket Gateway cluster, Chat Service, Cassandra nodes, Redis, Kafka. Serves India + South Asia.
  • Region 2, US East: Complete independent stack. Replicates from India async. Serves Americas. LOCAL_QUORUM consistency.
  • Region 3, Europe: GDPR-compliant stack. European user data stored locally. Async replication for delivery routing.
FLOW

Cross-Region Message Flow: 8 Steps, ~300ms

Figure 9: Cross-region message — User A (India) sends to User B (US). 8 steps, ~300ms total end-to-end latency.
  1. Message stored in India's Cassandra using LOCAL_QUORUM (~5ms)
  2. Chat Service checks India Redis — User B not connected locally (no session found)
  3. Message published to Kafka cross-region topic in India
  4. Kafka MirrorMaker replicates to US Kafka cluster (~100–200ms)
  5. US Chat Service consumes message from Kafka
  6. US Chat Service looks up User B's session in US Redis → finds Gateway 789
  7. Message routed via gRPC to Gateway 789 (~1ms)
  8. Gateway 789 delivers to User B's WebSocket connection. Total: ~300ms
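
Only three of the eight steps carry an explicit latency. A short sketch of the budget arithmetic, using only those stated figures and the ~300ms total; the remainder is whatever is left for the unlabelled steps (client network legs, Kafka consume, Redis lookups), and no per-step split of that remainder is implied.

```python
from typing import Dict, Tuple

# (low_ms, high_ms) for the steps that state a latency above.
KNOWN_MS: Dict[str, Tuple[int, int]] = {
    "cassandra_local_quorum_write": (5, 5),
    "kafka_mirrormaker_replication": (100, 200),
    "grpc_to_gateway": (1, 1),
}
TOTAL_MS = 300   # approximate end-to-end figure from the text


def budget(known: Dict[str, Tuple[int, int]], total: int):
    """Return ((known_low, known_high), (remaining_low, remaining_high))."""
    low = sum(lo for lo, _ in known.values())
    high = sum(hi for _, hi in known.values())
    return (low, high), (total - high, total - low)


known_range, remaining_range = budget(KNOWN_MS, TOTAL_MS)
print(known_range)       # (106, 206)
print(remaining_range)   # (94, 194)
```

The replication hop dominates the known portion, which is why the design keeps every other step local to one region.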
| Design Decision | Choice | Reason | When to Switch |
|---|---|---|---|
| Read consistency | LOCAL_QUORUM | All reads are local. No cross-region read latency needed for chat history. | Never: cross-region reads add 200ms latency with no benefit |
| Write replication | Local + async replication | Cassandra async replication for durability. Kafka for real-time cross-region delivery. | Sync if financial data requiring global consistency |
| Conflict resolution | Last-Write-Wins (TIMEUUID) | Safe because chat messages are append-only, with no updates. | CRDT if mutable data (e.g., presence, shared docs) |
| Session store | Per-region, NOT replicated | Avoids global session replication complexity. User is online in one region at a time. | Never: cross-region session replication adds 200ms per connect |
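
The conflict-resolution row rests on messages being append-only. A toy sketch of why that makes asynchronous replication safe, assuming each message is an immutable row keyed by its id, so the same key always carries the same body. Integer ids stand in for TIMEUUIDs and `merge` is a hypothetical helper, not Cassandra's reconciliation code.

```python
from typing import Dict

Replica = Dict[int, str]   # message_id (stand-in for TIMEUUID) -> body


def merge(a: Replica, b: Replica) -> Replica:
    """Union of two replicas of one conversation.

    Assumes rows are never updated, so overlapping keys hold identical bodies
    and it does not matter which side wins.
    """
    merged = dict(a)
    merged.update(b)
    return merged


india = {1: "Hello", 2: "How are you?"}
us_east = {2: "How are you?", 3: "Good, you?"}

one_way = merge(india, us_east)
other_way = merge(us_east, india)

print(one_way == other_way)                      # True
print([one_way[k] for k in sorted(one_way)])     # ['Hello', 'How are you?', 'Good, you?']
```

Once rows can be edited in place (presence, shared documents), the two merge directions can disagree, which is the case the table reserves for CRDTs.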
CHECKLIST

12-Point Design Checklist

Figure 10: 12-point design checklist — use this to verify your answer covers all four deep-dive areas before wrapping up
Class Summary: The Four Hard Problems, Solved
  • Real-Time Messaging: WebSocket (not SSE, not polling). 50K connections/server. ~4,000 servers for 200M concurrent users. Sticky LB routing required. Gateway runs Connection Manager + Heartbeat Monitor (25s ping) + Session Registry updater + Message Router.
  • Cross-Server Routing: Redis session store maps user_id → gateway_server_id. Chat Service looks up the recipient's gateway in Redis, then routes via gRPC (~1ms) or Kafka (~5ms). This is the #1 insight interviewers look for.
  • Message Ordering: Server-assigned TIMEUUIDs (never client timestamps). Cassandra: partition key = conversation_id, clustering key = TIMEUUID. Per-conversation ordering only — global ordering is a bottleneck at 1.2M msgs/sec.
  • Offline Sync: Per-conversation cursors on reconnect. Delta query: message_id > last_seen_cursor. Unread counts pushed first (Redis HINCRBY). Messages streamed in batches of 50, most recent first. Outbox pattern + client UUID deduplication.
  • Multi-Region: 3 regions, full independent stack each. LOCAL_QUORUM reads and writes. Async Cassandra replication. Kafka MirrorMaker for cross-region real-time delivery. Session store per-region (not replicated globally). LWW via TIMEUUID for conflict resolution.
