WhatsApp System Design Case Study: Scalability, Reliability & Architecture

Chapters

Chapter 1

Case Study: Writing FR/NFR for WhatsApp ↳ Scope · Architecture · 8 FRs · 6 NFRs · Message Delivery Flow

Chapter 2

Deep Dive: Scalability ↳ Vertical vs Horizontal · 7 Strategies · Scalability Metrics

Chapter 3

Deep Dive: Reliability ↳ Redundancy · Defense in Depth · Nines · Reliability Patterns

Chapter 1

Case Study: Writing FR/NFR for WhatsApp

Why WhatsApp Is the Perfect Case Study

WhatsApp is one of the most frequently asked System Design interview questions, and for good reason. It is a deceptively simple product — at its core, it just sends messages between people. But beneath that simplicity lies an engineering marvel that handles over 100 billion messages per day from more than 2 billion users across nearly every country on Earth.

What makes WhatsApp particularly valuable as a learning exercise is that it forces you to think about every major System Design concept: real-time communication (WebSockets), distributed storage (Cassandra), caching (Redis), message queues (Kafka), presence management, media handling, encryption, and more. If you can write clear requirements for WhatsApp, you can write requirements for almost any system.

Step 1: Understand the Scope

Before writing any requirements, you must clarify the scope. WhatsApp has many features: one-on-one messaging, group chats, voice calls, video calls, status/stories, payments, business accounts, and more. In an interview, you cannot design all of these in 45 minutes. You need to pick a focused subset.

✅ In Scope	❌ Out of Scope
One-on-one text messaging	WhatsApp Business features
Group messaging (up to 1024 members)	WhatsApp Payments
Media sharing (images, videos, documents)	Voice/Video calling (separate system)
Message delivery status (sent/delivered/read)	Channels and newsletters
Online/offline presence and last seen	End-to-end encryption key exchange details
Push notifications for offline users	Message search and indexing
Status/Stories (24-hour ephemeral posts)	Account migration between devices

Interview Tip: Always Scope First

Spending 2 minutes scoping the problem saves you from designing the wrong system. Say something like: "I'd like to focus on the core messaging features: 1:1 messages, groups, media sharing, and delivery status. I'll exclude calls and payments for now. Does that sound reasonable?" This shows the interviewer you think before you build.

Step 2: WhatsApp's High-Level Architecture

Before diving into requirements, it helps to understand what WhatsApp looks like at a high level. This context will make the requirements feel concrete rather than abstract.

Figure 1: WhatsApp High-Level Architecture — Clients connect via WebSockets through a load balancer to stateless services backed by distributed data stores

At a high level, WhatsApp's architecture has four layers:

Client Layer: Mobile apps (iOS, Android), web clients, and desktop apps that establish persistent WebSocket connections.
Connection Layer: WebSocket servers maintaining millions of concurrent connections, with a load balancer distributing traffic.
Service Layer: Independent microservices — chat, group, presence, notification, and media service.
Data Layer: Cassandra for message storage (write-heavy, time-ordered), Redis for sessions and presence, PostgreSQL for user/group metadata, and blob storage (S3) + CDN for media.

Functional Requirements for WhatsApp

Functional requirements define what the system must do — the features and behaviors that users directly interact with.

Figure 2: WhatsApp Functional Requirements Map — 8 core feature areas that define the product

FR1

One-on-One Messaging

Send and receive text messages in real-time (up to 65,536 chars). Messages delivered in send order. Stored if recipient is offline, delivered on reconnect.

FR2

Group Messaging

Create groups with up to 1,024 members. All members receive every message. Add/remove members, group admin permissions, group metadata management.

FR3

Media Sharing

Send images (16 MB), videos (2 GB), documents, voice messages, contacts, and location pins. Thumbnails generated for preview before download.

FR4

Message Delivery Status

Sent (single grey tick) → Delivered (double grey) → Read (double blue). Users can disable read receipts for privacy.

FR5

Presence & Last Seen

Show contacts' online status or last seen time. Privacy controls: show to everyone, contacts only, or nobody.

FR6

Push Notifications

When offline or app in background, send push via APNs (iOS) or FCM (Android). Include sender name and message preview. Batch group notifications.

FR7

Status / Stories

Post text, image, or video status updates visible to contacts. Expire after 24 hours. View who has seen your status. Selectively hide from contacts.

FR8

End-to-End Encryption

All messages encrypted using Signal Protocol. Server never has access to plaintext. Keys managed on-device. Group keys rotate when membership changes.

Why FR2 Matters Architecturally

Group messaging introduces fan-out: one message must reach up to 1,024 recipients. For a group of 1,024 members, a single send triggers 1,023 deliveries. If you have millions of active groups, this creates enormous write amplification. The Group Service must efficiently fan out messages using a message queue like Kafka to avoid overwhelming the WebSocket servers.

Why FR4 Matters Architecturally

Delivery status requires acknowledgment flows: the server acknowledges receipt (sent), the client acknowledges receipt (delivered), and the client reports the message was viewed (read). Each acknowledgment is itself a message flowing through the system. For 100 billion messages per day, that means 200–300 billion additional acknowledgment events. This is why efficient, lightweight acknowledgment protocols matter.

Complete FR Summary Table

ID	Requirement	Priority
FR1	Send/receive 1:1 text messages in real-time with in-order delivery	P0 Must Have
FR2	Create/manage groups up to 1024 members with fan-out delivery	P0 Must Have
FR3	Share images, videos, documents, voice messages with thumbnails	P0 Must Have
FR4	Show message status: sent, delivered, read (with privacy controls)	P0 Must Have
FR5	Display online/offline presence and last seen timestamp	P1 Important
FR6	Send push notifications for offline users via APNs/FCM	P0 Must Have
FR7	Post ephemeral status updates that expire after 24 hours	P1 Important
FR8	End-to-end encrypt all content using Signal Protocol	P0 Must Have

Interview Tip: Prioritize Requirements

Use P0 (must have), P1 (important), P2 (nice to have) labels. In an interview, focus your design on P0 requirements first. Mentioning P1/P2 shows thoroughness, but spending time designing P2 features before nailing P0 is a red flag.

Non-Functional Requirements for WhatsApp

While functional requirements define what WhatsApp does, non-functional requirements define how well it must do it. At WhatsApp's scale — 2 billion users, 100 billion messages per day — the non-functional requirements are what truly shape the architecture.

Figure 3: WhatsApp Non-Functional Requirements — six quality attributes that drive every architectural decision

NFR1

Scalability

Target: 2B+ users · 100B+ messages/day · 1.2M QPS peak

Must scale horizontally. Hundreds of WebSocket servers behind load balancers. Data partitioned across database clusters. Each microservice scales independently based on its load pattern.

NFR2

Latency

Target: <100ms message delivery for online users · <2s media on 4G

Persistent WebSocket connections eliminate TCP handshake overhead. Routes through nearest data center. Redis session lookups in microseconds. Media served through CDN edge nodes.

NFR3

Availability

Target: 99.99% uptime — max 52.6 minutes downtime per year

Redundancy at every layer: multiple WebSocket servers, database replicas, cross-region failover, no single point of failure. Must continue operating even when entire data centers go down.

NFR4

Reliability (Zero Message Loss)

Target: Once server ACKs a message, it must never be lost

Write-ahead logging, Kafka with replication, Cassandra with replication factor of 3. Messages removed from delivery queue only after the recipient's device explicitly acknowledges receipt.

NFR5

Security

Target: E2E encryption · server never sees plaintext · GDPR compliance

Signal Protocol for encryption. Keys generated and stored on-device. Server facilitates key exchange but never possesses decryption keys. Group keys rotate on membership changes for forward secrecy.

NFR6

Consistency

Target: In-order delivery per conversation · eventual consistency for presence

Per-conversation sequence numbers guarantee message order. Cassandra partition key is conversation_id with clustering on timestamp. Presence can be eventually consistent — a few seconds of stale data is imperceptible and acceptable.

Complete NFR Summary

ID	Attribute	Target	Why It Matters
NFR1	Scalability	2B users, 100B msgs/day, 1.2M QPS	System must grow with user base
NFR2	Latency	<100ms delivery, <2s media	Users expect instant feel
NFR3	Availability	99.99% (52.6 min/yr downtime)	Communication is mission-critical
NFR4	Reliability	Zero message loss after server ACK	Trust requires guaranteed delivery
NFR5	Security	E2E encryption, GDPR compliance	Private communication is sacred
NFR6	Consistency	In-order per conversation, eventual for presence	Ordered chat, relaxed presence

How Requirements Drive Architecture

Every architectural decision in WhatsApp traces back to a specific requirement. This is what separates a good System Design answer from a great one:

Scalability → Microservices: Each service (chat, groups, presence) can be scaled independently. Groups might need 10x the compute during peak hours, while presence stays steady.
Latency → WebSockets + Redis: Persistent connections eliminate repeated handshakes. Redis session lookups in microseconds ensure messages route to the correct server instantly.
Availability → No Single Point of Failure: Multiple WebSocket servers, Cassandra with replication factor 3, multi-region deployment.
Reliability → Kafka + Cassandra: Kafka provides durable message queuing with at-least-once delivery. Cassandra stores messages with replication across 3 nodes.
Security → Signal Protocol: End-to-end encryption means the server only routes encrypted blobs. Even a complete server breach reveals nothing.
Consistency → Per-Conversation Sequencing: Each conversation has a monotonically increasing sequence number. Cassandra's partition key is conversation_id with clustering on timestamp.

The Requirements-to-Architecture Link

In your interview, explicitly connect requirements to architecture: "Because we need sub-100ms delivery (NFR2), I am choosing WebSocket over HTTP polling. Because we need zero message loss (NFR4), I am persisting messages to Cassandra before acknowledging to the sender. Because groups can have 1024 members (FR2), I am using Kafka for fan-out to avoid overwhelming the chat servers." This kind of reasoning is exactly what interviewers want to hear.

Message Delivery: Putting FR and NFR Together

The message delivery flow is where functional and non-functional requirements collide. Here is what happens when User A sends a message to User B:

Figure 4: WhatsApp Message Delivery Flow — showing online (instant) and offline (queue + push notification) paths

User A types and hits send. The app encrypts the message using User B's public key (FR8) and sends it over the WebSocket connection to the server (NFR2: low latency).

The WebSocket server forwards it to the Chat Service. Chat Service persists to Cassandra (NFR4: reliability) and returns a "sent" ACK to User A. User A sees a single grey tick (FR4).

Chat Service checks Redis to find which WebSocket server User B is connected to (NFR2: fast lookup). If User B is online, the message is pushed through User B's WebSocket instantly (FR1: real-time).

User B's device receives, decrypts, and sends a "delivered" ACK back to the server (FR4). User A sees double grey ticks.

If User B is offline, the message stays in the queue (NFR4: zero loss) and a push notification is sent via APNs/FCM (FR6). When User B comes online, all queued messages arrive in order (NFR6).

When User B opens the conversation, a "read" ACK is sent (FR4). User A sees double blue ticks.

Common Interview Mistake

Many candidates describe the happy path only (both users online). Strong candidates also cover the offline path, the group fan-out path, and failure scenarios (what if the WebSocket server crashes mid-delivery?). Discussing edge cases demonstrates real-world engineering maturity.

Chapter 2

Deep Dive: Scalability

What Is Scalability?

Scalability is the ability of a system to handle increased load without degrading performance. It is not about being fast — that is performance. Scalability is about staying fast as you grow. A system that responds in 50ms for 100 users but takes 10 seconds for 100,000 users is fast but not scalable.

Think of it like a restaurant. A small café with one chef can serve 20 customers per hour beautifully. But when 200 customers show up, the chef cannot cook 10x faster. Scalability is about adding more chefs (horizontal scaling), upgrading to a bigger kitchen (vertical scaling), or opening multiple locations (distributed systems) — while keeping the quality consistent.

Vertical vs Horizontal Scaling

Figure 5: Vertical Scaling (bigger machine) vs Horizontal Scaling (more machines) — each has clear trade-offs

Vertical Scaling (Scaling Up)

Upgrade a single machine: more CPU cores, more RAM, faster SSDs. The simplest approach — your application code does not change.

✅ Advantages	❌ Disadvantages
No code changes required	Hardware has physical limits
No distributed system complexity	Single point of failure
Easy to manage and debug	Cost increases non-linearly (2x specs = >2x price)
Simple data consistency (single machine)	Requires downtime to upgrade

When to Use Vertical Scaling

Ideal for: early-stage startups with limited traffic, databases that are hard to distribute (like a single PostgreSQL instance), and workloads where simplicity is more valuable than fault tolerance. WhatsApp, in its early days with 200 million users, ran on just 32 servers — but they were extremely powerful Erlang machines optimized for concurrency. This is a brilliant example of vertical scaling done right.

Horizontal Scaling (Scaling Out)

Add more machines. Instead of one powerful server, you have 10, 100, or 10,000 smaller servers working together. Traffic is distributed using load balancers; data is partitioned using sharding.

✅ Advantages	❌ Disadvantages
Virtually unlimited scaling capacity	Application code must handle distribution
No single point of failure	Data consistency becomes complex
Cost-effective (commodity hardware)	Operational overhead increases
Scale incrementally (add servers one at a time)	Network latency between machines adds up

The Scalability Toolkit

Scaling is not just about adding servers. It requires a coordinated set of strategies that work together:

Figure 6: Eight key scalability strategies — combine them based on your system's specific bottlenecks

Load Balancing

Distributes incoming requests across multiple server instances. WhatsApp uses Layer 4 load balancing for WebSocket connections to maintain persistent TCP connections. Algorithms: Round Robin, Least Connections, Consistent Hashing.

Database Sharding

Split data across multiple database instances. Shard by conversation_id so all messages in a conversation are on one shard — enabling efficient queries without cross-shard joins. Avoid poor shard keys that create hotspots.

Caching

Redis caches user session data (which WebSocket server each user is connected to), presence information, and recent messages. A well-implemented cache reduces database load by 90–99%, cutting latency from milliseconds to microseconds.

Async Processing

Push notifications, thumbnail generation, analytics, and archiving happen asynchronously via Kafka. A spike in message volume does not overwhelm downstream services — they process events at their own pace.

CDN

Media files cached at edge locations globally. User A uploads to the nearest edge node; User B downloads from the nearest edge node. Dramatically reduces media latency and offloads traffic from central servers.

Microservices

Chat, groups, presence, notifications, media split into independent services. During peak messaging hours, scale chat servers. During a viral status trend, scale media servers. Granular scaling is far more efficient than scaling a monolith.

Auto-Scaling

Automatically adds instances when CPU, memory, or request rate exceeds a threshold; removes them when load decreases. Handles New Year's Eve spikes without paying for idle capacity during quiet periods.

Scalability Metrics: How to Measure It

Metric	Definition	WhatsApp Example
Throughput	Requests processed per second	1.2M messages/sec at peak
Latency (p50)	Median response time	50ms for message delivery
Latency (p99)	99th percentile response time	200ms (worst 1% of requests)
Concurrent Users	Simultaneous active connections	500M+ concurrent WebSocket connections
Data Volume	Data stored and transferred	100B messages/day ≈ 10 PB/month
Horizontal Capacity	Machines needed to serve load	Thousands of servers across regions

Interview Tip: Think in Numbers

Instead of "We need a lot of servers," say: "With 1.2M QPS and each server handling ~10K connections, we need at least 120 WebSocket servers for connection handling alone, with 2x capacity for redundancy — so about 250 servers." This kind of reasoning demonstrates real engineering ability.

The Numbers in Context

100 billion messages per day = 1.16 million messages per second on average. But traffic is not uniform — peak hours can see 3–5x the average. The system must handle 3–6 million messages per second at peak. Each message generates 2–3 acknowledgment events, so the real event throughput is 6–18 million events per second. Sharding by conversation ID distributes this load across hundreds of database nodes.

Chapter 3

Deep Dive: Reliability

What Is Reliability?

Reliability is the ability of a system to perform its intended function correctly and consistently over time, even when things go wrong. A reliable system does not just work — it works when hardware fails, when networks partition, when traffic spikes, and when software bugs surface.

The fundamental truth of distributed systems is that everything fails, all the time. Hard drives fail. Network cables get cut. Entire data centers lose power. Software has bugs. A reliable system is not one that prevents all failures (that is impossible). It is one that continues to function correctly despite failures.

Amazon's Famous Quote

Werner Vogels, CTO of Amazon, famously said: "Everything fails, all the time." This is not pessimism — it is engineering reality. At Amazon's scale (millions of servers), hundreds of components fail every day. The system is designed so that no single failure, or even multiple simultaneous failures, takes down the overall service.

Reliability vs Availability: What Is the Difference?

Concept	Definition	Analogy
Availability	Is the system UP and responding?	Is the hospital open today?
Reliability	Is the system producing CORRECT results?	Does the hospital give correct diagnoses?

A system can be available but unreliable: it is up and responding, but it returns wrong data, drops messages, or produces inconsistent results. A system can also be reliable but unavailable: when it is up it works perfectly, but it experiences frequent downtime. For WhatsApp, both are critical — the system must be up 99.99% of the time AND never lose a message.

Redundancy and Failover

The primary strategy for achieving reliability is redundancy — having backup components that can take over when primary components fail.

Figure 7: Redundancy and Failover — when the primary server fails, the standby automatically takes over

Active-Passive Failover

One server (active/primary) handles all requests. A second server (passive/standby) sits idle, receiving data replication from the primary. If the primary fails, the standby is promoted. Simple but wastes resources. Failure detection works through heartbeats: the primary sends a periodic signal ("I'm alive") to a monitoring system. If heartbeats stop for 15–30 seconds, the system promotes the standby.

Active-Active Failover

Both servers actively handle requests. Traffic is split between them. If one fails, the other absorbs the full load. More efficient (no idle resources) but more complex (both servers must be synchronized). This is WhatsApp's approach: multiple WebSocket servers actively handle connections, and if one goes down, affected users reconnect to another server automatically.

Defense in Depth: Layers of Reliability

A single reliability mechanism is not enough. Production systems use multiple layers of protection, each catching failures that slip through the layer above:

Figure 8: Building Reliability through Defense in Depth — five nested layers from broadest protection to last resort

Multi-Region Deployment

Deploy across multiple geographic regions (US East, US West, Europe, Asia). Protects against regional disasters: data center fires, power grid failures, natural disasters, even undersea cable cuts. If the entire US East region goes down, traffic routes to US West.

Data Replication

Every piece of critical data stored on multiple machines. Cassandra's replication factor of 3 means every message exists on three different nodes, typically in different racks or data centers. If one node's hard drive fails, two copies remain.

Redundancy and Failover

At the component level, every critical service has backup instances. If a WebSocket server crashes, the load balancer detects the failure via health checks and routes new connections to healthy servers. Users experience a brief disconnection but reconnect within seconds.

Health Monitoring & Alerting

Comprehensive monitoring tracks CPU, memory, disk I/O, network latency, error rates, and business metrics (messages/sec, delivery latency). Automated alerts notify on-call engineers. For critical failures, automated remediation scripts restart failed services or trigger failover without human intervention.

Graceful Degradation

The last resort. When overwhelmed, serve partial results instead of crashing. If the presence service is down, WhatsApp can still deliver messages — users just will not see "online" status temporarily. The most critical features (message delivery) are protected at the expense of less critical ones (typing indicators).

WhatsApp's Graceful Degradation in Action

During the massive traffic spike on New Year's Eve 2017, WhatsApp temporarily disabled status updates and typing indicators to protect core message delivery. Users could still send and receive messages without issues. Most users did not even notice the degradation. This is graceful degradation: sacrifice the peripheral to protect the essential.

The Nines of Availability

Availability is measured as a percentage of uptime, expressed in "nines." Each additional nine represents a 10x reduction in downtime — but achieving it costs exponentially more:

Figure 9: The Nines of Availability — each additional nine costs exponentially more to achieve

Availability	Annual Downtime	Monthly Downtime	Typical System
99% (two nines)	3.65 days	7.2 hours	Internal tools, dev environments
99.9% (three nines)	8.76 hours	43.2 minutes	Standard SaaS products
99.99% (four nines)	52.6 minutes	4.38 minutes	WhatsApp, payment systems, banks
99.999% (five nines)	5.26 minutes	25.9 seconds	Air traffic control, stock exchanges

Interview Pitfall: Claiming Five Nines Without Justification

A common mistake is saying "We need 99.999% availability" without understanding the cost. It is far more impressive to say: "Our messaging system needs 99.99% availability because communication is mission-critical. Five nines would require multi-region active-active deployment which adds significant complexity and cost. Given our scale, four nines is the right balance of reliability and engineering investment." This shows you understand the trade-off between availability and cost.

Reliability Patterns Every Engineer Should Know

Retry with Exponential Backoff

When a request fails, retry after 1s, then 2s, then 4s, then 8s. Prevents overwhelming a struggling service. Add jitter (random delay) so thousands of clients do not all retry at the exact same moment.

Circuit Breaker

If a downstream service fails repeatedly, stop calling it temporarily. Like an electrical circuit breaker, this prevents cascading failures. After a cooldown period, send a test request — if it succeeds, resume normal traffic.

Idempotent Operations

Design operations so that executing them multiple times produces the same result as once. If a "deliver message" request times out and is retried, the message should not be delivered twice. Use unique message IDs to deduplicate.

Dead Letter Queues

Messages that cannot be processed after multiple retries are moved to a separate queue for manual investigation. Prevents a single bad message from blocking the entire queue.

Chaos Engineering

Intentionally inject failures in production to test that reliability mechanisms actually work. Netflix's Chaos Monkey randomly terminates server instances to ensure their system can handle it. If you do not test failure, you do not know if your safeguards work.

How Scalability and Reliability Work Together

Figure 10: Scalability and Reliability reinforce each other — a scalable system reroutes traffic during failures; a reliable system stays up during traffic spikes

A scalable system with many server instances naturally has redundancy built in. If one of 100 servers fails, the remaining 99 absorb the load — a trivial impact. But if you only have 2 servers, losing one means the survivor must handle double the load, potentially cascading into a full outage.

Conversely, a reliable system with good failover can scale more confidently. Auto-scaling works best when the system can gracefully handle servers being added and removed without disruption. Health monitoring ensures that newly added servers are healthy before receiving traffic.

For WhatsApp, these two qualities form a virtuous cycle: horizontal scaling across thousands of servers provides natural redundancy, while reliability mechanisms ensure that scaling events do not introduce instability. Together, they enable a system that serves 2 billion users with less than an hour of downtime per year.

The Key Interview Insight

When discussing scalability and reliability, always show how they connect: "By horizontally scaling our WebSocket tier across 250 servers, we get both throughput capacity for 1.2M QPS (scalability) and natural fault tolerance because losing any single server only affects 0.4% of connections (reliability). This is a design where scalability and reliability reinforce each other — which is always what we aim for." This kind of integrated thinking is what earns top marks.

Wrapping Up

Class 1 Complete

This post-class covered three interconnected topics that form the core of System Design thinking:

WhatsApp FR/NFR Case Study: How to systematically write Functional Requirements (8 features: messaging, groups, media, delivery status, presence, notifications, stories, encryption) and Non-Functional Requirements (6 quality attributes: scalability for 2B users, sub-100ms latency, 99.99% availability, zero message loss, end-to-end security, per-conversation consistency).
Scalability Deep Dive: Vertical vs horizontal scaling trade-offs, the toolkit of 7 strategies (load balancing, sharding, caching, async processing, CDN, microservices, auto-scaling), and how to quantify scalability with real numbers.
Reliability Deep Dive: Redundancy and failover patterns, defense in depth (5 layers), the nines of availability, and key reliability patterns (retry, circuit breaker, idempotency, dead letter queues, chaos engineering).

The crucial insight: requirements drive architecture. Every WhatsApp decision — WebSockets over HTTP, Cassandra over MySQL, Kafka for async delivery, Redis for session lookup, CDN for media — traces directly back to a specific FR or NFR. When you can explain those connections clearly in an interview, you demonstrate the kind of engineering thinking that companies are looking for.

Want to Land at Google, Microsoft or Apple?

Watch Pranjal Jain's free 30-min training — the exact GROW Strategy that helped 1,572+ engineers go from TCS/Infosys to top product companies with a 3–5X salary hike.

DSA + System Design roadmap 1:1 mentorship from ex-Microsoft Google login middot; Free forever

Watch Free Training →rarr;

In-Class Notes All Classes

WhatsApp System Design Case Study
Scalability, Reliability & Architecture

Chapters

Chapter 1

Case Study: Writing FR/NFR for WhatsApp

Why WhatsApp Is the Perfect Case Study

Step 1: Understand the Scope

Step 2: WhatsApp's High-Level Architecture

Functional Requirements for WhatsApp

Complete FR Summary Table

Non-Functional Requirements for WhatsApp

Complete NFR Summary

How Requirements Drive Architecture

Message Delivery: Putting FR and NFR Together

Chapter 2

Deep Dive: Scalability

What Is Scalability?

Vertical vs Horizontal Scaling

Vertical Scaling (Scaling Up)

Horizontal Scaling (Scaling Out)

The Scalability Toolkit

Scalability Metrics: How to Measure It

Chapter 3

Deep Dive: Reliability

What Is Reliability?

Reliability vs Availability: What Is the Difference?

Redundancy and Failover

Active-Passive Failover

Active-Active Failover

Defense in Depth: Layers of Reliability

The Nines of Availability

Reliability Patterns Every Engineer Should Know

How Scalability and Reliability Work Together

Wrapping Up

Class 1 Complete

Want to Land at Google, Microsoft or Apple?

Pranjal Jain

Free Training — Watch Now

Class 1 Reading Path