Topic 1

Partitioning Trade-offs

01

What You Gain vs What You Lose

Partitioning (sharding) is the most powerful and most costly scaling technique in your toolbox. It allows you to exceed the capacity of a single machine by distributing data across multiple machines. The decision to shard is irreversible in most architectures — you cannot easily un-shard later — so understanding the trade-offs before committing is critical.

Figure 1: Four gains of partitioning (write scalability, data capacity, parallel queries, fault isolation) vs four losses (cross-shard joins, cross-shard transactions, rebalancing pain, operational complexity).

The Gains:

  • Write Scalability: A single PostgreSQL node maxes out at ~50K writes/second. With 4 shards, you get 200K writes/second because each shard handles its own subset. This is the primary reason to shard — writes cannot be parallelized within a single node.
  • Data Capacity: A single node's disk is finite (typically 1–16 TB). With 4 shards, you store 4x the data. Google Spanner uses thousands of shards to store petabytes. Each shard is a manageable chunk of data with its own index.
  • Parallel Queries: A full-table scan that takes 10 minutes on one node takes 2.5 minutes across 4 shards in parallel (scatter-gather). Analytical workloads benefit enormously from this.
  • Fault Isolation: If Shard 1 goes down, Shards 2–4 still serve their users. A single-node database is a single point of failure for all data.

The Losses:

  • Cross-Shard Joins: If User 42's orders are on Shard 1 and Product 99's details are on Shard 3, a JOIN requires querying both shards, transferring data over the network, and merging results in application code. This scatter-gather pattern has high latency and operational complexity.
  • Cross-Shard Transactions: ACID transactions across shards require two-phase commit (2PC), which is slow (2x latency), fragile (coordinator crash = blocked transactions), and reduces availability. The practical advice: design your data model to avoid cross-shard transactions.
  • Rebalancing Pain: When you add a new shard, data must be migrated from existing shards. This is an expensive, disruptive operation. Consistent hashing minimizes the data movement but doesn't eliminate it.
  • Operational Complexity: Each shard is an independent database server: backups, monitoring, schema migrations, and failover must be managed N times. Your ops burden scales with shard count.
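The cross-shard join cost described above can be made concrete with a small scatter-gather sketch. This is an illustrative in-memory model (the shard names, record shapes, and the `orders_with_products` helper are all made up for this example), not a real database driver:

```python
# Illustrative in-memory "shards"; a real system would issue parallel
# network queries to independent database servers.
SHARDS = {
    "shard1": {"orders": [{"user_id": 42, "product_id": 99, "qty": 2}]},
    "shard2": {"orders": []},
    "shard3": {"products": [{"product_id": 99, "name": "Widget"}]},
}

def orders_with_products(user_id):
    # Scatter: ask every shard for the user's orders.
    orders = [o for shard in SHARDS.values()
                for o in shard.get("orders", []) if o["user_id"] == user_id]
    # Scatter again: fetch the product rows referenced by those orders.
    wanted = {o["product_id"] for o in orders}
    products = {p["product_id"]: p for shard in SHARDS.values()
                for p in shard.get("products", []) if p["product_id"] in wanted}
    # Gather: join in application code, paying the network-transfer and
    # merge cost that a single-node SQL JOIN would avoid.
    return [{**o, "product": products[o["product_id"]]} for o in orders]
```

Every shard is touched on every step, which is why scatter-gather latency is bounded by the slowest shard.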
Interview Principle

In interviews, propose sharding only when you've exhausted vertical scaling and read replicas. Start with a single large node, add read replicas, then shard when writes become the bottleneck. The interviewer will ask: "How did you decide to shard?" The answer is: when write throughput exceeded what a single primary can handle.

02

Partition Key Trade-off: Locality vs Distribution

The partition key determines two things: how evenly data is distributed (avoiding hotspots) and how much data is co-located (enabling local queries). These two goals are in direct tension — you cannot fully optimize for both at once.

Figure 2: user_id as partition key gives data locality (all user data co-located) but risks hotspots if a celebrity user generates disproportionate load. order_id gives even distribution but scatters user data across shards.
Partition Key | Locality | Distribution | Hotspot Risk | Best For
user_id | All user data on one shard | Skewed if power users exist | High (celebrity problem) | Social apps with user-centric queries
order_id | Order data scattered | Even (random IDs) | Low | Order processing, analytics
geography | Regional data co-located | Skewed by population | Medium (dense regions) | Geo-local apps (food delivery)
hash(user_id) | None (hash destroys locality) | Even | Very low | Pure write throughput, no range queries
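Hash-based shard routing can be sketched in a few lines. The shard count and key format here are illustrative, and the distribution check is a quick sanity test, not a production benchmark:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # A stable hash (md5 here, rather than Python's per-process-salted
    # hash()) spreads keys evenly, but adjacent keys land on unrelated
    # shards, so range queries over the key become impossible.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

# Distribution check: 10,000 synthetic user keys land roughly evenly.
counts = Counter(shard_for(f"user:{i}") for i in range(10_000))
```

Note that `% NUM_SHARDS` routing forces near-total data movement when the shard count changes; consistent hashing exists precisely to shrink that cost.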
Production Example: Twitter's Hotspot Problem

Twitter's original sharding used user_id. When a celebrity with 100M followers tweeted, one shard received 100x the normal write load as fanout jobs wrote to follower timelines. The fix: special-case celebrity accounts (users with >1M followers) and precompute their fan-out differently, bypassing the shard bottleneck.

Topic 2

Replication Strategies

03

Three Replication Topologies

Replication copies data across multiple nodes for three reasons: durability (data survives node failures), read scaling (distribute reads across replicas), and geographic proximity (replicas near users reduce latency). The topology determines how writes flow through the system.

Figure 3: Three topologies — Single-Leader (one writer, simplest), Multi-Leader (multiple writers, cross-region), Leaderless (any node writes, tunable consistency).
Topology | Write Path | Consistency | Conflict Risk | Real Systems
Single-Leader | All writes → one primary → replicated to followers | Strong (sync replication) or eventual (async) | None (one writer) | PostgreSQL, MySQL, MongoDB, Redis Sentinel, Kafka (per partition)
Multi-Leader | Each region has a leader; leaders replicate to each other | Eventual (async cross-region) | High (concurrent writes to same key in different regions) | DynamoDB Global Tables, CouchDB, MySQL Group Replication (multi-primary)
Leaderless | Client writes to W nodes simultaneously, reads from R nodes | Tunable (W + R > N = strong) | Medium (concurrent writes to same key) | Cassandra, DynamoDB, Riak

Single-Leader (Primary-Replica): One node (the leader/primary) accepts all writes. It replicates changes to followers (replicas) via a replication log. Followers are read-only. This is the simplest and most common topology. Consistency is configurable: synchronous replication (follower must ACK before primary responds — strong consistency, lower throughput) or asynchronous (primary responds immediately, replicates in background — eventual consistency, higher throughput). Use this as your default in any interview where strong consistency matters.
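The sync-vs-async trade-off can be sketched with a toy model. The `Primary` and `Follower` classes below are illustrative inventions, a minimal sketch assuming in-process method calls stand in for network replication:

```python
class Follower:
    """Read-only replica that applies entries shipped from the primary."""
    def __init__(self):
        self.log = []

    def apply(self, entry):
        self.log.append(entry)
        return True  # ACK back to the primary

class Primary:
    def __init__(self, followers, synchronous=True):
        self.log = []
        self.followers = followers
        self.synchronous = synchronous
        self.unreplicated = []  # entries not yet shipped (async mode)

    def write(self, entry):
        self.log.append(entry)
        if self.synchronous:
            for f in self.followers:      # block until every follower ACKs
                f.apply(entry)
        else:
            self.unreplicated.append(entry)  # respond now, ship later
        return "ok"

    def flush(self):
        """Background replication loop for async mode (run periodically)."""
        while self.unreplicated:
            entry = self.unreplicated.pop(0)
            for f in self.followers:
                f.apply(entry)
```

In synchronous mode the follower's log is current the moment `write` returns; in asynchronous mode the follower lags until `flush` runs, which is exactly the replication-lag window that produces stale reads.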

Multi-Leader (Active-Active): Multiple nodes in different regions each accept writes. Changes are replicated asynchronously between leaders. This gives low write latency for users in each region (writes go to the local leader). The challenge: if two users in different regions update the same record simultaneously, you have a write conflict. Use only when you need low-latency writes across multiple geographic regions and you have a conflict resolution strategy.

Leaderless (Dynamo-style): No designated leader. The client writes to W nodes and reads from R nodes. If W + R > N (total replicas), reads and writes overlap, guaranteeing the client reads the latest write. The most flexible topology but requires the application to handle stale reads and version conflicts. Choose when you need high availability and can tolerate eventual consistency.

04

Quorum Reads & Writes: Tunable Consistency

In leaderless systems, the quorum formula W + R > N guarantees that at least one node in the read set has the latest write. With N=3, you tune consistency vs. availability by choosing W and R:

Figure 4: Quorum configurations with N=3 — W=3,R=1 (strong writes, tolerates no write failures), W=2,R=2 (balanced), W=1,R=1 (fastest, weak consistency).
Strong Writes
W=3, R=1
Every node must ACK write. One-node read always gets latest. Writes fail if any node is down.
Balanced (Most Common)
W=2, R=2
W+R=4 > N=3. Any 2-node overlap guarantees freshness. Tolerates 1 node failure for both reads and writes.
Fast but Weak
W=1, R=1
Lowest latency. No overlap guarantee — may read stale data. Use only when staleness is acceptable.
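The overlap guarantee can be demonstrated with a small model. The `quorum_write` and `quorum_read` helpers are illustrative, assuming writer-supplied version numbers (a real system would use timestamps or vector clocks) and a caller-supplied list of reachable node indices:

```python
N, W, R = 3, 2, 2  # W + R = 4 > N = 3, so read and write sets must overlap

# Each replica stores a (version, value) pair.
replicas = [{"version": 0, "value": None} for _ in range(N)]

def quorum_write(version, value, nodes):
    if len(nodes) < W:
        raise RuntimeError("write quorum unavailable")
    for i in nodes[:W]:  # write must reach W replicas to succeed
        replicas[i] = {"version": version, "value": value}

def quorum_read(nodes):
    if len(nodes) < R:
        raise RuntimeError("read quorum unavailable")
    # Of the R copies read, keep the one with the highest version:
    # the overlap guarantees at least one copy is the latest write.
    return max((replicas[i] for i in nodes[:R]), key=lambda rep: rep["version"])
```

Even when the read set differs from the write set, any two sets of size W and R with W + R > N share at least one node, so the highest-versioned copy read is the latest write.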
Cassandra in Practice

Cassandra's consistency level QUORUM sets W=⌊N/2⌋+1, R=⌊N/2⌋+1. With N=3, that's W=2, R=2 — strongly consistent and equivalent to a CP system during normal operation. Cassandra with ONE sets W=1, R=1 — eventual consistency, maximum availability. Most Cassandra deployments use different consistency levels per operation: QUORUM for writes, LOCAL_ONE for non-critical reads.

Topic 3

Failure Scenarios

05

A Taxonomy of What Can Go Wrong

In distributed systems, failure is not an exception — it is the norm. Google's data shows that in a cluster of 10,000 servers, on average 2–3 servers fail per day, 1 network switch fails per month, and 1 rack loses power per year. Design for failure from day one; don't bolt on resilience afterwards.

Figure 5: Six failure types — node crash (fail-stop), network partition, byzantine fault, slow node (gray failure), disk corruption, and cascading failure — each requiring different mitigation strategies.
  • Node Crash (fail-stop): Server dies completely. Detected via heartbeat timeout (10–30s). Mitigation: automatic failover, leader re-election.
  • Network Partition: Cluster splits into isolated groups. Nodes are alive but can't communicate. The trigger for CAP trade-offs.
  • Slow Node (gray failure): The most insidious. Responds, but very slowly: GC pauses, disk saturation, memory pressure. Circuit breakers handle this.
  • Cascading Failure (systemic): One failure triggers a chain. DB slow → threads block → thread pool exhausted → upstream service times out → retries amplify load.

Figure 6: A network partition splits the cluster into majority and minority groups. Each group sees the other as "down." This is the scenario that makes the CAP trade-off unavoidable.
The Slow Node Is Worse Than a Dead Node

A dead node fails fast — timeouts fire, failover kicks in. A slow node keeps responding just enough to prevent failover, but slow enough to back-pressure the entire system. Thread pools fill up waiting for the slow node. The fix: circuit breakers that detect elevated error rates or p99 latencies and temporarily stop sending traffic to the slow node, giving it time to recover.
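A circuit breaker can be sketched as a small state machine. This is a minimal illustration, not the Hystrix or Resilience4j API; the thresholds mirror the example numbers quoted in this section, and the injectable `clock` parameter is an assumption made to keep the sketch testable:

```python
import time

class CircuitBreaker:
    """Minimal sketch: opens after `max_failures` consecutive failures,
    then lets a probe request through after `reset_timeout` seconds."""

    def __init__(self, max_failures=5, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of tying up a thread on the slow node.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: the timeout elapsed, so let one probe through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0      # success closes the circuit
        self.opened_at = None
        return result
```

The key property is the fail-fast branch: while open, callers get an immediate error instead of queuing behind a slow dependency, which is what stops thread pools from filling up.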

Failure Type | Detection | Primary Mitigation | Interview Answer
Node crash | Heartbeat timeout (10–30s) | Automatic failover, replica promotion | "We use health checks + automated failover via Pacemaker / k8s"
Network partition | Peer-to-peer timeout | Majority quorum (refuse minority writes) | "CAP forces a choice — we prefer CP for payment writes"
Slow node | p99 latency spike, timeout rate | Circuit breaker (Hystrix / Resilience4j) | "Circuit breaker opens after 5 consecutive failures, 30s recovery"
Cascading failure | Thread pool exhaustion, queue depth | Bulkhead isolation, backpressure, rate limiting | "Bulkhead: separate thread pools per downstream service"
Disk corruption | Checksum mismatch on read | Replication + checksummed storage (ZFS, Btrfs) | "RAID is not a substitute for replication across nodes"

Topic 4

CAP Trade-offs

06

The CAP Theorem: Reframed for Interviews

The CAP theorem states that a distributed system can provide at most two of three properties: Consistency (every read returns the most recent write), Availability (every request gets a non-error response), and Partition Tolerance (the system continues operating despite network partitions).

Figure 7: The CAP triangle — CP systems (etcd, ZooKeeper) choose consistency when a partition occurs. AP systems (Cassandra, DynamoDB) remain available but may return stale data. CA (no partition tolerance) only applies to single-node systems.
The Key Reframe

The CAP trade-off only activates when a partition occurs. During normal operation, you get all three properties. When a partition happens, you must choose: reject writes (maintain consistency, lose availability) or accept writes on both sides (maintain availability, risk inconsistency). In an interview, this is the question: "When a partition happens in your system, which do you prefer — returning an error, or returning potentially stale data?"

07

How Real Systems Respond to Partitions

Figure 8: How real databases map to CAP. Most modern systems are tunable — Cassandra behaves as CP with QUORUM consistency level, and as AP with ONE. The choice is per-operation, not per-system.

CP (Consistency + Partition tolerance): During a partition, the minority side (fewer than N/2+1 nodes) rejects all writes. Only the majority side accepts writes. This ensures no conflicting writes exist — the system is consistent. But users on the minority side get errors. etcd, ZooKeeper, and single-leader PostgreSQL with synchronous replication are CP.

AP (Availability + Partition tolerance): During a partition, BOTH sides continue accepting writes. Users on both sides get responses (no errors). But the two sides may have conflicting data — User 42's balance might be $500 on one side and $300 on the other. Once the partition heals, these conflicts must be resolved. Cassandra with ONE, DynamoDB without transactions, and DNS are AP.
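The CP rule from the previous paragraph reduces to a one-line majority check. A minimal sketch (the function name is illustrative):

```python
def can_accept_writes(reachable: int, total: int) -> bool:
    """CP behavior during a partition: only the side that can still reach
    a strict majority (N // 2 + 1 nodes) keeps accepting writes; the
    minority side returns errors until the partition heals."""
    return reachable >= total // 2 + 1
```

Note the even-split case: in a 4-node cluster partitioned 2/2, neither side holds a strict majority, so neither accepts writes. This is why coordination clusters are deployed with an odd node count.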

System | Default CAP Position | Tunable? | When to Choose
PostgreSQL (primary only) | CA (no partition tolerance) | Add replicas → CP | Single-region, strong consistency required
etcd / ZooKeeper | CP | No | Coordination (locks, leader election, config)
Cassandra | AP (default) | Yes — QUORUM = CP | High write throughput, eventual consistency OK
DynamoDB | AP (eventual reads default) | Yes — strongly consistent reads available | Scale-out key-value, tunable per read
CockroachDB | CP | No | Multi-region, strong consistency, SQL interface
MongoDB | CP (default with writeConcern majority) | Yes — lower writeConcern = AP-like | General-purpose document store
08

Conflict Resolution in AP Systems

AP systems accept writes on both sides of a partition. When the partition heals, conflicting versions must be reconciled. There are three standard strategies:

Figure 9: Three conflict resolution strategies — Last-Write-Wins (simple, lossy), CRDTs (auto-merge, mathematically correct), and application-level resolution (flexible, requires custom logic).

Last-Write-Wins (LWW): Compare timestamps, keep the value with the highest timestamp. Simple to implement but loses data: if two users update the same field at roughly the same time, one update is silently discarded. DynamoDB and Cassandra use LWW by default. Acceptable for use cases where the last write is semantically correct (e.g., "last viewed at" timestamp).
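LWW reduces to a single comparison. A minimal sketch, assuming `(value, timestamp)` pairs and an arbitrary tie-break in favor of the first argument:

```python
def lww_merge(a, b):
    """Last-write-wins: keep the (value, timestamp) pair with the higher
    timestamp. The losing concurrent write is silently discarded, which
    is exactly the data-loss risk described above."""
    return a if a[1] >= b[1] else b
```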

CRDTs (Conflict-free Replicated Data Types): Mathematical data structures designed to merge without conflicts. A G-Counter (grow-only counter) on each node independently increments, and merging takes the max of each node's counter — no data is ever lost. Shopping carts modeled as add-only sets (G-Set) can merge trivially. CRDTs only work for data types where merge is mathematically well-defined. Riak and Redis CRDT module use these.
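The G-Counter merge rule can be sketched directly. This is an illustrative class, assuming a fixed number of replicas known up front:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica. Each replica only
    increments its own slot, and merge takes the per-slot max, so
    concurrent increments on different replicas are never lost."""

    def __init__(self, num_nodes, node_id):
        self.counts = [0] * num_nodes
        self.node_id = node_id

    def increment(self, amount=1):
        self.counts[self.node_id] += amount

    def merge(self, other):
        # Commutative, associative, idempotent: merge order never matters.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    @property
    def value(self):
        return sum(self.counts)
```

After merging in either direction, both replicas converge on the same total, which is the convergence property that makes CRDTs safe under arbitrary partition-and-heal sequences.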

Application-Level Resolution: The database returns all conflicting versions to the application, which decides how to merge. Amazon's original shopping cart design (DynamoDB paper): if two concurrent operations added different items to the cart, the merged cart contains both. Amazon chose to never lose an item — even at the cost of phantom items appearing. The application knows the domain semantics better than the database.
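The cart merge in the spirit of the Dynamo paper is just a set union. A minimal sketch, modeling each divergent cart version as a set of item IDs:

```python
def merge_carts(version_a: set, version_b: set) -> set:
    # Union the divergent versions: an item added on either side is never
    # lost, at the cost of a deleted item possibly reappearing (the
    # "phantom item" trade-off Amazon accepted).
    return version_a | version_b
```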

Strategy | Data Loss Risk | Complexity | Best For
Last-Write-Wins | High (concurrent updates lost) | Low | Metadata, timestamps, idempotent writes
CRDTs | None (by design) | Medium (only specific data types) | Counters, sets, add-only collections
Application resolution | Depends on logic | High (custom per domain) | Shopping carts, collaborative editing, business-critical merges
Avoid conflict (CP) | None | Low (use a CP system) | Financial transactions, inventory, locks
09

The Decision Framework

The most important takeaway from this class: you do not choose CP or AP for your entire system. You choose per-operation. Within the same application, payment writes use CP (PostgreSQL primary, strong consistency), user feed reads use AP (Cassandra eventual, fast), and session tokens use CP (Redis with quorum). The framework:

Figure 10: Decision tree — if the operation requires ACID guarantees, choose CP. If the system can tolerate stale reads and prioritizes availability, choose AP. Most production systems are hybrid — CP for writes, AP for reads.
Operation Type | CAP Choice | Why | Example
Payment, debit, inventory decrement | CP — strong consistency | Data loss or duplicate processing is catastrophic | PostgreSQL primary with synchronous replication
User profile read, feed, recommendations | AP — eventual consistency | Slightly stale data is invisible to users; availability matters more | Cassandra with ONE consistency
Distributed lock, leader election | CP — consensus-backed | Multiple processes believing they hold a lock = split-brain disaster | etcd, ZooKeeper
Like counts, view counters | AP — CRDTs or LWW | Exact count doesn't matter; approximate is fine | Cassandra counter column or Redis CRDT
Shopping cart | AP — application merge | Never lose an item; merge on conflict | DynamoDB with application-level resolution

Class Summary

Four Topics, One Framework

Partitioning

Shard for write throughput and data capacity. Pay the price in cross-shard joins and operational complexity. Partition key choice is irreversible — get locality vs. distribution right upfront.

Replication

Single-leader (simplest, default). Multi-leader (cross-region writes, complex conflicts). Leaderless (tunable via W+R>N). Default to single-leader in interviews unless multi-region writes are explicitly required.

Failure Taxonomy

Node crash → failover. Network partition → CAP choice. Slow node → circuit breaker. Cascading → bulkhead. Gray failures (slow nodes) are more dangerous than hard crashes.

CAP Trade-offs

CAP only matters during a partition. Choose CP for financial operations and locks. Choose AP for reads, counters, and carts. Most real systems are tunable — you choose per-operation, not per-system.

Conflict Resolution

LWW (simple, lossy) → CRDTs (math-safe, limited types) → Application resolution (flexible, complex). For anything financial, avoid AP systems entirely — use CP and avoid the conflict problem.

Interview Rule

Never start with sharding. Never start with multi-leader. Start simple, scale vertically, add read replicas, then shard. Justify every complexity you introduce.
