Concept 1

Why Distributed Systems Are Hard

A single-machine program is predictable: one process, one clock, no network. When you move to distributed systems — multiple machines communicating over a network — three fundamental challenges emerge that do not exist on a single machine. Every distributed algorithm, from leader election to consensus to distributed locking, exists to cope with these three demons.

Figure 1: The three demons — unreliable networks, unsynchronized clocks, and process crashes — that make distributed systems fundamentally harder than single-machine programs
DEMON 1

The Network Is Unreliable

Messages between machines can be lost, delayed, duplicated, or reordered. When Node A sends a message to Node B and does not receive a response, it cannot distinguish between: Node B is dead, the network dropped the request, the network dropped the response, or Node B is alive but very slow. This ambiguity is the source of most distributed system bugs.

The FLP Impossibility

The FLP theorem (Fischer, Lynch, and Paterson, 1985) proves that in an asynchronous system with even one faulty process, no deterministic algorithm can guarantee that consensus terminates. This is why every real consensus system (Raft, Paxos) relies on timeouts and randomized retries: safety is always guaranteed, but termination is only achieved in practice, not proven in theory.

DEMON 2

Clocks Are Not Synchronized

Each machine has its own physical clock that drifts independently. NTP can synchronize clocks to within a few milliseconds — but that is not good enough for ordering events across machines. If Machine A records an event at 10:00:00.001 and Machine B records an event at 10:00:00.002, you cannot conclude that A's event happened first — the clock difference might be 5ms. This is why distributed systems use logical clocks (Lamport timestamps, vector clocks) instead of wall-clock time for ordering events.
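Logical ordering can be made concrete with a minimal Lamport-clock sketch (class and variable names are illustrative):

```python
class LamportClock:
    """Logical clock: event order comes from message causality, not wall time."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self):
        # Attach the current timestamp to an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # Merge rule: jump ahead of the sender's clock, then tick.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Machine A's wall clock may be behind B's, but causality still orders events:
a, b = LamportClock(), LamportClock()
t_send = a.send()           # A: send event at logical time 1
t_recv = b.receive(t_send)  # B: receiving is always logically later (time 2)
assert t_send < t_recv
```

Whatever the physical clocks say, a message's receive event always carries a larger logical timestamp than its send event.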

DEMON 3

Processes Can Crash at Any Time

Any node can crash at any moment: mid-write, mid-election, mid-replication. You must design every operation to be safe even under partial failure. If 3 of 5 nodes are alive, the system must still function correctly. This is why distributed algorithms require majority quorums: 3 of 5 nodes (majority) must agree for an operation to succeed, ensuring that even if 2 nodes crash, the remaining 3 have enough information to continue.

The Two Generals Problem

Two armies need to coordinate an attack, communicating via messengers who can be captured (messages lost). General A sends "Attack at dawn." Did General B receive it? B sends "Acknowledged." Did A receive the ACK? A sends "ACK of your ACK." This recurses infinitely — neither general can ever be certain the other will attack.

The lesson: you cannot achieve perfect agreement over an unreliable network. Every distributed protocol accepts this and designs around it using quorums, retries, and timeouts.

Concept 2

Concurrency & Race Conditions

A race condition occurs when a system's behavior depends on the relative timing or interleaving of operations that are not properly synchronized. In distributed systems, race conditions are especially dangerous because operations happen on different machines with network delays, making coordination harder than on a single machine with shared memory.

Figure 2: Classic race condition — two users read inventory=1, both decrement to 0, but two items are sold (only one existed). This is a lost update.

The classic example: an e-commerce site has 1 item in stock. User A and User B both click "Buy" at the same time. Both servers read inventory=1, both decide it is available, both decrement to 0, and both confirm the purchase. Two items sold, but only one existed. This is called a lost update — one update overwrites the other because neither was aware of the other's read.
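The interleaving can be reproduced deterministically by separating the read and write steps of two naive "buy" operations against a shared counter (a sketch; real servers would hit the same database concurrently):

```python
inventory = {"item42": 1}

def naive_buy_read(item):
    # Step 1: each server reads the current stock.
    return inventory[item]

def naive_buy_write(item, seen):
    # Step 2: each server writes based on its (possibly stale) read.
    if seen > 0:
        inventory[item] = seen - 1
        return True   # purchase confirmed
    return False

# Both users read BEFORE either writes -- the racy interleaving.
seen_a = naive_buy_read("item42")           # A reads 1
seen_b = naive_buy_read("item42")           # B also reads 1
sold_a = naive_buy_write("item42", seen_a)  # A writes 0, confirms
sold_b = naive_buy_write("item42", seen_b)  # B also writes 0, confirms

assert sold_a and sold_b                    # two purchases confirmed...
assert inventory["item42"] == 0             # ...but only one item existed
```

B's write silently overwrites A's, which is exactly the lost update described above.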

SOLUTIONS

Four Solutions to Race Conditions

| Solution | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Pessimistic Lock (SELECT FOR UPDATE) | Lock the row before reading. Other transactions wait. | Simple, guaranteed safe. | Blocks other users (contention). Deadlock possible. | Hot rows where retries would thrash (single-item checkout) |
| Optimistic Lock (version column) | Read with a version. Write WHERE version = N. Retry if the version changed. | No blocking. High throughput for low-contention workloads. | Retries thrash under high contention. | Most CRUD operations (profile updates, settings) |
| Distributed Lock (Redis SETNX) | Acquire a lock on an external service. Only one holder at a time. | Works across multiple services and databases. | Extra infrastructure. Lock-holder crash risks. | Cross-service coordination (payment + inventory) |
| Atomic Operation (DB-level CAS) | UPDATE SET qty = qty - 1 WHERE qty > 0. No explicit lock; the DB applies it atomically. | Fastest. | Limited to simple operations (increment/decrement). | Inventory, counters, balance deductions |
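The optimistic pattern from the table above can be sketched with a version column; sqlite3 is used here as a stand-in for any SQL database, and the table layout is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, qty INTEGER, version INTEGER)")
conn.execute("INSERT INTO products VALUES (42, 1, 7)")

def buy_optimistic(conn, product_id):
    # Read the row together with its version.
    qty, version = conn.execute(
        "SELECT qty, version FROM products WHERE id = ?", (product_id,)).fetchone()
    if qty <= 0:
        return False
    # Write only if nobody changed the row since our read.
    cur = conn.execute(
        "UPDATE products SET qty = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (qty - 1, product_id, version))
    return cur.rowcount == 1   # 0 rows -> version moved -> caller should retry

assert buy_optimistic(conn, 42)       # succeeds: qty 1 -> 0, version 7 -> 8
assert not buy_optimistic(conn, 42)   # out of stock
```

If a concurrent transaction bumps the version between the read and the write, `rowcount` is 0 and the caller retries from the read step instead of blocking.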
Interview Tip: Default to Atomic Operations

For counters and inventory, use atomic SQL: UPDATE products SET inventory = inventory - 1 WHERE id = 42 AND inventory > 0. If affected_rows = 0, the item is out of stock. This avoids locks entirely.
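The atomic approach checks the affected-row count instead of taking any lock. A minimal sketch, again with sqlite3 standing in for a production database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, inventory INTEGER)")
conn.execute("INSERT INTO products VALUES (42, 1)")

def buy_atomic(conn, product_id):
    # One statement: the database applies read-check-write atomically.
    cur = conn.execute(
        "UPDATE products SET inventory = inventory - 1 "
        "WHERE id = ? AND inventory > 0", (product_id,))
    return cur.rowcount == 1   # 0 affected rows -> out of stock

assert buy_atomic(conn, 42)      # the one item is sold
assert not buy_atomic(conn, 42)  # second buyer is refused -- no oversell
```

Even if two requests race, the database serializes the two UPDATE statements, so only one can satisfy `inventory > 0`.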

For complex multi-step operations (charge payment AND reserve inventory), use a distributed lock or database transaction with SELECT FOR UPDATE.

Concept 3

Distributed Locks

A distributed lock ensures that only one process across multiple machines can execute a critical section at a time. Unlike a database row lock (which works within one database), a distributed lock coordinates across services, databases, and even data centers. The most common implementation uses Redis SETNX (SET if Not eXists) with a TTL (auto-expiry).

Figure 3: Redis distributed lock — Server A acquires with SETNX, Server B fails to acquire. Redlock uses 5 independent Redis nodes for fault tolerance.
REDIS

The Simple Implementation

  1. Acquire: SET lock:order42 server_a_id NX EX 10 (set only if not exists, expire in 10 seconds)
  2. If OK: lock acquired. Proceed with critical section.
  3. If FAIL: lock held by another server. Wait and retry, or return error.
  4. Release: DEL lock:order42 (only if the value matches server_a_id to prevent releasing someone else's lock)
  5. Safety net: TTL auto-releases if the holder crashes (no manual cleanup needed)
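The five steps above, sketched with an in-memory stand-in for Redis (a real client would call redis-py's `set(key, value, nx=True, ex=10)` to acquire, and release with a Lua check-and-delete script so the compare and the DEL are atomic):

```python
import time

class FakeRedisLock:
    """In-memory model of SET key value NX EX ttl, plus safe release."""
    def __init__(self):
        self.store = {}   # key -> (owner_value, expires_at)

    def acquire(self, key, owner_id, ttl):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry and entry[1] > now:
            return False                       # held and not expired: NX fails
        self.store[key] = (owner_id, now + ttl)  # TTL is the crash safety net
        return True

    def release(self, key, owner_id):
        entry = self.store.get(key)
        # Delete only if WE still own it -- never release someone else's lock.
        if entry and entry[0] == owner_id:
            del self.store[key]
            return True
        return False

r = FakeRedisLock()
assert r.acquire("lock:order42", "server_a", ttl=10)      # steps 1-2: acquired
assert not r.acquire("lock:order42", "server_b", ttl=10)  # step 3: B must wait
assert not r.release("lock:order42", "server_b")          # B cannot free A's lock
assert r.release("lock:order42", "server_a")              # step 4: safe release
```

The owner-check on release is what the real pattern implements with a Lua script: without it, a slow holder could delete a lock that has already expired and been re-acquired by someone else.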
REDLOCK

Multi-Node Safety

A single Redis instance is a single point of failure. Redlock solves this by acquiring the lock on 5 independent Redis instances. If the lock is acquired on a majority (3+) within a time threshold, it is considered held. This survives the failure of up to 2 Redis nodes.

Redlock Controversy

Martin Kleppmann argued that Redlock is not safe under all conditions (clock drift, GC pauses). For truly critical operations, use etcd or ZooKeeper which implement consensus-based locks with stronger guarantees.

FENCING

Fencing Tokens: The Extra Safety Layer

Figure 4: Fencing tokens prevent stale lock holders from writing. The storage system rejects any write with a token lower than the highest it has seen.

The biggest danger with distributed locks is the process pause problem. Server A acquires lock (token=33), then pauses for a long GC pause. The lock TTL expires. Server B acquires the lock (token=34) and writes to the database. Server A resumes and writes to the database, overwriting B's data — even though A no longer holds the lock.

Fencing tokens solve this: each lock acquisition returns a monotonically increasing token. The storage system rejects any write with a token lower than the highest it has seen. Server A's write with token=33 is rejected because the database already saw token=34.
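The token check on the storage side is a single comparison. A minimal sketch of the whole exchange (class names are illustrative):

```python
class LockService:
    """Each successful acquisition hands out a strictly increasing token."""
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1
        return self.token

class FencedStorage:
    """Storage that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        if token < self.highest_token:
            return False            # stale holder: reject the write
        self.highest_token = token
        self.data[key] = value
        return True

locks, storage = LockService(), FencedStorage()
token_a = locks.acquire()    # Server A acquires, then stalls in a long GC pause
token_b = locks.acquire()    # A's TTL expires; Server B gets the next token
assert storage.write("file", "from B", token_b)      # B writes with the newer token
assert not storage.write("file", "from A", token_a)  # A resumes -- stale write rejected
assert storage.data["file"] == "from B"
```

Note that the safety check lives in the storage system, not the lock service: even a holder that believes it still owns the lock cannot clobber newer data.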

Locks Do NOT Guarantee Mutual Exclusion

A distributed lock with a TTL guarantees exclusion only while the holder finishes within the TTL — it is not true mutual exclusion. If the lock holder stalls (GC pause, network delay), the lock expires and another process acquires it, and two processes are in the critical section simultaneously. Fencing tokens are the only way to make such locks safe for correctness. Without fencing, use locks only for efficiency (avoiding duplicate work), not for correctness.

Interview Tip: Lock Pattern

"I use Redis SETNX with a 10-second TTL. The lock key is lock:order:{order_id}. The value is the server instance ID for safe release. I also use fencing tokens: the lock service returns a monotonically increasing token, and the database rejects writes with stale tokens. For critical financial operations, I use etcd's lock API which is consensus-backed and safer than Redis."

Concept 4

Leader Election

Many distributed systems require one node to act as the leader: the database primary that accepts writes, the Kafka controller that manages partition assignments, or the scheduler that assigns tasks to workers. Leader election is the process of choosing this node and handling failover when the leader dies.

Figure 5: Leader election — Node B is elected leader. It sends heartbeats to followers. If heartbeats stop, followers start a new election.
HOW IT WORKS

The Election Process

  1. Nodes start as followers. They listen for heartbeats from the current leader.
  2. If a follower does not receive a heartbeat within a timeout (e.g., 300ms), it assumes the leader is dead.
  3. The follower becomes a candidate and requests votes from other nodes.
  4. If the candidate receives votes from a majority (N/2+1), it becomes the new leader.
  5. The new leader starts sending heartbeats to all followers, preventing further elections.
  6. If two candidates start elections simultaneously, the one with the higher term number (epoch) wins. The other steps down to follower.
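The state transitions in steps 1–6 can be sketched as a tiny (single-node, non-networked) state machine; names and the vote counts are illustrative:

```python
class Node:
    """Simplified election state for one node in a known-size cluster."""
    def __init__(self, name):
        self.name = name
        self.state = "follower"
        self.term = 0

    def heartbeat_timeout(self):
        # Steps 2-3: no heartbeat within the timeout -> become a candidate,
        # bump the term, and (in a real system) request votes from peers.
        self.state = "candidate"
        self.term += 1

    def count_votes(self, votes, cluster_size):
        # Step 4: a majority (N/2 + 1) of ALL nodes makes us leader.
        if votes >= cluster_size // 2 + 1:
            self.state = "leader"

    def see_higher_term(self, other_term):
        # Step 6: a competitor with a higher term wins; step down to follower.
        if other_term > self.term:
            self.term = other_term
            self.state = "follower"

n = Node("B")
n.heartbeat_timeout()          # the leader went silent
n.count_votes(3, cluster_size=5)  # 3 of 5 voted for us -> majority
assert n.state == "leader"
n.see_higher_term(n.term + 1)  # a newer epoch appears
assert n.state == "follower"
```

Real implementations add randomized election timeouts so two candidates rarely start elections at the same instant, and a per-term one-vote rule on each voter.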
SPLIT BRAIN

The Most Dangerous Failure Mode

Figure 6: Split brain — a network partition creates two groups, each electing its own leader. Solution: require majority quorum so only one group can elect.

Split brain: a network partition divides the cluster into two groups, and each group elects its own leader. Now there are two leaders accepting writes, causing data conflicts and corruption. This is not theoretical — split brain has caused data loss at GitHub (2012), AWS (2017), and many other companies.

Solution — Majority Quorum: a leader can only be elected with votes from a majority (N/2+1) of all nodes, not just the nodes it can reach. With 5 nodes, you need 3 votes. If a partition splits the cluster into groups of 3 and 2, only the group of 3 can elect a leader. The group of 2 cannot elect because it does not have 3 votes. This guarantees at most one leader at any time.

| Cluster Size | Majority Required | Tolerates Failures | Common Use |
|---|---|---|---|
| 3 nodes | 2 | 1 node failure | etcd, ZooKeeper, small Kafka |
| 5 nodes | 3 | 2 node failures | Production etcd, ZooKeeper |
| 7 nodes | 4 | 3 node failures | High-availability critical systems |
Never Use Even Cluster Sizes

4 nodes requires 3 for majority = tolerates 1 failure. 3 nodes also requires 2 for majority = also tolerates 1 failure. The 4th node adds cost and complexity without improving fault tolerance. Always use 3, 5, or 7 nodes. 5 is the production standard for etcd and ZooKeeper.
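The arithmetic behind the odd-size rule is two lines — majority is N // 2 + 1, and tolerated failures are whatever is left over:

```python
def majority(n):
    # Strictly more than half of ALL nodes.
    return n // 2 + 1

def tolerated_failures(n):
    # Nodes that can die while a majority can still be formed.
    return n - majority(n)

for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: majority {majority(n)}, tolerates {tolerated_failures(n)}")

# 3 and 4 nodes both tolerate exactly 1 failure; 5 and 6 both tolerate 2.
# The extra even node buys nothing, which is why 3, 5, 7 are standard.
assert tolerated_failures(3) == tolerated_failures(4) == 1
assert tolerated_failures(5) == tolerated_failures(6) == 2
```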

Concept 5

Consensus: How Nodes Agree

Consensus is the fundamental problem in distributed systems: how do N nodes agree on a single value (or sequence of values), even when some nodes crash or messages are lost? Leader election is a special case of consensus (agree on who the leader is). Log replication is another case (agree on the order of writes).

Figure 7: Raft protocol — three states (Follower, Candidate, Leader), transitions via timeout and voting, log replication committed on majority ACK
RAFT

The Understandable Consensus Algorithm

Raft was designed by Diego Ongaro and John Ousterhout at Stanford specifically to be understandable (unlike Paxos, which is notoriously difficult). Raft decomposes consensus into three sub-problems: leader election (choose one leader via majority vote), log replication (the leader replicates its log to followers), and safety (only a candidate whose log is at least as up-to-date as a majority's can become leader).

Raft is used by etcd, CockroachDB, TiKV, and Consul.

LOG REPLICATION

How Writes Work in Raft

  1. Client sends a write request to the Leader.
  2. Leader appends the write to its local log as an uncommitted entry.
  3. Leader sends the log entry to all Followers via AppendEntries RPC.
  4. Each Follower appends to its log and ACKs the Leader.
  5. When a majority (N/2+1) of nodes have the entry, it is committed.
  6. Leader applies the committed entry to its state machine and responds to the client.
  7. Followers learn about the commit in the next heartbeat and apply to their state machines.
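Steps 4–5 — commit on majority ACK — reduce to a per-entry counter on the leader. A heavily simplified sketch (real Raft tracks per-follower match indexes and term numbers):

```python
class RaftLeader:
    """Tracks, per log index, how many nodes hold the entry."""
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.log = []          # entries, in order
        self.ack_counts = []   # replicas holding each entry
        self.commit_index = -1

    def append(self, entry):
        # Step 2: uncommitted on the leader; the leader counts as one replica.
        self.log.append(entry)
        self.ack_counts.append(1)
        return len(self.log) - 1

    def on_follower_ack(self, index):
        # Steps 4-5: each follower ACK moves the entry toward commit;
        # majority (N/2 + 1) replicas commits it.
        self.ack_counts[index] += 1
        if self.ack_counts[index] >= self.cluster_size // 2 + 1:
            self.commit_index = max(self.commit_index, index)

leader = RaftLeader(cluster_size=5)
i = leader.append("SET x=1")
leader.on_follower_ack(i)        # 2 of 5 hold the entry: not committed yet
assert leader.commit_index == -1
leader.on_follower_ack(i)        # 3 of 5 -- majority -- committed
assert leader.commit_index == i
```

This is also why the key insight below holds: once committed, the entry exists on at least 3 of 5 nodes, so any future majority-elected leader must include one of them.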
Key Insight

A write is committed only when a majority of nodes have it. This means even if the leader crashes immediately after responding to the client, the committed entry exists on a majority of nodes and will survive. The new leader (elected from the majority) will have the entry — no data is lost.

Figure 8: Four production consensus systems — etcd (Kubernetes), ZooKeeper (Kafka), CockroachDB (distributed SQL), Consul (service mesh)
| System | Protocol | Primary Use | Cluster Size | Used By |
|---|---|---|---|---|
| etcd | Raft | Distributed KV store, K8s state | 3 or 5 | Kubernetes, CoreDNS |
| ZooKeeper | ZAB (Paxos-like) | Coordination, Kafka metadata | 3 or 5 | Kafka, Hadoop, HBase |
| CockroachDB | Raft (per range) | Distributed SQL database | 3+ per range | DoorDash, Netflix |
| Consul | Raft | Service discovery, KV config | 3 or 5 | HashiCorp ecosystem |
| Google Spanner | Paxos + TrueTime | Global SQL, external consistency | 5+ per split | Google internally |
Interview Tip: Never Implement Consensus Yourself

"For leader election, I use etcd's election API (backed by Raft consensus). For distributed configuration, I use etcd's KV store. I do NOT implement Raft myself — consensus algorithms are notoriously hard to get right, and battle-tested implementations like etcd and ZooKeeper handle all edge cases." This shows you know when to build vs when to use existing tools.

Concept 6

Consistency Models & Decision Guide

Consistency models define what a client can observe when reading data that is being written concurrently by others. Stronger consistency means clients always see the latest data, but it is slower (every read must check with the majority). Weaker consistency means clients may see stale data, but reads are faster (any replica can respond).

Figure 9: From linearizable (strictest, slowest) to eventual consistency (weakest, fastest) — the fundamental trade-off in distributed systems
| Model | Guarantee | Latency | Use Case |
|---|---|---|---|
| Linearizable | Every read sees the most recent write. Strictest. | High (quorum read) | Banking, leader election, locks |
| Sequential | All nodes see operations in the same order. | Medium | ZooKeeper, ordered logs |
| Causal | Causally related operations are ordered; concurrent ones may differ. | Lower | Social feeds (see your own posts immediately) |
| Eventual | All replicas converge eventually. No ordering guarantee. | Lowest | Cassandra, DynamoDB, DNS, CDN |

Most production systems use eventual consistency for the majority of operations (reads from replicas, cache-aside pattern) and strong consistency only where it matters (financial transactions, inventory, leader election). Application-level strategies like read-your-writes (route the user's own reads to the primary) give the illusion of strong consistency for the most important use case — the user seeing their own recent changes — without the performance cost of making every read linearizable.
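Read-your-writes routing can be sketched as a per-user freshness window; the names and the 10-second threshold here are illustrative, not a standard:

```python
import time

RECENT_WRITE_WINDOW = 10.0   # seconds; must cover replication lag (assumption)

last_write_at = {}           # user_id -> timestamp of their last write

def record_write(user_id, now=None):
    last_write_at[user_id] = time.monotonic() if now is None else now

def choose_replica(user_id, now=None):
    # Users who wrote recently read from the primary, so they always see
    # their own change; everyone else reads from a possibly-stale replica.
    now = time.monotonic() if now is None else now
    wrote_at = last_write_at.get(user_id)
    if wrote_at is not None and now - wrote_at < RECENT_WRITE_WINDOW:
        return "primary"
    return "replica"

record_write("alice", now=100.0)
assert choose_replica("alice", now=103.0) == "primary"   # within the window
assert choose_replica("alice", now=120.0) == "replica"   # replicas caught up
assert choose_replica("bob", now=103.0) == "replica"     # never wrote
```

Only the writer pays the primary-read cost, and only briefly; every other read stays on the cheap eventually-consistent path.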

Figure 10: Decision guide — which problem maps to which solution and which tool
| Problem | Solution | Tool | Example |
|---|---|---|---|
| Prevent two users buying the last item | Distributed Lock | Redis SETNX / etcd lock | E-commerce inventory |
| Choose one DB primary to accept writes | Leader Election | etcd / ZooKeeper | PostgreSQL failover |
| All nodes agree on the order of writes | Consensus (Raft) | etcd / CockroachDB | Replicated state machine |
| One scheduler assigns tasks to workers | Leader Election | etcd / ZooKeeper | Job scheduler, Kafka controller |
| Store cluster configuration reliably | Consensus KV store | etcd / Consul | Kubernetes cluster state |
| Prevent split-brain after network partition | Majority Quorum | Raft / ZAB | Any replicated system |

Pre-Class Summary

Six Concepts to Know Before Class

Why It's Hard

Unreliable networks, unsynchronized clocks, and process crashes. Every distributed algorithm exists to cope with these three demons.

Race Conditions

Lost updates happen when two operations read and write without coordination. Default to atomic SQL for counters; distributed locks for cross-service coordination.

Distributed Locks

Redis SETNX + TTL for basic locking. Redlock for multi-node safety. Fencing tokens add correctness. For critical ops, use etcd (consensus-backed).

Leader Election

Heartbeats + majority voting. Split brain prevented by N/2+1 quorum. Always use odd cluster sizes (3, 5, 7). etcd and ZooKeeper implement this.

Consensus (Raft)

Leader election + log replication + safety. Writes committed when majority ACK. Used by etcd, CockroachDB, Consul. Never implement yourself.

Consistency Models

Linearizable (strictest) → eventual (fastest). Use eventual consistency for most reads. Use strong consistency only for financial operations and locks.
