System design is the highest-weighted round at senior SDE, SDE-2, and SDE-3 levels across Indian tech companies. A poor DSA round might get you rejected from one offer — a poor system design round gets you rejected from every offer you chase. This guide covers the complete framework, all 20+ most-asked design problems, theory fundamentals, and India-specific fintech/e-commerce context you won't find elsewhere.
Questions in Indian companies blend global patterns (URL shortener, Twitter) with local flavor: UPI payment systems, Aadhaar-linked KYC, IRCTC-scale flash sales, Zomato real-time delivery, and Jio CDN at 400M+ users. Know both.
What System Design Interviews Actually Test
Interviewers don't want the "right" architecture — they want to see how you think. The evaluation is across 5 dimensions:
Requirements Clarity (20%)
Do you ask the right clarifying questions? Functional vs non-functional. DAU, QPS, storage estimation.
High-Level Design (25%)
Can you draw a clean architecture diagram with clear component responsibilities and data flows?
Deep Dive (30%)
When probed, can you explain the internals: DB schema, API contracts, consistency model, sharding strategy?
Trade-offs (15%)
SQL vs NoSQL, consistency vs availability, monolith vs microservices — do you know when and why?
Scale & Bottlenecks (10%)
Where does your design break at 10x traffic? How would you fix it with caching, sharding, queues?
The 5-Step System Design Framework
Use this exact framework for every system design problem. Practice until it's muscle memory.
Step 1 — Clarify Requirements
Never start designing without asking these:
- Functional: What features must we support? What is out of scope?
- Scale: How many DAU? QPS at peak? Data size in 5 years?
- Non-functional: Latency SLA (p99 under 200ms?), availability (99.9%?), consistency requirements?
- Constraints: On-prem or cloud? Specific tech stack?
Step 2 — Back-of-Envelope Estimation
| Metric | Formula | Example (100M DAU) |
|---|---|---|
| Read QPS | DAU × reads/day ÷ 86400 | 100M × 10 ÷ 86400 ≈ 11,500 QPS |
| Write QPS | DAU × writes/day ÷ 86400 | 100M × 1 ÷ 86400 ≈ 1,150 QPS |
| Storage / year | writes/day × record_size × 365 | 100M × 500B × 365 ≈ 18 TB/yr |
| Bandwidth (read) | Read QPS × response_size | 11,500 × 1 KB ≈ 11.5 MB/s |
| Cache needed | 20% hot data rule | 18 TB × 20% ≈ 3.6 TB |
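The table's arithmetic can be reproduced with a few lines of Python. This is a minimal sketch: the function name `estimate_capacity` and its parameter names are illustrative, and the results are daily averages, so any peak multiplier would still need to be applied on top.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(dau, reads_per_user_per_day, writes_per_user_per_day,
                      record_bytes, response_bytes, hot_fraction=0.20):
    """Return average (not peak) load figures using the formulas in the table."""
    read_qps = dau * reads_per_user_per_day / SECONDS_PER_DAY
    write_qps = dau * writes_per_user_per_day / SECONDS_PER_DAY
    storage_per_year = dau * writes_per_user_per_day * record_bytes * 365
    return {
        "read_qps": read_qps,
        "write_qps": write_qps,
        "storage_bytes_per_year": storage_per_year,
        "read_bandwidth_bytes_per_s": read_qps * response_bytes,
        "cache_bytes": storage_per_year * hot_fraction,  # 20% hot data rule
    }


# Same inputs as the example column: 100M DAU, 10 reads and 1 write per user per day.
estimate = estimate_capacity(
    dau=100_000_000, reads_per_user_per_day=10, writes_per_user_per_day=1,
    record_bytes=500, response_bytes=1_000)

for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

In an interview the same numbers are usually rounded mentally; the point of the sketch is only to show which inputs feed which estimate.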
System Design Fundamentals You Must Know
CAP Theorem
A distributed system can guarantee at most 2 of 3: Consistency (all nodes see same data), Availability (every request gets a response), Partition Tolerance (system works despite network failures). Since partitions are inevitable in any real distributed system, you choose between CP or AP:
CP Systems
Consistent + Partition Tolerant. May return errors during partition. Examples: HBase, ZooKeeper, etcd. Use for financial data, inventory counts.
AP Systems
Available + Partition Tolerant. May return stale data during partition. Examples: Cassandra, DynamoDB, Couchbase. Use for social feeds, product catalog.
Consistent Hashing
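When you add or remove servers under traditional modulo hashing (key % N), almost all keys remap, causing massive cache misses. Consistent hashing places servers and keys on a virtual ring, so on average only about K/N keys remap when a server is added or removed (where K = keys, N = servers). Used in: distributed caches (Memcached clients with ketama hashing), CDN routing, Cassandra partitioning. Note that Redis Cluster takes a related but different approach: it maps keys to a fixed set of 16,384 hash slots rather than a hash ring.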
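A minimal sketch of a hash ring, to make the remapping claim concrete. The class name `HashRing`, the use of MD5, and the choice of 100 virtual nodes per server are assumptions for illustration, not a description of any particular product's implementation.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 used only for spread)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each server appears at many ring positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping at the end.
        idx = bisect.bisect_right(self._ring, (ring_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {key: ring.get_node(key) for key in keys}

ring.add_node("cache-d")
after = {key: ring.get_node(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
ring_fraction = len(moved) / len(keys)
# Same keys under key % N when N goes from 3 to 4.
modulo_fraction = sum(ring_hash(k) % 3 != ring_hash(k) % 4 for k in keys) / len(keys)
print(f"ring: {ring_fraction:.0%} moved, modulo: {modulo_fraction:.0%} moved")
```

With four servers after the change, roughly a quarter of the keys should move on the ring, and every moved key lands on the new server; under modulo hashing most keys move.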
Database Selection Guide
| Scenario | Best Choice | Why |
|---|---|---|
| User profiles, transactions | PostgreSQL / MySQL | ACID, complex joins, strong consistency |
| Social feed, timelines | Cassandra | Wide-column, high write throughput, AP |
| Product catalog, content | MongoDB | Flexible schema, document model |
| Session, rate limit counters | Redis | In-memory, atomic ops, TTL support |
| Full-text search | Elasticsearch | Inverted index, ranking, fuzzy search |
| Time-series (metrics, logs) | InfluxDB / TimescaleDB | Optimized for sequential writes, rollup queries |
| Graph relationships (friends, recs) | Neo4j / Neptune | Native graph traversal, BFS/DFS at scale |
| Large files, media | S3 / GCS / MinIO | Object storage, cheap, CDN-integrated |
Key Scalability Patterns
Caching (L1/L2/CDN)
Redis/Memcached for hot data. Cache-aside, write-through, write-back. Cache invalidation is the hardest problem.
Message Queues
Kafka for high-throughput event streaming. RabbitMQ for task queues. Decouple producers/consumers, enable async processing.
Database Sharding
Horizontal partitioning by user_id, geo, or date. Consistent hashing for even distribution. Watch out for hot shards.
Read Replicas
Master for writes, N read replicas for reads. Eventual consistency. Typical read:write ratio at scale is 100:1.
CDN
Push static content to edge nodes (Cloudflare, Akamai). For users far from the origin, this can cut latency from roughly 300 ms to roughly 30 ms. Critical for media-heavy apps.
Load Balancing
Round-robin, least-connections, IP hash. Layer 4 vs Layer 7. Sticky sessions for stateful services.
System Design Problem Frequency by Company
| Design Problem | Amazon | Flipkart | Zomato/Swiggy | Paytm/PhonePe | |
|---|---|---|---|---|---|
| URL Shortener | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ | ★★☆☆☆ |
| Social Media Feed (Twitter) | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★☆☆☆ |
| Ride-Sharing (Uber) | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
| Video Streaming (Netflix) | ★★★★★ | ★★★★★ | ★★★★☆ | ★★☆☆☆ | ★★☆☆☆ |
| Chat (WhatsApp) | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ |
| Payment System / UPI | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ | ★★★★★ |
| Rate Limiter | ★★★★★ | ★★★★★ | ★★★★☆ | ★★★☆☆ | ★★★★★ |
| Notification System | ★★★★☆ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| E-Commerce Search | ★★★☆☆ | ★★★★★ | ★★★★★ | ★★★☆☆ | ★★★☆☆ |
| Real-Time Location | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★★★ | ★★★☆☆ |
Top 20 System Design Problems With Solutions
1. URL Shortener (like bit.ly)
Scale: 100M URLs shortened/day, 10B redirects/day (100:1 read:write ratio)
Core Components:
- API Layer: POST /shorten → returns shortCode; GET /{code} → 301 redirect
- Short Code Generation: Base62 encoding of auto-increment ID (7 chars = 62^7 = 3.5 trillion unique URLs)
- Storage: MySQL for URL mapping (id, short_code, original_url, created_at, user_id)
- Cache: Redis with LRU eviction — 80% of redirects hit top 20% of URLs
- CDN: Cache redirect responses at edge for sub-10ms latency
Key Trade-off: 301 (permanent) redirect lets browsers cache and skip your servers entirely — better for performance. 302 (temporary) redirect forces all requests through your servers — needed for click analytics.
Interviewers love asking: "How do you handle custom URLs like bit.ly/my-brand?" Add a custom_alias column, check uniqueness before write, and reject duplicates with 409 Conflict.
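Base62 encoding of an auto-increment ID can be sketched as below. The alphabet ordering (digits, then lowercase, then uppercase) is an assumption; any fixed ordering of the 62 characters works as long as encode and decode agree.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols
BASE = len(ALPHABET)


def encode_base62(number: int) -> str:
    """Turn a non-negative database ID into a short code."""
    if number < 0:
        raise ValueError("ID must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Recover the database ID from a short code."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number


largest_7_char_id = BASE ** 7 - 1
print(encode_base62(125), encode_base62(largest_7_char_id))
```

Sequential IDs produce guessable codes, so a real service might shuffle the alphabet or encode a randomized ID instead; that choice is outside this sketch.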
2. Twitter / Social Media Feed
Scale: 300M DAU, 500M tweets/day, 28K tweet QPS at peak
The core challenge is feed generation. Two approaches:
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Fan-out on Write (Push) | When user tweets, immediately push to all followers' feed caches | Instant feed reads, O(1) read | Celebrity with 10M followers = 10M writes per tweet |
| Fan-out on Read (Pull) | Merge timelines of followed users on read | No write amplification | Slow reads, high latency for active users |
| Hybrid (Twitter actual) | Push for regular users, pull for celebrities (>10K followers) | Best of both worlds | Complex implementation |
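The hybrid row can be sketched with in-memory dictionaries standing in for the follower graph, tweet store, and feed cache. `FeedService` and its method names are hypothetical, and the demo uses a tiny celebrity threshold so the behaviour is easy to see.

```python
from collections import defaultdict


class FeedService:
    def __init__(self, celebrity_threshold=10_000):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.tweets_by_author = defaultdict(list)
        self.feed_cache = defaultdict(list)       # user -> entries pushed at write time
        self._clock = 0                           # logical timestamp for ordering

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) > self.celebrity_threshold

    def post(self, author, text):
        self._clock += 1
        entry = (self._clock, author, text)
        self.tweets_by_author[author].append(entry)
        if not self.is_celebrity(author):
            # Fan-out on write: copy the entry into every follower's cache.
            for follower in self.followers[author]:
                self.feed_cache[follower].append(entry)
        return entry

    def feed(self, user, limit=20):
        # A set removes duplicates if an author crossed the threshold mid-history.
        entries = set(self.feed_cache[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out on read: merge celebrity tweets at query time.
                entries.update(self.tweets_by_author[author])
        return sorted(entries, reverse=True)[:limit]


svc = FeedService(celebrity_threshold=2)  # tiny threshold keeps the demo small
svc.follow("alice", "bob")
for fan in ("alice", "u1", "u2"):
    svc.follow(fan, "star")
svc.post("bob", "hi")
svc.post("star", "big news")
alice_feed = svc.feed("alice")
print(alice_feed)
```

Here `bob` is pushed into Alice's cache at write time, while `star` costs nothing on write and is merged only when Alice reads her feed.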
3. WhatsApp / Chat System
Scale: 2B users, 100B messages/day (1.16M messages/second)
Key Decisions:
- Protocol: WebSocket for persistent bidirectional connection (not HTTP polling)
- Message Storage: Cassandra — high write throughput, no complex joins, TTL for message expiry
- Message ID: Snowflake ID (timestamp + machine ID + sequence) — globally unique, time-sortable
- Delivery Receipt: Single tick (sent to server), double tick (delivered to device), blue tick (read)
- Online Presence: Redis with user_id → last_seen timestamp, TTL of 30s
- Media: S3 + CDN. Client compresses before upload. Key each file by its content hash so identical files are stored only once.
- End-to-End Encryption: Signal Protocol — X3DH key exchange, Double Ratchet algorithm
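The Snowflake-style message ID from the list above can be sketched as follows. The bit layout (41-bit millisecond timestamp, 10-bit machine ID, 12-bit sequence) follows the commonly described Twitter Snowflake format; the custom epoch and the class name `SnowflakeGenerator` are assumptions for illustration.

```python
import threading
import time


class SnowflakeGenerator:
    EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC, an arbitrary custom epoch (assumption)
    MACHINE_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, machine_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < (1 << self.MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self.machine_id = machine_id
        self._clock = clock
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << self.SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp_part = (now - self.EPOCH_MS) << (self.MACHINE_BITS + self.SEQUENCE_BITS)
            machine_part = self.machine_id << self.SEQUENCE_BITS
            return timestamp_part | machine_part | self._sequence


generator = SnowflakeGenerator(machine_id=7)
ids = [generator.next_id() for _ in range(5_000)]
print(ids[0], ids[-1])
```

Because the timestamp occupies the high bits, sorting by ID sorts by creation time within one generator, which is what makes these IDs convenient as chat message keys.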
4. Netflix / Video Streaming
Scale: 238M subscribers, 15% of global internet traffic, 1B+ hours watched/day
- Video Upload Pipeline: Upload raw video → transcode into multiple resolutions (240p to 4K) → encode in multiple formats (H.264, H.265, AV1) → segment into 2-4 second chunks → upload all to CDN
- Adaptive Bitrate Streaming (DASH/HLS): Client measures bandwidth every 5s, switches quality tier dynamically
- CDN Strategy: Netflix Open Connect — proprietary ISP-embedded CDN boxes. 95% traffic served from ISP's data center
- Recommendation: Collaborative filtering + content-based filtering. A/B test thumbnail art per user segment
- Storage: S3 for raw video, Cassandra for user activity/watch history, MySQL for billing
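The adaptive bitrate step can be illustrated with a simple throughput-based rule: pick the highest rendition whose bitrate fits within a safety margin of the measured bandwidth. The bitrate ladder values, the 0.8 safety factor, and the function name `pick_rendition` are illustrative assumptions, not Netflix's actual ladder or algorithm.

```python
# Illustrative bitrate ladder in kbps (hypothetical values).
LADDER_KBPS = {"240p": 400, "480p": 1_000, "720p": 2_500, "1080p": 5_000, "4K": 16_000}


def pick_rendition(measured_kbps, ladder=LADDER_KBPS, safety=0.8):
    """Choose the best rendition that fits within a fraction of measured bandwidth."""
    budget = measured_kbps * safety
    affordable = [(rate, name) for name, rate in ladder.items() if rate <= budget]
    if not affordable:
        # Bandwidth is below every tier: fall back to the lowest quality.
        return min(ladder, key=ladder.get)
    return max(affordable)[1]


# Simulated bandwidth samples, one per measurement interval.
samples_kbps = [300, 4_000, 7_000, 50_000]
choices = [pick_rendition(sample) for sample in samples_kbps]
print(choices)
```

Real players also consider buffer occupancy and avoid switching too often; this sketch covers only the bandwidth-to-tier mapping.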
5. Uber / Ride-Sharing
Scale: 5M trips/day, 93M MAU, 3.5M drivers
- Driver Location: Drivers send GPS every 4 seconds → Kafka → Location Service → Redis Geohash index
- Matching: Geohash-based spatial lookup (find all drivers in same 500m cell + neighbors) → score by ETA + rating → assign best driver
- Surge Pricing: Supply/demand ratio in each Geohash cell. Redis counter per cell, update every 5 min. Multiplier = f(demand/supply)
- Trip State Machine: REQUESTED → ACCEPTED → ARRIVED → IN_TRIP → COMPLETED / CANCELLED
- Communication: Rider and driver apps each hold a WebSocket connection to a push gateway, which relays trip updates between them
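The trip state machine above can be enforced with an explicit transition table. The states come from the list; which states allow cancellation is an assumption here (cancellation is permitted only before the trip starts), and the names `Trip` and `can_transition` are hypothetical.

```python
from enum import Enum


class TripState(Enum):
    REQUESTED = "REQUESTED"
    ACCEPTED = "ACCEPTED"
    ARRIVED = "ARRIVED"
    IN_TRIP = "IN_TRIP"
    COMPLETED = "COMPLETED"
    CANCELLED = "CANCELLED"


# Assumption: a trip can be cancelled at any point before it is IN_TRIP.
ALLOWED_TRANSITIONS = {
    TripState.REQUESTED: {TripState.ACCEPTED, TripState.CANCELLED},
    TripState.ACCEPTED: {TripState.ARRIVED, TripState.CANCELLED},
    TripState.ARRIVED: {TripState.IN_TRIP, TripState.CANCELLED},
    TripState.IN_TRIP: {TripState.COMPLETED},
    TripState.COMPLETED: set(),
    TripState.CANCELLED: set(),
}


def can_transition(current, new):
    return new in ALLOWED_TRANSITIONS[current]


class Trip:
    def __init__(self, trip_id):
        self.trip_id = trip_id
        self.state = TripState.REQUESTED
        self.history = [TripState.REQUESTED]

    def transition(self, new_state):
        if not can_transition(self.state, new_state):
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        self.history.append(new_state)


trip = Trip("trip-1")
for state in (TripState.ACCEPTED, TripState.ARRIVED, TripState.IN_TRIP, TripState.COMPLETED):
    trip.transition(state)
print([s.name for s in trip.history])
```

Keeping the table in one place makes it easy to reject out-of-order events, for example a late "accepted" message arriving after a cancellation.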
6. Rate Limiter
Algorithms Compared:
| Algorithm | How | Pros | Cons |
|---|---|---|---|
| Token Bucket | Bucket refills at fixed rate; each request consumes a token | Allows bursts, simple | Bursts up to bucket capacity reach the backend; capacity and refill rate need tuning |
| Leaky Bucket | Queue requests; process at constant rate | Smooth output rate | Drops if queue full |
| Fixed Window | Count per fixed time window (e.g., 100/minute) | Simple | Double-rate at window edge |
| Sliding Window Log | Log each request timestamp; count in last N seconds | Precise | High memory usage |
| Sliding Window Counter | Weighted sum of current + previous window | Low memory, accurate | Slightly approximate |
Redis implementation with sliding window counter:
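A minimal in-memory sketch of the sliding window counter algorithm. A plain dictionary stands in for Redis here so the example runs anywhere; in Redis, each window would typically be a counter key updated with `INCR` and cleaned up with `EXPIRE`. The class name and the injectable `clock` parameter are assumptions for illustration.

```python
import time


class SlidingWindowCounter:
    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._counts = {}  # (key, window_index) -> count; stands in for Redis keys

    def allow(self, key):
        now = self._clock()
        index = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        current = self._counts.get((key, index), 0)
        previous = self._counts.get((key, index - 1), 0)
        # Weight the previous window by how much of it still overlaps the
        # sliding window that ends now.
        estimated = previous * (1 - elapsed_fraction) + current
        if estimated >= self.limit:
            return False
        self._counts[(key, index)] = current + 1
        # Best-effort cleanup; a Redis key TTL would handle this automatically.
        self._counts.pop((key, index - 2), None)
        return True


fake_now = [0.0]  # controllable clock so the demo is deterministic
limiter = SlidingWindowCounter(limit=10, window_seconds=60, clock=lambda: fake_now[0])

first_window = [limiter.allow("user-1") for _ in range(11)]

fake_now[0] = 90.0  # halfway through the next window
allowed_midway = sum(limiter.allow("user-1") for _ in range(10))
print(first_window, allowed_midway)
```

Halfway through the second window, half of the previous window's 10 requests still count, so only 5 more are admitted. A production version against Redis would need the read and increment to be atomic, for example via a Lua script.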
7. Notification System
Scale: 10M notifications/day, support push (FCM/APNs), SMS (Twilio), email (SES), in-app
- Architecture: API → Notification Service → Priority Kafka Topics → Channel Workers (Push/SMS/Email) → Provider SDKs
- Priority Queues: Critical (OTP, transaction alert) = high-priority topic with 3x consumers. Marketing = low-priority
- User Preferences: DND hours, channel preferences, category opt-outs stored in Redis (fast read)
- Deduplication: Redis SET with idempotency key (user_id + event_id + channel), TTL 24h
- Retry with Exponential Backoff: Failed → retry after 1s, 2s, 4s, 8s, 16s → DLQ after 5 failures
- Template Engine: Mustache/Handlebars templates + user data injection + A/B testing
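The deduplication and retry bullets can be combined in one small delivery function. This sketch reads the retry rule as one initial attempt plus up to five retries with delays of 1, 2, 4, 8, and 16 seconds before dead-lettering; a Python set stands in for the Redis SET, and `deliver`, `send`, and the notification fields are hypothetical names.

```python
import time


def backoff_delays(base_seconds=1, retries=5):
    """Delays before each retry: 1, 2, 4, 8, 16 seconds by default."""
    return [base_seconds * 2 ** attempt for attempt in range(retries)]


def deliver(send, notification, seen_keys, dead_letters, sleep=time.sleep):
    # Idempotency key as described above: user_id + event_id + channel.
    key = (notification["user_id"], notification["event_id"], notification["channel"])
    if key in seen_keys:
        return "duplicate"
    seen_keys.add(key)  # simplification: marked before delivery succeeds

    delays = backoff_delays()
    for attempt in range(len(delays) + 1):
        try:
            send(notification)
            return "sent"
        except ConnectionError:
            if attempt < len(delays):
                sleep(delays[attempt])
    dead_letters.append(notification)
    return "dead-lettered"


slept, seen, dlq = [], set(), []
attempts = {"count": 0}


def flaky_send(notification):
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ConnectionError("provider unavailable")


def always_down(notification):
    raise ConnectionError("provider unavailable")


otp = {"user_id": 42, "event_id": "otp-1", "channel": "sms"}
promo = {"user_id": 42, "event_id": "sale-9", "channel": "email"}
first = deliver(flaky_send, otp, seen, dlq, sleep=slept.append)
second = deliver(flaky_send, otp, seen, dlq, sleep=slept.append)
third = deliver(always_down, promo, seen, dlq, sleep=slept.append)
print(first, second, third, slept)
```

Passing `sleep` as a parameter keeps the demo instant; a worker would use real delays, usually with some random jitter added.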
8. E-Commerce Search (Flipkart / Amazon)
Scale: 500M products, 50M searches/day, results in under 100ms
- Indexing: Elasticsearch inverted index. Product attributes tokenized. Synonyms (saree = sari). Stemming (running = run)
- Ranking Signals: BM25 relevance + CTR + conversion rate + price + in-stock status + personalization score
- Faceted Filtering: Brand, price range, rating, category → Elasticsearch aggregations
- Typeahead/Autocomplete: Redis sorted set or Elasticsearch completion suggester. Prefix trie for instant results
- A/B Testing: Multiple ranking models run in parallel. Winner promoted after statistical significance
9. Payment System / UPI (India-Specific)
Scale: 10B UPI transactions/month (Paytm/PhonePe combined), which averages roughly 3,900 TPS, with festival peaks well above that average
UPI flows: Payer App → NPCI Switch → Payee Bank. Every payment app operates through a PSP (Payment Service Provider) registered with NPCI. Two-factor auth = UPI PIN + device binding. Idempotency is mandatory because NPCI can send duplicate callbacks.
- Idempotency: Unique transaction_id per payment. Check Redis cache before processing. Return cached result for duplicates.
- Double-Entry Ledger: Every debit has a corresponding credit. Prevents money creation/destruction bugs.
- Saga Pattern: Distributed transaction across Debit Account → Credit Account → Send Confirmation. Compensating transactions for rollback.
- Fraud Detection: Real-time ML model (velocity checks, device fingerprint, geo-anomaly). Block if risk score > threshold. Async review for medium risk.
- Settlement: T+1 settlement with clearing house (NPCI). Net settlement — batch all transactions and settle net amounts.
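The idempotency and double-entry bullets can be sketched together. A dictionary stands in for the Redis idempotency cache, amounts are integer paise to avoid floating-point money, and the `Ledger` class and its method names are hypothetical.

```python
class Ledger:
    def __init__(self, opening_balances):
        self.balances = dict(opening_balances)  # account -> balance in paise
        self.entries = []                       # (transaction_id, account, signed amount)
        self._results = {}                      # transaction_id -> cached result

    def transfer(self, transaction_id, payer, payee, amount):
        # Idempotency: a repeated transaction_id returns the cached result
        # instead of moving money a second time.
        if transaction_id in self._results:
            return self._results[transaction_id]
        if amount <= 0:
            raise ValueError("amount must be positive")

        if self.balances.get(payer, 0) < amount:
            result = {"transaction_id": transaction_id, "status": "FAILED",
                      "reason": "insufficient funds"}
        else:
            # Double entry: one debit and one matching credit per transfer.
            self.entries.append((transaction_id, payer, -amount))
            self.entries.append((transaction_id, payee, +amount))
            self.balances[payer] -= amount
            self.balances[payee] = self.balances.get(payee, 0) + amount
            result = {"transaction_id": transaction_id, "status": "SUCCESS"}

        self._results[transaction_id] = result
        return result

    def is_balanced(self):
        """Debits and credits must net to zero across the whole ledger."""
        return sum(amount for _, _, amount in self.entries) == 0


ledger = Ledger({"alice": 10_000, "bob": 0})
original = ledger.transfer("txn-1", "alice", "bob", 2_500)
duplicate = ledger.transfer("txn-1", "alice", "bob", 2_500)  # duplicate callback
rejected = ledger.transfer("txn-2", "bob", "alice", 9_999)
print(original, ledger.balances, ledger.is_balanced())
```

In a real system the idempotency check and the ledger write would need to happen atomically (for example inside one database transaction), which this single-process sketch does not model.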
10. Real-Time Analytics Dashboard
- Ingestion: Events → Kafka (partitioned by event_type) → Flink/Spark Streaming → Aggregate metrics → TimescaleDB
- Pre-aggregation: 1-min, 5-min, 1-hr, 1-day rollups stored separately for fast dashboard queries
- OLAP: ClickHouse or Apache Druid for sub-second ad-hoc analytics on billions of rows
- Hot Path vs Cold Path: Real-time (Kafka → Redis) for live counters; Batch (S3 → Spark → DW) for historical analysis
30 Most-Asked Design Problems by Category
Microservices vs Monolith — When to Use What
| Factor | Monolith | Microservices |
|---|---|---|
| Team Size | < 10 engineers | 50+ engineers, multiple teams |
| Stage | 0→1 (startup) | 1→N (scale) |
| Deployment | Simple, one unit | Complex, per-service CI/CD |
| Data | Shared database | Database per service |
| Scalability | Scale everything together | Scale hot services independently |
| Failure Isolation | One bug can crash all | Circuit breakers limit blast radius |
| Real Examples | Early Zomato, CRED MVP | Amazon, Flipkart, Paytm at scale |
6 Common Mistakes in System Design Interviews
Jumping to Solution
Spending <2 min on requirements. Always clarify scale, latency SLA, and consistency requirements first.
Over-Engineering
Adding Kafka, Redis, Elasticsearch, and Kubernetes to a 10K DAU system. Start simple, scale when needed.
No Bottleneck Analysis
Not identifying where the system breaks at 10x load. Always ask: what's the single point of failure?
Forgetting Data Model
Drawing components without specifying DB schema. Interviewers want to see your DB design thinking.
Handwaving Trade-offs
"We'll use Kafka for reliability" — why not Redis? Why not SQS? Know your trade-offs.
One-Size-Fits-All DB
Using MySQL for everything. Show you know when to use NoSQL, time-series, or graph databases.
Compensation Data — System Design Skills Premium (India 2026)
Engineers who pass system design rounds command significantly higher packages. The "system design tax" is real.
3-Month System Design Mastery Plan
| Month | Focus | Resources | Goal |
|---|---|---|---|
| Month 1 | Fundamentals | Designing Data-Intensive Applications (Kleppmann), System Design Primer (GitHub) | Master CAP, consistent hashing, DB selection, caching strategies |
| Month 2 | Classic Problems | Grokking System Design, ByteByteGo, YouTube channels | Design URL shortener, Twitter, Uber, WhatsApp, Netflix from scratch |
| Month 3 | Mock Interviews | Prepflix, Exponent, Pramp, peer practice | 2 mock system design sessions/week, get feedback, iterate |
Pranjal Jain | May 18, 2026 | 25 min read