System Design Interview Playbook
Problem statement
In 45–60 minutes, communicate a clear design, its tradeoffs, and depth on demand, without drowning in premature microservices or buzzwords used without meaning.
How the interview “works”
- Clarify requirements (functional, scale, consistency, latency).
- Sketch high-level boxes + data flow in 5–10 minutes.
- Deep dive where the interviewer probes: usually storage, scaling the hot path, and failure modes.
Analogy: an architecture studio critique. The professor cares about load-bearing walls (bottlenecks) and evacuation routes (failure handling), not the color of the tile grout (the logo on the gateway box).
High-level process diagram
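(Diagram not rendered in this copy. Per the “Components explained” table, its stages are: Clarify requirements, Back-of-envelope, High-level diagram, Deep dives, Tradeoffs and close.)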
Low-level checklist (what “good” contains)
Requirements questions
- Read vs write ratio? Consistency vs availability priority?
- Latency p99 targets? Global or single region?
- Compliance (PII, HIPAA, PCI)?
Back-of-envelope
- QPS, storage/day, bandwidth, fan-out — round numbers OK; show reasoning.
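The arithmetic behind these estimates can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the `Workload` fields, the example figures, and the peak multiplier of 3 are all assumptions chosen as round numbers, and the hypothetical `estimate` helper simply multiplies them through.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Inputs for a rough estimate. All values are illustrative assumptions."""
    daily_active_users: int
    writes_per_user_per_day: float
    reads_per_write: float      # read:write ratio
    payload_bytes: int          # average size of one stored record
    fanout_per_write: int       # recipients each write is delivered to
    peak_factor: float = 3.0    # assumed peak-to-average multiplier


def estimate(w: Workload) -> dict:
    """Derive QPS, storage/day, bandwidth, and fan-out from the inputs."""
    writes_per_day = w.daily_active_users * w.writes_per_user_per_day
    write_qps = writes_per_day / SECONDS_PER_DAY
    read_qps = write_qps * w.reads_per_write
    return {
        "writes_per_day": writes_per_day,
        "write_qps_avg": write_qps,
        "read_qps_avg": read_qps,
        "read_qps_peak": read_qps * w.peak_factor,
        "storage_gb_per_day": writes_per_day * w.payload_bytes / 1e9,
        "read_bandwidth_mb_per_s": read_qps * w.payload_bytes / 1e6,
        "fanout_deliveries_per_day": writes_per_day * w.fanout_per_write,
    }


if __name__ == "__main__":
    # Made-up workload: 10M DAU, 2 writes each, 100 reads per write,
    # 1 KB records, each write delivered to 200 recipients.
    example = Workload(10_000_000, 2, 100, 1_000, 200)
    for name, value in estimate(example).items():
        print(f"{name}: {value:,.1f}")
```

In an interview you would say these steps aloud rather than run them; the point is that each output follows from a stated input, so the interviewer can challenge an assumption instead of the result.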
Core building blocks
| Concern | Typical tools (pick & justify) |
|---|---|
| Object blobs | S3 / Azure Blob Storage |
| Hot KV / cache | Redis / Memcached |
| OLTP | PostgreSQL / DynamoDB |
| Search | OpenSearch |
| Stream | Kafka / Kinesis / Event Hubs |
| Async work | SQS / Service Bus |
| Edge | CloudFront / Front Door |
| Auth | Cognito / Entra ID |
Failure modes (always mention)
- Partial outages — degrade (read-only mode), circuit breakers, bulkheads.
- Duplicate events — idempotency keys, outbox.
- Hot keys / partitions — sharding, caching, replication.
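As one illustration of the duplicate-events bullet, here is a minimal in-memory sketch of idempotency keys. The class name `IdempotentProcessor`, the `charge` handler, and the `order-42` key are hypothetical; a real system would more likely persist the key in a durable store (for example behind a unique constraint) rather than in a Python dict.

```python
import threading
from typing import Callable, Dict, List


class IdempotentProcessor:
    """Runs a handler at most once per idempotency key (in-memory sketch)."""

    def __init__(self, handler: Callable[[dict], str]) -> None:
        self._handler = handler
        self._results: Dict[str, str] = {}
        self._lock = threading.Lock()

    def process(self, idempotency_key: str, event: dict) -> str:
        # Holding the lock while the handler runs keeps the sketch simple
        # but serializes all work; that is an assumption, not a recommendation.
        with self._lock:
            if idempotency_key in self._results:
                # Duplicate delivery: return the stored result, skip the side effect.
                return self._results[idempotency_key]
            result = self._handler(event)
            self._results[idempotency_key] = result
            return result


charges: List[int] = []


def charge(event: dict) -> str:
    """Hypothetical side effect that must not happen twice."""
    charges.append(event["amount_cents"])
    return f"charged:{event['amount_cents']}"


processor = IdempotentProcessor(charge)
first = processor.process("order-42", {"amount_cents": 1999})
retry = processor.process("order-42", {"amount_cents": 1999})  # redelivered event
```

After the retry, `charges` still holds a single entry and both calls return the same result, which is the user-visible property to state in the interview: a redelivered event does not charge twice.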
Example pacing (45 minutes)
Mermaid Gantt dateFormat handling varies by renderer, so the pacing is given as a portable flowchart (not rendered in this copy) and as a table you can reuse in interviews.
| Phase | Minutes (guide) | Goal |
|---|---|---|
| Clarify | ~0–8 | Scope, NFRs, constraints |
| Envelope | ~8–14 | Rough capacity sanity |
| Diagram | ~14–24 | Boxes + data paths |
| Depth | ~24–40 | Storage, scale, failures |
| Close | ~40–45 | Recap + open questions |
Components explained — this design
| Item in diagram | What it is | Why it appears here |
|---|---|---|
| Clarify requirements | Interview phase, not a product. | You reduce the risk of designing the wrong system by pinning down read/write ratio, consistency, latency, and compliance before drawing boxes. |
| Back-of-envelope | Rough QPS, storage, bandwidth estimates. | Interviewers want quantitative thinking; numbers justify Kafka vs SQS, SQL vs NoSQL, etc. |
| High-level diagram | First architecture sketch. | Shows you can decompose without diving into premature microservices. |
| Deep dives | Storage, hot paths, failure modes. | Where senior signal lives: tradeoffs, not buzzwords. |
| Tradeoffs and close | CAP / cost / ops honesty + summary. | Demonstrates you know nothing is free (e.g. global consistency vs latency). |
Shared definitions: 00-glossary-common-services.md
Tricky parts (meta)
| Trap | Fix |
|---|---|
| Buzzword soup | Every technology you name gets one sentence stating its job |
| Overfitting CAP | Relate it to a concrete user-visible symptom |
| Ignoring cost | Mention egress cost and managed vs DIY ops |
| No numbers | Even rough numbers beat silence |
Caveats
- Interview ≠ production: present a pragmatic MVP first, then “if scale grows 10×” as a second chapter.
- Team skill is a constraint: boring and proven beats novel and risky, unless the startup explicitly wants R&D.
Quick tradeoff cheatsheet
- SQL vs NoSQL: flexible joins and ad hoc queries vs a clear partition-key access pattern.
- Sync vs async: the user waits for the result vs the UI tells them the work is still in progress.
- Strong vs eventual consistency: money movement vs social like counts.
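The sync vs async tradeoff can be made concrete with a small sketch. Everything here is hypothetical: `render_report` stands in for slow work, the in-process `queue.Queue` stands in for a managed queue, and the status codes mirror common HTTP usage (200 for a finished result, 202 for accepted work) without involving a real web framework.

```python
import queue
import uuid
from typing import Dict

jobs: queue.Queue = queue.Queue()   # stand-in for a managed queue
job_status: Dict[str, str] = {}     # stand-in for a status store


def render_report(customer_id: str) -> str:
    """Placeholder for slow work (hypothetical)."""
    return f"report-for-{customer_id}"


def handle_sync(customer_id: str) -> dict:
    # The user waits: response latency includes the full cost of the work.
    return {"status": 200, "body": render_report(customer_id)}


def handle_async(customer_id: str) -> dict:
    # Accept quickly and defer the work; the UI copy must admit it is pending.
    job_id = str(uuid.uuid4())
    job_status[job_id] = "pending"
    jobs.put((job_id, customer_id))
    return {
        "status": 202,
        "job_id": job_id,
        "message": "Your report is being prepared.",
    }


def run_worker_once() -> None:
    # A background worker drains the queue and records the result.
    job_id, customer_id = jobs.get_nowait()
    job_status[job_id] = render_report(customer_id)
```

The async path returns before any work is done, so the design also needs a way for the client to learn the outcome (polling `job_status` in this sketch), which is the extra moving part to mention when choosing it.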
Closing template
Summarize: APIs + storage + async path + scaling lever + two failure modes you handle. Invite questions.
You now have 50 companion docs in system-design/ — cross-link topics (e.g. payments + saga + idempotency) when studying for depth.