name: wbd-system-design-interview
description: |
Whiteboarding track project #8. Learner runs 3 timed system design interview drills
(URL shortener, Twitter feed, chat app) using a 6-step framework: requirements → API
→ data → high-level design → deep dive → trade-offs. Drills clarifying questions,
back-of-envelope capacity estimation, and the "narrate while drawing" tempo. Auto-load
when the learner is in whiteboarding/wbd-system-design-interview or asks about
system design interviews, design TinyURL, design Twitter, scalability whiteboarding,
or how to handle a 45-minute design interview.
Project: wbd-system-design-interview
Track: Whiteboarding · Project: 8 of 9 · Time: ~90 minutes
The system design interview is whiteboarding under pressure. 45 minutes. One interviewer watching every move. A vague prompt ("design Twitter"). Most candidates fail not because they don't know the systems, but because they don't have a framework — they freeze, jump straight to a database, or never ask what's actually being asked. This project gives the learner a 6-step framework, three rehearsals, and the "narrate while drawing" tempo that makes interviewers comfortable.
Project goal
When this project is done, the learner can:
- Walk into a 45-minute system design interview with a 6-step framework memorized: Requirements → API → Data Model → High-Level Design → Deep Dive → Trade-offs.
- Ask clarifying questions in the first 5 minutes — read scale, identify what's in scope.
- Do back-of-envelope capacity estimation (QPS, storage, bandwidth) without panic.
- Narrate while drawing — the interviewer can follow the thinking, not just see the result.
- Recover from getting stuck by stating the trade-off rather than freezing.
- Self-score with a 5-point rubric and identify the weakest of the six steps.
Scope guardrail
This is 3 timed drills + framework drill + recovery patterns. We are not memorizing every system design pattern (consistent hashing, gossip protocols, vector clocks: they exist; mention them if relevant, but don't drill them). The point: own the framework and the tempo. The patterns come from reading System Design Interview Vol 1/2 (Alex Xu) AFTER this project.
If the learner asks "should I read the books before this?" — answer honestly: no. Do this project first. The framework gives you a structure that makes the patterns from the books stick. Books first = facts in a vacuum.
Prerequisites
| Prereq | Verify with |
|---|---|
| Completed projects 2-5 (architecture, sequence, state, ER) — knows the diagram types | Can draw a 3-tier web app + a sequence diagram for one flow |
| Basic familiarity with REST APIs, SQL vs NoSQL, caching, queues | Can describe each in one sentence |
| A whiteboard or Excalidraw | — |
| A timer | Phone is fine |
Phases
Phase 1 — The 6-step framework (~10 min)
Goal: Memorize the framework. You'll use it for every interview.
The 6 steps + recommended time-box for a 45-minute interview:
| Step | Time | What you do |
|---|---|---|
| 1. Requirements | 5 min | Ask functional + non-functional. Get scale numbers. Confirm scope. |
| 2. API | 5 min | Sketch the public-facing API (REST endpoints, gRPC methods, or message contracts). |
| 3. Data model | 5 min | Sketch entities (ER style) — what data exists, what relationships. |
| 4. High-level design | 10 min | Box-and-arrow architecture: client, services, queues, DBs, caches, CDNs. |
| 5. Deep dive | 15 min | Interviewer picks 1-2 components or scenarios; go deep — scale, failure modes, alternatives. |
| 6. Trade-offs + wrap-up | 5 min | "What I'd do differently with more time, what I'd add, what concerns remain." |
The opening monologue (rehearse this):
When the interviewer says "design X," your FIRST words should be:
"Great. Before I jump in — let me ask a few questions to scope it. Then I'll sketch the API, then the data model, then the high-level architecture, then we can dive into whichever piece you want. Sound good?"
This buys you:
- A moment to think.
- An explicit contract about HOW you'll use the time.
- A signal that you have a framework (most candidates don't).
Drill — write the 6 steps on the board in big letters down the left margin:
1. REQS
2. API
3. DATA
4. HLD
5. DEEP
6. TRADE
Now they're visible to you AND to the interviewer. You can't lose track.
Concepts to name out loud:
- This is the framework as the scaffolding — you're not making up the structure on the fly. The framework holds while you focus on the content.
- This is why the opening monologue matters — interviewers grade you on communication too. Signaling structure in the first 30 seconds shifts the room's posture from skepticism to following along.
After-action prompt: "You memorized 6 steps. If the interviewer interrupts at step 3 and asks an unrelated question, you can say 'good question, I'll come back to that when we hit step 5 (deep dive).' That's what a framework lets you do."
Phase 2 — Drill #1: Design TinyURL (~25 min, timed)
Goal: First timed drill. The classic warm-up. Use the framework strictly.
Set a 25-minute timer. Use the framework time-boxes (5 / 5 / 5 / 10 / — / — for this short version — skip deep-dive for the warm-up).
Step 1 — Requirements (5 min):
Ask out loud (and answer for yourself, simulating an interviewer):
- Functional: Shorten a long URL into a short URL. Resolve a short URL back to the original (redirect). Maybe analytics?
- Non-functional: How many URLs created per day? (Assume 100M / day = ~1,200 / sec.) How many redirects? (Assume 10x reads = 1B / day = ~12,000 / sec). Latency? (< 100ms for redirect.) Availability? (99.99% — it's a redirect service.)
- Scope: custom aliases? Expiration? Analytics? (Decide what's in / out. Default: shortening + redirect + 5-year persistence; analytics is stretch.)
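The per-second figures in the non-functional bullet come from dividing daily volume by 86,400 seconds. A minimal sketch of that arithmetic, using only the numbers assumed in the drill (the function and variable names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: float) -> float:
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


# Numbers assumed in the drill: 100M new URLs/day, 10x as many redirects.
writes_per_day = 100_000_000
read_write_ratio = 10
reads_per_day = writes_per_day * read_write_ratio

write_qps = per_second(writes_per_day)
read_qps = per_second(reads_per_day)

print(f"write QPS ~ {write_qps:,.0f}")  # rounds to the drill's ~1,200/sec
print(f"read QPS  ~ {read_qps:,.0f}")   # rounds to the drill's ~12,000/sec
```

These are averages; real traffic peaks above them, so in the interview it is reasonable to say the rounded number and note that peak load will be some multiple of it.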
Step 2 — API (5 min):
POST /shorten
body: { url: string, custom_alias?: string }
→ 200 { short_url: string }
GET /{short_code}
→ 301 redirect to original
Step 3 — Data model (5 min):
URL_MAPPING
short_code (PK, varchar(8))
long_url (text)
created_at (timestamp)
created_by (user_id, nullable)
click_count (bigint)
expires_at (timestamp, nullable)
Mention: ~100M new URLs/day × 5 years = ~180B rows. Storage per row ~500 bytes → ~90 TB. Need sharding.
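The storage estimate above can be checked with a few lines of arithmetic. This sketch uses the drill's assumptions (100M URLs/day, 5 years, ~500 bytes per row); the base62 alphabet for the 8-character short_code and the decimal-terabyte conversion are additional assumptions made here for illustration.

```python
DAYS_PER_YEAR = 365
SECONDS_PER_DAY = 86_400

new_urls_per_day = 100_000_000  # from the drill
retention_years = 5             # from the drill
bytes_per_row = 500             # from the drill (rough average)

total_rows = new_urls_per_day * DAYS_PER_YEAR * retention_years
total_tb = total_rows * bytes_per_row / 10**12  # decimal terabytes

# Average ingest bandwidth implied by the same numbers.
write_bytes_per_sec = new_urls_per_day * bytes_per_row / SECONDS_PER_DAY

# Assumption: short_code is 8 characters drawn from a 62-symbol alphabet.
keyspace = 62 ** 8

print(f"rows after {retention_years} years: {total_rows:,}")
print(f"storage: ~{total_tb:.0f} TB")
print(f"write bandwidth: ~{write_bytes_per_sec / 1e6:.2f} MB/s")
print(f"8-char base62 keyspace: {keyspace:,}")
```

The keyspace comparison is the useful part to say out loud: 8 base62 characters leave a wide margin over the projected row count, so code length is not the constraint. Storage is.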
Step 4 — High-level design (10 min):
[Client] → [Load Balancer] → [API Servers] ──► [Redis Cache (short → long)]
                                           ──► [URL Database (sharded by short_code)]
                                           ──► [ID Generator (Snowflake or counter)]
[Analytics Pipeline] ◄── async events from API
Key design decisions to mention:
- Short code generation: counter-based (sequential, then base62-encode) OR hash-based (MD5 of URL, take first N chars). Counter is simpler; hash allows dedup of identical URLs. Mention both.
- Cache: Redis in front of the DB for hot URLs. ~90% of reads hit cache.
- Sharding: by short_code (consistent hashing), since lookups are by short_code.
- CDN for the redirect endpoint: the response is tiny (a 301), often cached at the edge for popular URLs.
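The counter-based option in the first bullet can be sketched as a base62 encoder over a monotonically increasing ID. This is an illustrative sketch, not a production ID scheme: the alphabet ordering and function names are assumptions, and the counter itself (Snowflake or a DB sequence) is left out.

```python
import string

# Assumed alphabet order: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def base62_encode(n: int) -> str:
    """Encode a non-negative counter value as a base62 short code."""
    if n < 0:
        raise ValueError("counter must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def base62_decode(code: str) -> int:
    """Inverse of base62_encode."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n


print(base62_encode(0))    # "0"
print(base62_encode(61))   # "Z"
print(base62_encode(62))   # "10"
```

One design point worth naming in the interview: sequential counters produce guessable codes, so if enumeration matters the counter value is usually scrambled before encoding.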
Step 5 + 6 (skip for the warm-up, or do a 5-min lightning trade-offs round):
- "If we used hash-based codes, we'd avoid storing duplicates, but truncated hashes can collide and custom aliases would need a separate path. I chose counter-based for simplicity."
- "Cache hit rate determines latency. If it drops below 90%, we'd see DB pressure."
- "Sharding by short_code means queries by any other key (find all my URLs) have to scatter-gather across every shard. We'd need a secondary index by user_id."
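The cache trade-off above is easy to quantify: database read load is total read load times the miss rate. A small sketch, using the drill's ~12,000 reads/sec; the specific hit rates tried are illustrative.

```python
def db_read_qps(total_read_qps: float, cache_hit_rate: float) -> float:
    """Reads per second that miss the cache and fall through to the database."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return total_read_qps * (1.0 - cache_hit_rate)


total_reads = 12_000  # from the drill: ~12,000 redirects/sec

for hit_rate in (0.95, 0.90, 0.80):
    print(f"hit rate {hit_rate:.0%}: ~{db_read_qps(total_reads, hit_rate):,.0f} DB reads/sec")
```

Going from a 90% to an 80% hit rate doubles database reads, which is why the hit rate is worth stating as an explicit assumption.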
Stop the timer. How did you do?
Concepts to name out loud:
- This is scale numbers as a forcing function — once you say "100M/day," it forces "sharded DB" rather than "single Postgres." Numbers drive architecture.
- This is how the framework saves you — even if you've never thought about URL shortening, the 6 steps give you a path. Don't deviate.
After-action prompt: "You ran the framework on a familiar problem. Where did you slow down? That's the weakest step. Drill that step before drill #2."
Phase 3 — Drill #2: Design Twitter feed (~25 min, timed)
Goal: Harder problem. The big trade-off: fan-out-on-read vs fan-out-on-write.
Set a 25-minute timer.
Step 1 — Requirements (5 min):
- Functional: Users post tweets. Users follow other users. Users see a feed of tweets from people they follow.
- Non-functional: 300M monthly active users. 200M tweets/day = ~2,300/sec write. Feed reads ~10B/day = ~115K/sec read. Latency < 200ms for feed load. Eventual consistency OK.
- Scope: Just the home feed. No DMs, no retweets, no media (for the interview).
Step 2 — API:
POST /tweets body: { text } → 201 { tweet_id }
GET /feed → list of recent tweets from followed users
POST /follow/{user_id}
DELETE /follow/{user_id}
Step 3 — Data model:
USER (id, username, ...)
TWEET (id, user_id, text, created_at)
FOLLOW (follower_id, followee_id, since)
Step 4 — High-level design:
[Client] → [LB] → [API Servers]
                        │
                        ├──► [TWEET DB] (sharded by user_id)
                        ├──► [USER DB]
                        └──► [FOLLOW DB]
[Feed Service] ◄── reads from TIMELINE CACHE (Redis) ── pre-computed per user
[Fan-Out Workers] ◄── consume TweetPosted events ── push tweet to followers' timeline cache
Step 5 — Deep dive (15 min) — the BIG question: fan-out-on-write vs fan-out-on-read.
Approach A: Fan-out-on-write (push)
- When user tweets, push the tweet ID into the timeline cache of every follower.
- Reads are cheap (just LRANGE from Redis).
- Writes are expensive: a celebrity with 50M followers means 50M cache writes per tweet.
- Storage: each user has a timeline cache (~ 1KB × users × cache_depth).
Approach B: Fan-out-on-read (pull)
- When user loads feed, query "tweets from people I follow, sorted by time, limit 100."
- Writes are cheap (just store the tweet).
- Reads are expensive — for a user following 1000 people, that's a 1000-way merge.
- No celebrity problem on write.
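The "1000-way merge" in Approach B is a k-way merge of per-author tweet lists that are each already sorted by time. A minimal in-memory sketch, assuming tweets are `(created_at, tweet_id)` tuples and each followee's list is sorted newest first (both are assumptions for illustration; a real system would read these lists from the tweet store):

```python
import heapq
from itertools import islice


def pull_feed(followee_timelines, limit=100):
    """Merge per-followee tweet lists (each newest-first) into one feed.

    heapq.merge streams the inputs lazily, so only `limit` items are
    consumed even if the followees have many tweets between them.
    """
    merged = heapq.merge(*followee_timelines, key=lambda t: t[0], reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, limit)]


# Three followees, each list sorted newest-first by created_at.
timelines = [
    [(30, "a3"), (10, "a1")],
    [(25, "b2"), (5, "b0")],
    [(40, "c4")],
]
print(pull_feed(timelines, limit=3))  # ['c4', 'a3', 'b2']
```

The merge itself is cheap. The read cost in the pull model comes from fetching one list per followee on every feed load.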
Approach C: Hybrid (the real answer)
- Push for most users (cheap writes, cheap reads).
- Pull for celebrities (avoid the 50M-write storm).
- At read time, merge pre-pushed timeline with celebrity-pulled tweets.
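The hybrid approach can be sketched in memory to make the three bullets concrete: push for ordinary authors, skip the push for celebrities, and merge at read time. Everything here is a simplified assumption for illustration: the `HybridFeed` class, the follower-count threshold, the timeline depth, and the use of Python dicts in place of Redis and the tweet DB.

```python
from collections import defaultdict, deque


class HybridFeed:
    def __init__(self, celebrity_threshold=10_000, timeline_depth=800):
        self.celebrity_threshold = celebrity_threshold      # hypothetical cutoff
        self.followers = defaultdict(set)                   # author -> followers
        self.following = defaultdict(set)                   # user -> authors
        self.tweets_by_author = defaultdict(list)           # author -> [(ts, id)]
        # Pre-computed per-user timelines, capped at timeline_depth entries.
        self.timelines = defaultdict(lambda: deque(maxlen=timeline_depth))

    def follow(self, follower, followee):
        self.followers[followee].add(follower)
        self.following[follower].add(followee)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post(self, author, tweet_id, ts):
        """Store the tweet; return how many timeline writes the push cost."""
        self.tweets_by_author[author].append((ts, tweet_id))
        if self.is_celebrity(author):
            return 0  # pull path: no fan-out on write
        for follower in self.followers[author]:
            self.timelines[follower].append((ts, tweet_id))
        return len(self.followers[author])

    def feed(self, user, limit=100):
        """Merge the pre-pushed timeline with tweets pulled from celebrities."""
        pushed = list(self.timelines[user])
        pulled = [
            tweet
            for author in self.following[user]
            if self.is_celebrity(author)
            for tweet in self.tweets_by_author[author]
        ]
        # set() drops tweets that appear on both paths, e.g. pushed before
        # the author crossed the celebrity threshold.
        merged = sorted(set(pushed + pulled), reverse=True)
        return [tweet_id for _, tweet_id in merged[:limit]]


demo = HybridFeed(celebrity_threshold=2)
demo.follow("u1", "star")
demo.follow("u2", "star")
demo.follow("u1", "friend")
print(demo.post("friend", "t1", ts=1))  # 1 timeline write (push)
print(demo.post("star", "t2", ts=2))    # 0 timeline writes (pull)
print(demo.feed("u1"))                  # ['t2', 't1']
```

The return value of `post` is the number to point at in the interview: it is the per-tweet write cost the hybrid design is trying to bound.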
Trade-offs to articulate:
- "Pure push doesn't scale for celebrities. Pure pull doesn't meet read latency. Hybrid is the answer most production systems converge on."
- "What's the threshold for 'celebrity'? Some teams use follower count (>10K?), some use cost-model. It's tuned operationally."
Step 6 — Trade-offs + wrap-up:
- "Eventually consistent — users might see a tweet 30 seconds late. Acceptable for social feeds, not for banking."
- "Hot-spotting on celebrities — special handling needed."
- "Search isn't covered; would need Elasticsearch or similar."
Concepts to name out loud:
- This is fan-out as the canonical social-graph trade-off — every social product (Twitter, Instagram, Facebook) faces this. Knowing the three options is mandatory.
- This is the celebrity problem — power-law distributions break naive designs. Always ask: "what does the long tail look like? what does the head look like?"
After-action prompt: "You ran the framework on a harder problem. Did you hit time pressure in step 5 (deep dive)? That's the most time-consuming step; budget for it in real interviews."
Phase 4 — Drill #3: Design a chat app (~20 min, timed)
Goal: Different problem class (real-time push instead of batch reads). Same framework.
Set a 20-minute timer.
Step 1 — Requirements (3 min):
- Functional: 1:1 and group chat. Real-time delivery. Read receipts (skip if time short).
- Non-functional: 100M users, 50M concurrent. Messages should arrive within 1 second. Persist messages for history.
- Scope: Text only. No voice/video. No file attachments.
Step 2 — API (2 min):
- WebSocket connection: client opens persistent connection to chat server.
- Messages over WebSocket: { type: 'message', to: user_id_or_group, text: string }.
- REST for history: GET /conversations/{id}/messages?before=...
Step 3 — Data model (2 min):
USER (id, ...)
CONVERSATION (id, type: 'direct'|'group')
PARTICIPANT (conversation_id, user_id)
MESSAGE (id, conversation_id, sender_id, text, sent_at)
Step 4 — High-level design (8 min):
[Client] ─WebSocket─► [Chat Server Pool] ─► [Message Queue (Kafka)] ─► [Message DB]
                              │
                              ├─► [Presence Service] (who's online)
                              │
                              └─► [Notification Service] (push for offline users)
Key design decisions:
- WebSocket vs long-poll vs SSE: WebSocket for bi-directional real-time.
- Routing: which chat server holds the user's connection? Consistent hashing on user_id, with a "user → server" registry (Redis or ZooKeeper).
- Delivery to recipient: sender's chat server enqueues message; recipient's chat server picks it up and pushes via WebSocket. If recipient is offline, queue for later + send push notification.
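The routing bullet mentions consistent hashing on user_id. A minimal hash-ring sketch shows the property that matters for chat servers: removing a server only remaps the users that server owned. The `HashRing` class, the server names, the virtual-node count, and the choice of SHA-256 are all assumptions for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per server, to smooth the spread
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> server name
        for server in servers:
            self.add_server(server)

    def add_server(self, server: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove_server(self, server: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def server_for(self, user_id: str) -> str:
        """Walk clockwise from the user's hash to the next server point."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(user_id)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["chat-1", "chat-2", "chat-3"])
users = [f"user-{i}" for i in range(1000)]
before = {u: ring.server_for(u) for u in users}
ring.remove_server("chat-2")
after = {u: ring.server_for(u) for u in users}
moved = sum(1 for u in users if before[u] != after[u])
print(f"{moved} of {len(users)} users remapped after removing chat-2")
```

In practice a live WebSocket cannot be moved by rehashing, which is why the bullet pairs the ring with a "user → server" registry: the ring suggests where a new connection should land, and the registry records where it actually is.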
Step 5 — Deep dive (3 min): How do we know if a message was delivered?
- Recipient's client ACKs the message back over its WebSocket.
- ACK propagates back to sender → "delivered" indicator turns blue.
- If no ACK in N seconds → mark "pending."
- If recipient is offline → message is in the DB; will deliver when they reconnect.
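The ACK flow above amounts to a small per-message state machine. A sketch under stated assumptions: the `DeliveryTracker` class, the status names, and the 5-second timeout are illustrative, and time is passed in explicitly so the behaviour is deterministic.

```python
from enum import Enum


class Status(Enum):
    SENT = "sent"            # handed to the recipient's chat server
    DELIVERED = "delivered"  # recipient's client ACKed
    PENDING = "pending"      # no ACK within the timeout


class DeliveryTracker:
    def __init__(self, ack_timeout_s: float = 5.0):
        self.ack_timeout_s = ack_timeout_s  # the "N seconds" from the text
        self._sent_at = {}                  # message_id -> send time
        self._status = {}                   # message_id -> Status

    def on_send(self, message_id: str, now: float) -> None:
        self._sent_at[message_id] = now
        self._status[message_id] = Status.SENT

    def on_ack(self, message_id: str) -> None:
        # A late ACK (e.g. after the recipient reconnects) still counts.
        if message_id in self._status:
            self._status[message_id] = Status.DELIVERED

    def sweep(self, now: float) -> None:
        """Mark un-ACKed messages older than the timeout as pending."""
        for message_id, sent in self._sent_at.items():
            if (self._status[message_id] is Status.SENT
                    and now - sent >= self.ack_timeout_s):
                self._status[message_id] = Status.PENDING

    def status(self, message_id: str) -> Status:
        return self._status[message_id]


tracker = DeliveryTracker(ack_timeout_s=5)
tracker.on_send("m1", now=0)
tracker.on_send("m2", now=0)
tracker.on_ack("m1")
tracker.sweep(now=6)
print(tracker.status("m1").value)  # delivered
print(tracker.status("m2").value)  # pending
```

A pending message is not lost: it is already persisted, so the state simply flips to delivered when the recipient reconnects and ACKs.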
Step 6 — Trade-offs (2 min):
- "WebSocket connections are long-lived and stateful, so the load balancer has to support them: an L4 (TCP) balancer works, and so does an L7 (HTTP) balancer that handles the WebSocket upgrade and allows long idle timeouts."
- "Scaling chat servers: each holds N connections. Horizontal scale = more servers + routing."
- "End-to-end encryption is a whole separate design (Signal protocol). Out of scope for this conversation."
Concepts to name out loud:
- This is WebSocket as the right primitive for real-time — bi-directional, persistent, low-overhead per message. Long-poll is the fallback for environments that can't do WebSocket.
- This is the offline-user problem — real-time systems must gracefully degrade to async (notifications, message queue). Without this, offline users miss messages.
After-action prompt: "You ran the framework on a 3rd problem class. Notice: the framework didn't change. The CONTENT changed. That's the win — the framework transfers."
Phase 5 — Self-score + recovery patterns (~10 min)
Goal: Self-assess against a rubric. Internalize how to recover from getting stuck.
5-point rubric (score each interview drill 1-5):
| Dimension | 1 (poor) | 3 (ok) | 5 (strong) |
|---|---|---|---|
| Clarifying questions | Jumped to design without asking | Asked 1-2 questions | Asked functional + non-functional + scope, got scale numbers |
| API + data model | Skipped or hand-waved | Sketched both, mostly correct | Specific endpoints, explicit primary keys, cardinality clear |
| Architecture | Vague boxes, no labels | Components + arrows | Clear shapes, labeled arrows, sync vs async distinguished |
| Trade-offs articulated | "I'd use X" with no reasoning | Mentioned 1 alternative | Discussed 2-3 alternatives with explicit pros/cons |
| Communication / narration | Silent while drawing, mumbled | Talked through some moves | Continuous narration, signposted each step, paused for interviewer input |
Score your 3 drills. Identify the weakest dimension.
Recovery patterns — when you get stuck:
- "Let me state the trade-off." Even if you don't know the right answer, naming the trade-off (consistency vs availability, push vs pull, sync vs async) shows judgment.
- "What I'd want to know more about is..." Names a gap honestly — better than bluffing.
- "I'll come back to that — let me finish this section first." Gives you time. Don't get derailed.
- "Can I sketch and explain at the same time?" Resets the tempo — gets you drawing again.
- "I don't know X, but I'd approach it by..." Honest > bluffed. Interviewers can smell a bluff.
Concepts to name out loud:
- This is interviews as performance + content — content gets you to 60%. Communication + framework get you to 90%.
- This is why the rubric beats vibes — "I think it went well" is unactionable. "Communication was a 2 because I went silent for 5 minutes" is fixable.
After-action prompt: "You self-scored 3 interviews. Your weakest dimension is your homework. Drill that dimension for 30 min/week until it's a 4+."
When to break the method
- Learner has interviewed before → skip Phase 1 framework explanation, go straight to Phase 2 drills.
- Learner is going for FAANG-style 60-minute interviews → add a 4th drill (design Uber, design Dropbox) and emphasize the Step 5 deep dive.
- Learner is going for junior IC roles (not senior) → spend more time on Phase 2 (URL shortener); deep architecture in step 5 is less critical at that level.
Definition of done
Observable, the learner can:
- Recite the 6-step framework with time-boxes.
- Run the framework on a brand-new problem (interviewer-supplied) in 25 minutes.
- Ask functional + non-functional clarifying questions in the first 5 minutes.
- Estimate QPS, storage, and bandwidth using back-of-envelope math.
- Articulate at least 2 trade-offs per design (with pros/cons each).
- Self-score against the 5-point rubric and name the weakest dimension.
Next project
→ wbd-capstone-present-a-system — capstone. Pick a real system you've built or used, draw all 4 diagram types for it, present it live to another human, take their feedback, redraw. The whiteboarding skill graduates from drills into real communication.