hld-writer

By Prathmesh2000 · Updated 3/5/2026

---
name: "hld-writer"
description: 'Use when creating a High-Level Design document for a system, product, or major feature. Triggers: "HLD", "high level design", "system design", "architecture document", "system architecture", "design document", "capacity planning", "service architecture", "how should we architect", or before designing any system involving multiple services, significant scale, or major infrastructure decisions. HLD comes before LLD.'
---

# HLD Writer Skill

Produce comprehensive High-Level Design documents. HLD defines the what and why of a system at architectural level — service boundaries, data flows, technology choices, trade-offs, and capacity planning — before any LLD or implementation begins.


## HLD vs LLD

| HLD (this skill) | LLD (lld-writer skill) |
|---|---|
| System-level view | Component-level view |
| WHY these services exist | HOW each class is structured |
| Technology selection | Method signatures + types |
| Service boundaries | DB column definitions |
| Data flow between services | API request/response schemas |
| Non-functional requirements | Business logic rules |
| Capacity estimates | Unit test plan |
| Risk and trade-off analysis | Implementation checklist |
| Written before LLD | Written after HLD approved |

## HLD Document Template

# High-Level Design: [System / Feature Name]

**Author:** HLD Architect
**Date:** [YYYY-MM-DD]
**Status:** Draft → Review → Approved
**Related PRD:** [link]
**Reviewers:** CTO, Backend Architect, DevOps Engineer, Security Analyst

---

## 1. Executive Summary

[3-5 sentences: what we're building, why, and the key architectural decision.
Non-technical stakeholders should understand this section.]

Example:
"We are building a real-time order tracking system that allows customers to
see live updates as their orders move through fulfillment. The system must
handle 50k concurrent connections at peak. We will use WebSockets (Socket.io)
on dedicated tracking servers with Redis pub/sub to fan out status updates
from our fulfillment service."

---

## 2. Goals and Non-Goals

### Goals
- [ ] [Specific, measurable goal 1]
- [ ] [Specific, measurable goal 2]

### Non-Goals (Explicitly out of scope for v1)
- [What we are NOT building — prevents scope creep]

### Success Criteria
| Metric | Current | Target | Measurement |
|---|---|---|---|
| p95 API latency | N/A | < 200ms | APM dashboard |
| Uptime | N/A | 99.9% | PagerDuty |
| Concurrent users | N/A | 10,000 | Load test |

---

## 3. Background and Context

### Current State
[What exists today — architecture, pain points, constraints]

### Why Now
[What's forcing this design — scale, new product requirement, tech debt]

### Constraints
- **Technical:** [Must use existing auth service / Cannot change DB schema]
- **Time:** [Must ship by Q3 — some complexity deferred to v2]
- **Cost:** [Monthly infra budget ceiling: $X]
- **Team:** [2 backend engineers, 1 DevOps — no ML expertise]

---

## 4. System Architecture

### C4 Context Diagram

```mermaid
C4Context
  title System Context: Order Tracking

  Person(customer, "Customer", "Tracks their orders via app")
  Person(ops, "Operations Team", "Monitors fulfillment")

  System(orderTracker, "Order Tracking System", "Real-time order status")

  System_Ext(fulfillmentSys, "Fulfillment System", "Updates order status")
  System_Ext(notifSys, "Notification Service", "Sends SMS/email")
  System_Ext(authSys, "Auth Service", "Validates user tokens")

  Rel(customer, orderTracker, "Tracks orders via", "WebSocket/HTTPS")
  Rel(ops, orderTracker, "Views dashboard", "HTTPS")
  Rel(fulfillmentSys, orderTracker, "Pushes status updates", "Event stream / webhook")
  Rel(orderTracker, notifSys, "Triggers notifications", "REST API")
  Rel(orderTracker, authSys, "Validates tokens", "REST API")
```

### C4 Container Diagram

```mermaid
C4Container
  title Container Diagram: Order Tracking System

  Container(webApp, "Web App", "React SPA", "Customer tracking UI")
  Container(apiGateway, "API Gateway", "nginx / Kong", "Auth, rate limiting, routing")
  Container(trackingApi, "Tracking API", "Node.js + Express", "REST + WebSocket endpoints")
  Container(workerService, "Event Worker", "Node.js", "Consumes fulfillment events")
  ContainerDb(postgres, "Orders DB", "PostgreSQL", "Order data and status history")
  ContainerDb(redis, "Redis", "Redis Cluster", "WebSocket pub/sub + session cache")
  Container(fulfillmentQueue, "Event Queue", "SQS / Kafka", "Fulfillment status events")

  Rel(webApp, apiGateway, "HTTPS / WSS")
  Rel(apiGateway, trackingApi, "HTTP + WS proxy")
  Rel(trackingApi, postgres, "Read orders", "Sequelize")
  Rel(trackingApi, redis, "Pub/sub + cache", "ioredis")
  Rel(workerService, fulfillmentQueue, "Consumes events", "SQS SDK")
  Rel(workerService, postgres, "Updates status")
  Rel(workerService, redis, "Publishes updates", "PUBLISH")
```

## 5. Service Descriptions

| Service | Responsibility | Technology | Scaling |
|---|---|---|---|
| API Gateway | Auth validation, rate limiting, SSL termination | nginx + Kong | Horizontal, stateless |
| Tracking API | WebSocket connections, REST endpoints | Node.js 20, Express, Socket.io | Horizontal (Redis adapter) |
| Event Worker | Consume fulfillment events, update DB, broadcast | Node.js, SQS consumer (AWS SDK) | Horizontal (independent) |
| Orders DB | Source of truth for order state | PostgreSQL 15 (RDS Multi-AZ) | Vertical + read replicas |
| Redis | WebSocket pub/sub, session/cache | Redis 7 Cluster | Horizontal cluster |
| Event Queue | Durable fulfillment event buffer | AWS SQS FIFO | Managed, auto-scaling |

## 6. Data Architecture

### Data Flow Diagram

```mermaid
flowchart LR
  A[Fulfillment System] -->|status update| B[SQS FIFO Queue]
  B -->|consume| C[Event Worker]
  C -->|UPDATE order status| D[(PostgreSQL)]
  C -->|"PUBLISH order:{id}"| E[(Redis Pub/Sub)]
  E -->|broadcast| F[Tracking API instances]
  F -->|emit over WebSocket| G[Connected Clients]
```

### Data Ownership

| Data | Owner Service | Storage | Access Pattern |
|---|---|---|---|
| Order records | Tracking API | PostgreSQL | Read-heavy, occasional write |
| Status history | Tracking API | PostgreSQL | Append-only |
| Active WS sessions | Tracking API | Redis (TTL 1h) | High-frequency read/write |
| Fulfillment events | Event Worker | SQS (consumed) | Sequential consume |

## 7. Key Design Decisions

For each major decision, document as a mini-ADR:

### Decision 1: WebSocket vs SSE for real-time updates

| Factor | WebSocket | SSE |
|---|---|---|
| Bidirectional | ✅ Yes | ❌ No (server→client only) |
| Client support | ✅ All browsers | ✅ All modern browsers |
| Infra complexity | Medium (socket server) | Low (HTTP stream) |
| Horizontal scale | Needs Redis adapter | Stateless |
| Use case fit | Chat, collab | ✓ One-way push |

**Decision:** SSE for order status (one-way, simpler). WebSocket only if we add two-way chat.
**Consequences:** Simpler infra, no Redis adapter needed for v1.

### Decision 2: SQS vs Kafka for fulfillment events

| Factor | SQS FIFO | Kafka |
|---|---|---|
| Ops overhead | Low (managed) | High (self-managed or MSK) |
| Throughput | 300 msg/s (3,000 msg/s with batching) | 1M+ msg/s |
| Message retention | Up to 14 days | Configurable (days to forever) |
| Replay capability | ❌ No (messages are deleted once consumed) | ✅ Replay by offset |
| Cost (10M msg/mo) | ~$4 | MSK ~$200+/month |

**Decision:** SQS FIFO for v1 (volume < 1k events/min, ops simplicity, cost).
**Revisit if:** Event volume > 50k/min or replay needed for ML/analytics.
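
The "Revisit if" clause above is easier to act on when it is written as an explicit check that a monitoring job or planning review can evaluate. The following is a minimal sketch in Python, used here purely for illustration. The names `QueueLoad` and `should_revisit_queue_choice` are hypothetical; only the 50k events/min threshold and the replay condition come from the decision itself.

```python
from dataclasses import dataclass

# Threshold taken from the "Revisit if" clause of Decision 2.
REVISIT_EVENTS_PER_MIN = 50_000


@dataclass(frozen=True)
class QueueLoad:
    """Observed or forecast load on the fulfillment event queue (hypothetical type)."""
    events_per_minute: int
    replay_needed: bool = False


def should_revisit_queue_choice(load: QueueLoad) -> bool:
    """Return True when either revisit condition from the decision is met."""
    return load.replay_needed or load.events_per_minute > REVISIT_EVENTS_PER_MIN


print(should_revisit_queue_choice(QueueLoad(events_per_minute=1_000)))   # False
print(should_revisit_queue_choice(QueueLoad(events_per_minute=60_000)))  # True
```

Writing the trigger down this way keeps the ADR honest: the decision is reopened when a stated condition is met rather than when someone happens to remember it.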


## 8. Non-Functional Requirements

### Performance

| Requirement | Target | Rationale |
|---|---|---|
| REST API p95 latency | < 200ms | Acceptable for UI responsiveness |
| WebSocket/SSE event delivery | < 500ms end-to-end | "Real-time" perception |
| Status update fanout | < 2s worst case | SQS delay + processing + broadcast |

### Scalability

Peak load estimate:
- 50,000 concurrent WebSocket connections
- 5,000 order status updates/minute
- 200 REST API requests/second

Scaling strategy:
- Tracking API: horizontal (stateless with Redis adapter)
- Event Worker: horizontal (N consumers; ordering preserved per order ID via the SQS FIFO message group ID)
- PostgreSQL: read replica for GET endpoints
- Redis: cluster mode for pub/sub scale
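
The peak-load figures above become more useful for sizing once they are broken down per instance. The sketch below is illustrative only: the function name `per_task_load` is an assumption, load is assumed to spread evenly, and the task count of 10 is borrowed from the `max 10 tasks` figure in the infrastructure overview.

```python
import math


def per_task_load(concurrent_connections: int, updates_per_minute: int,
                  requests_per_second: int, tasks: int) -> dict:
    """Split peak load evenly across `tasks` identical instances (a simplification)."""
    return {
        "connections_per_task": math.ceil(concurrent_connections / tasks),
        "updates_per_second": updates_per_minute / 60,  # system-wide event rate
        "requests_per_second_per_task": requests_per_second / tasks,
    }


# Peak figures from the estimate above, spread over 10 Tracking API tasks.
peak = per_task_load(50_000, 5_000, 200, tasks=10)
print(peak)
```

With 10 tasks, each instance holds 5,000 connections and serves 20 REST requests per second, while the system as a whole processes about 83 status updates per second. Whether the chosen task size can sustain that is something only a load test can confirm.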

### Availability

Target: 99.9% uptime (about 8.76 hours of downtime per year)

- Multi-AZ PostgreSQL (RDS)
- Redis Cluster (3 nodes, 1 replica each)
- API instances: min 2, across 2 AZs
- SQS: AWS-managed (99.9% monthly uptime SLA)

Single points of failure:
- Redis: mitigated by cluster + persistence
- PostgreSQL: Multi-AZ automatic failover
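
An uptime target is easier to review when it is converted into a downtime budget, and when the effect of chaining dependencies is made visible. This is a rough sketch under stated assumptions: the function names are hypothetical, a year is taken as 365 days, and the serial model assumes independent failures of components that must all be up.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of allowed downtime in a period for a given availability percentage."""
    return period_hours * (1 - availability_pct / 100)


def serial_availability(*component_pcts: float) -> float:
    """Combined availability (in percent) when every component must be up."""
    result = 1.0
    for pct in component_pcts:
        result *= pct / 100
    return result * 100


yearly_hours = downtime_budget_hours(99.9)
monthly_minutes = downtime_budget_hours(99.9, period_hours=30 * 24) * 60
print(f"{yearly_hours:.2f} h/year, {monthly_minutes:.1f} min/month")
print(f"two 99.9% dependencies in series: {serial_availability(99.9, 99.9):.4f}%")
```

The serial figure shows why each hard dependency on the request path matters: two components at 99.9% each already pull the combined figure below the 99.9% target.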

### Security

- All connections over TLS (HTTPS / WSS)
- JWT validation at API Gateway (no token → reject before service)
- Users can only track their own orders (authorization check in service: the order must belong to the caller)
- No PII in Redis (store order IDs only, resolve from DB)
- SQS access restricted by AWS IAM policies (requests are SigV4-signed)
- Rate limiting: 100 req/min per user
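
The "100 req/min per user" rule can be made concrete with a small fixed-window counter. This is only a sketch of the semantics, not how nginx or Kong implement rate limiting. The class name is hypothetical, state is kept in process memory, and the clock is passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per user in each `window_seconds` window."""

    def __init__(self, limit: int = 100, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        # (user_id, window index) -> request count. A real gateway would keep
        # this in a shared store and expire old windows; this sketch never evicts.
        self._counts = defaultdict(int)

    def allow(self, user_id: str, now: float) -> bool:
        """Record a request at time `now` (seconds) and report whether it is allowed."""
        key = (user_id, int(now // self.window_seconds))
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


limiter = FixedWindowRateLimiter()
results = [limiter.allow("user-1", now=0.0) for _ in range(101)]
print(results.count(True), results[-1])  # 100 False
```

A fixed window is the simplest variant. It permits short bursts around window boundaries, so the HLD should state whether that is acceptable or whether a sliding window or token bucket is required.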

## 9. API Surface

High-level API definition (detailed contracts in LLD):

```
REST:
  GET  /api/orders/:id/status    → Current status + history
  GET  /api/orders               → User's orders (paginated)

SSE/WebSocket:
  GET  /api/orders/:id/stream    → Live status updates stream

Internal (service-to-service):
  POST /internal/orders/:id/status → Update status (from Event Worker)
                                     Auth: service API key, not user JWT
```

## 10. Infrastructure Overview

```
AWS us-east-1:
  VPC (10.0.0.0/16)
    Public subnets (2 AZ):  ALB, NAT Gateway
    Private subnets (2 AZ): ECS services, RDS, ElastiCache

  ECS Fargate:
    tracking-api:   min 2, max 10 tasks × (0.5 vCPU, 1GB RAM)
    event-worker:   min 1, max 5 tasks  × (0.25 vCPU, 512MB RAM)

  RDS PostgreSQL:  db.t3.medium, Multi-AZ, 100GB gp3
  ElastiCache:     cache.r7g.large × 3 (Redis cluster)
  SQS FIFO:        order-status-events.fifo

  Estimated monthly cost: $380-480 at expected load
```

## 11. Risks and Mitigations

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Redis cluster unavailable | Low | High | Graceful degrade: REST polling fallback |
| SQS message duplication | Medium | Low | Idempotent status updates (check current status before update) |
| DB connection exhaustion | Low | High | Connection pool + pgBouncer |
| Event Worker lag > 30s | Medium | Medium | Dead letter queue + alert at > 100 messages queued |
| WebSocket auth token expiry | High | Low | Refresh via parallel REST call |
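
The "idempotent status updates" mitigation can be sketched as a check that ignores duplicate or stale events before anything is written or broadcast. This is a minimal in-memory illustration: `OrderStore`, `apply_status_event`, and the `STATUS_ORDER` list are hypothetical, and the real status values would come from the fulfillment domain model.

```python
from dataclasses import dataclass, field

# Hypothetical forward-only status progression, used only for this sketch.
STATUS_ORDER = ["PLACED", "PACKED", "SHIPPED", "DELIVERED"]


@dataclass
class OrderStore:
    """In-memory stand-in for the Orders DB."""
    status: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def apply_status_event(self, order_id: str, new_status: str) -> bool:
        """Apply an event; return True only if the stored status actually changed."""
        current = self.status.get(order_id)
        if current is not None and (
            STATUS_ORDER.index(new_status) <= STATUS_ORDER.index(current)
        ):
            return False  # duplicate or stale event: skip the write and the broadcast
        self.status[order_id] = new_status
        self.history.append((order_id, new_status))
        return True


store = OrderStore()
print(store.apply_status_event("order-1", "PACKED"))  # True
print(store.apply_status_event("order-1", "PACKED"))  # False (duplicate delivery)
```

Against a real database the check and the write would need to be atomic, for example a single conditional update or a transaction, so that two workers handling the same event cannot both pass the check.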

## 12. Migration Plan (if replacing existing system)

```
Phase 1 (week 1-2): New system built, tested in staging
Phase 2 (week 3):   Shadow mode: new system processes events, old system still primary
Phase 3 (week 4):   Gradual traffic shift: 10% → 50% → 100% via feature flag
Phase 4:            Old system decommissioned after 2 weeks stable operation
Rollback:           Feature flag → 0%, old system back in < 5 minutes
```
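
The gradual shift in Phase 3 and the rollback path both depend on a percentage-based feature flag. One common way to implement such a flag is stable hash bucketing, sketched below. The function names are hypothetical, and this is an illustration of the idea rather than the API of any particular feature-flag product.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_system(user_id: str, rollout_percent: int) -> bool:
    """Route the user to the new system if their bucket falls below the rollout percentage."""
    return rollout_bucket(user_id) < rollout_percent


users = [f"user-{i}" for i in range(1000)]
for percent in (0, 10, 50, 100):
    routed = sum(use_new_system(u, percent) for u in users)
    print(f"{percent}% flag -> {routed} of {len(users)} users on the new system")
```

Because the bucket depends only on the user ID, a user who is moved at 10% stays on the new system at 50% and 100%, and setting the flag to 0% sends everyone back to the old system at once.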

## 13. Open Questions

| # | Question | Owner | Due | Blocks |
|---|---|---|---|---|
| 1 | Should SSE or WebSocket be used? | HLD Architect | [date] | Architecture |
| 2 | Kafka vs SQS at 2× growth? | CTO | [date] | Infra sizing |
| 3 | GDPR: how long to retain status history? | Legal/PM | [date] | DB schema |

## HLD Review Sign-offs Required

- [ ] CTO: architecture direction approved
- [ ] Backend Architect: service design approved
- [ ] DevOps Engineer: infra design approved
- [ ] Security Analyst: security design approved
- [ ] DB Architect: data model approved
- [ ] PM: NFRs meet product requirements

→ After all sign-offs: create LLD for each service


---

## HLD Quality Checklist (reviewer uses this)

- [ ] C4 context + container diagrams present and accurate
- [ ] Every service has defined responsibility (single reason to exist)
- [ ] Data ownership is clear — no service owns data it doesn't need
- [ ] All design decisions have documented alternatives + rationale
- [ ] NFRs are quantified (latency in ms, uptime as %, scale as req/s)
- [ ] Capacity estimates back the infrastructure choice
- [ ] Single points of failure identified with mitigations
- [ ] Open questions have owners and due dates
- [ ] Cost estimate included
- [ ] Migration strategy defined if replacing existing system