name: performance-profiling
description: Find and fix backend bottlenecks through connection pooling, p95/p99 latency analysis, load testing with SLO-aligned thresholds (k6), and CPU profiling (flame graphs). Use when latency is high, throughput plateaus, or before scaling traffic. Not for DB query specifics (use query-optimization) or read-load shedding (use caching-strategy).
license: MIT
Performance Profiling
Purpose
Locate the actual backend bottleneck with data (not intuition), fix it, and prove the fix with load tests gated on tail latency — so the system scales predictably.
Universal — measure-first profiling, p95/p99 over averages, connection pooling, and SLO-gated load testing are backend-perf principles; the profiler/load tool differs.
Procedure
Measure before optimizing — find the real bottleneck
- Use observability metrics (observability-setup) to see WHERE time goes: DB? external call? CPU? lock contention?
- The bottleneck is usually DB query time (→ query-optimization), not app CPU; confirm before profiling app code
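Where full observability is not yet wired up, a rough per-phase timer can show which phase dominates. This is a minimal sketch, not a replacement for real metrics; the helper names `timed` and `slowestPhases` and the phase labels are hypothetical.

```typescript
// Accumulated wall-clock milliseconds per phase label (labels are illustrative).
const phaseTotals = new Map<string, number>();

// Runs `fn`, adds its elapsed time to the bucket for `label`, and returns its result.
export async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    const elapsed = performance.now() - start;
    phaseTotals.set(label, (phaseTotals.get(label) ?? 0) + elapsed);
  }
}

// Returns [label, totalMs] pairs sorted by total time, largest first.
export function slowestPhases(): Array<[string, number]> {
  return [...phaseTotals.entries()].sort((a, b) => b[1] - a[1]);
}
```

Wrapping the DB call, the external call, and the CPU-bound step in `timed(...)` and reading `slowestPhases()` gives a first answer to "where does the time go" before any profiler is attached.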
Right-size connection pooling
- DB connections are scarce; an unbounded/oversized pool exhausts the DB, an undersized one starves the app
- Use a transaction-mode connection pooler for serverless/high-concurrency (specific poolers in Implementation)
- Pool size per instance ≈ DB max connections ÷ number of app instances (leave headroom for admin/migration connections), not "bigger is better"
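The sizing rule above is simple arithmetic. A minimal sketch follows, assuming a hypothetical `reservedConnections` headroom parameter that the original rule does not spell out:

```typescript
// Per-instance pool size from the database's connection budget.
// `reservedConnections` is an assumed headroom for admin tools, migrations, etc.
export function poolSizePerInstance(
  dbMaxConnections: number,
  appInstances: number,
  reservedConnections = 0,
): number {
  if (appInstances < 1) throw new RangeError("appInstances must be >= 1");
  const usable = dbMaxConnections - reservedConnections;
  // A result pinned at the floor of 1 means the instances already
  // oversubscribe the DB, which is a signal to put a pooler in front of it.
  return Math.max(1, Math.floor(usable / appInstances));
}

// 100 max connections, 10 reserved, 6 instances: (100 - 10) / 6 = 15 each.
console.log(poolSizePerInstance(100, 6, 10)); // prints 15
```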
Load test with SLO-aligned thresholds
- Define p(95)/p(99) latency + error-rate thresholds IN the test
- Tail latency (p99), not average, defines API health; averages hide the slow requests users feel
- Test realistic scenarios (ramp-up, sustained, spike), not just a flat hammer
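To see why averages hide slow requests, compare the mean with nearest-rank percentiles on a made-up sample. The data below is illustrative, not a measurement, and nearest-rank is only one of several percentile definitions (load tools may interpolate differently).

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples <= it.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

export function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Illustrative data: 97 fast requests at 50 ms and 3 slow ones at 2000 ms.
export const latenciesMs: number[] = [
  ...new Array<number>(97).fill(50),
  ...new Array<number>(3).fill(2000),
];

console.log(mean(latenciesMs));           // prints 108.5
console.log(percentile(latenciesMs, 95)); // prints 50
console.log(percentile(latenciesMs, 99)); // prints 2000
```

The mean and even p95 look healthy here, while p99 exposes the 2-second requests that 3% of users actually hit.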
CPU profiling when app code is the bottleneck
- Capture a flame graph; find the hot path (sync work blocking the event loop in Node, GIL contention in Python)
- Move CPU-heavy work off the request path (worker / background-jobs) or optimize the hot function
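One way to move CPU-heavy work off the request path in Node is a worker thread. The sketch below uses the built-in `node:worker_threads` module; the recursive Fibonacci is a stand-in for real CPU work, and the worker body is inlined as a string only to keep the example in one file (a real project would put it in its own module and likely reuse workers through a pool).

```typescript
import { Worker } from "node:worker_threads";

// Worker body as plain JavaScript; evaluated in the worker via { eval: true }.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  function fib(n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
  parentPort.postMessage(fib(workerData));
`;

// Computes fib(n) on a separate thread so the main event loop stays free
// to serve other requests while the computation runs.
export function fibInWorker(n: number): Promise<number> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: n });
    worker.once("message", resolve);
    worker.once("error", reject);
    worker.once("exit", (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}
```

A request handler would `await fibInWorker(n)` instead of calling the function inline, so other requests keep being served while the worker computes.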
4b. Memory profiling — leaks and unbounded growth
- "Slow over time" is often a leak, not a CPU issue: take two heap snapshots at different points in the process's uptime and diff them (Retainers view in Chrome DevTools / node --heapsnapshot-signal=<signal>)
- Watch RSS / heapUsed trends in observability; a steady climb between deploys = leak
- Common Node leaks: event-listener accumulation, closures over large objects in a module-level cache, unbounded Map/array
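For the "unbounded Map" case, one common remedy is to cap the cache and evict the least recently used entry. The class below is a hypothetical minimal sketch that relies on `Map` preserving insertion order; it is not a drop-in for a production LRU library.

```typescript
// A module-level cache with a hard size cap, so it cannot grow without bound.
export class BoundedCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    if (maxEntries < 1) throw new RangeError("maxEntries must be >= 1");
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark the key as most recently used.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```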
4c. Event-loop lag as a first-class signal (Node-specific, conceptually general)
- When the event loop blocks, every concurrent request stalls (not just the slow one). Monitor eventLoopUtilization() / monitorEventLoopDelay() and alert on a threshold
- Equivalent in other runtimes: GC pause % or garbage-collection time per minute
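Both signals are available from Node's built-in `node:perf_hooks`. The sketch below samples them on demand; the wrapper name `startEventLoopMonitor`, the 20 ms resolution, and the threshold parameter are assumptions, and a real setup would export the values to the metrics system rather than return them.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

export interface EventLoopSample {
  utilization: number; // fraction of time the loop was busy since the last sample
  p99Ms: number;       // p99 event-loop delay in milliseconds since the last sample
  breached: boolean;   // true when p99 delay exceeds the configured threshold
}

export function startEventLoopMonitor(lagThresholdMs: number) {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  return {
    sample(): EventLoopSample {
      const current = performance.eventLoopUtilization();
      // Passing two readings yields the utilization for the interval between them.
      const delta = performance.eventLoopUtilization(current, previous);
      previous = current;
      const p99Ms = histogram.percentile(99) / 1e6; // histogram values are nanoseconds
      histogram.reset();
      return { utilization: delta.utilization, p99Ms, breached: p99Ms > lagThresholdMs };
    },
    stop(): void {
      histogram.disable();
    },
  };
}
```

Calling `sample()` on a fixed interval and alerting when `breached` is true gives the "alert on a threshold" behavior described above.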
Address tail latency causes
- GC pauses, cold caches, connection acquisition waits, N+1 (→ query-optimization), lock contention (→ transaction-management)
Validate (validation loop)
- Re-run the load test after each fix; if p99 still breaches the SLO threshold → the bottleneck moved; re-profile and fix the new top contributor
- Gate the load test in CI (fail on threshold breach) so regressions are caught
Anti-patterns
| ❌ Anti-pattern | ✅ Correct |
|---|---|
| Optimizing by guessing | Profile/measure first; fix the proven bottleneck |
| Tuning on average latency | Gate on p95/p99 tail latency |
| Unbounded / "bigger" connection pool | Right-sized pool + transaction-mode pooler |
| CPU-heavy work on the request thread (blocks event loop) | Offload to a worker / background job |
| Load test = flat hammer | Realistic ramp/sustained/spike scenarios |
| "Slow over time" treated as CPU when it's a leak | Heap snapshot diff between two timepoints; track RSS trend |
| Event-loop lag unmonitored (silent stalls) | Track eventLoopUtilization / GC pause % with alert thresholds |
Severity tiers
| Tier | Examples | Action SLA |
|---|---|---|
| Critical | Connection pool exhaustion causing outages; p99 > SLO by multiples; event loop blocked by sync CPU work | Block release; fix immediately |
| Major | No load test before a traffic event; pool size guessed; tail latency unmeasured | Fix this sprint |
| Minor | Load test not in CI; minor GC tuning opportunity | Schedule within 2 sprints |
Completion Criteria
- Bottleneck identified with metrics/profiler (not guessed)
- Connection pool right-sized + pooler in place
- Load test defines + passes p95/p99 + error-rate thresholds
- CPU hot paths offloaded or optimized
- Load test gated in CI
Output
- Load test scripts: k6 with SLO thresholds (exit-code gating)
- Profiling report: docs/perf-profile-YYYY-MM-DD.md (bottleneck, fix, before/after p95/p99)
- Commit format: perf(backend): right-size connection pool / perf(backend): offload <work> to worker
Implementation
TypeScript + Node + Postgres (default)
- Pooling: PgBouncer or Supabase pooler (transaction mode); Prisma connection limit tuned to pool
- Load test: Grafana k6 with thresholds: { http_req_duration: ['p(95)<300','p(99)<800'] }; CI fails on breach (k6 exits with code 99)
- Profiling: node --prof / 0x for flame graphs (note: clinic.js is no longer actively maintained; use node --prof / 0x instead)
- Watch: event-loop lag (don't block it with sync CPU)
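A k6 script carries its thresholds and scenarios in an exported `options` object. The fragment below sketches only that object: the latency thresholds come from the bullet above, while the 1% error-rate limit, scenario name, VU counts, and durations are assumptions to adjust to the actual SLO. A complete script would also import `http` from `k6/http` and export a default function that issues the requests.

```typescript
// Options fragment for a k6 script (the request function is omitted here).
export const options = {
  scenarios: {
    // Hypothetical scenario name; one ramping profile covering all three phases.
    ramp_sustain_spike: {
      executor: "ramping-vus",
      startVUs: 0,
      stages: [
        { duration: "2m", target: 50 },   // ramp-up
        { duration: "5m", target: 50 },   // sustained load
        { duration: "30s", target: 200 }, // spike
        { duration: "1m", target: 0 },    // ramp-down
      ],
    },
  },
  thresholds: {
    // Latency thresholds in milliseconds, taken from the bullet above.
    http_req_duration: ["p(95)<300", "p(99)<800"],
    // Assumed error budget: fewer than 1% failed requests.
    http_req_failed: ["rate<0.01"],
  },
};
```

When any threshold fails, `k6 run` exits non-zero, which is what lets a CI job fail on the breach without extra scripting.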
Other stacks
- Python / FastAPI: py-spy (flame graphs) / cProfile (call statistics); gunicorn/uvicorn worker tuning; pgbouncer
- Go: pprof (built-in, excellent); goroutine/heap profiles; database/sql SetMaxOpenConns
- Universal: k6 (Go engine, JS scripts) is stack-agnostic for load testing; p95/p99 + pooling are universal; flame graphs exist for every runtime
Related skills
- query-optimization: DB query time is usually the top bottleneck
- caching-strategy: cache after optimizing, to shed read load
- observability-setup: metrics tell you WHERE to profile
Reference
- Key insight encoded: Define SLO-aligned p(95)/p(99) thresholds in the load test so a breach fails CI automatically; tail latency, not averages, defines API health.
- Caveat: clinic.js is no longer actively maintained; cite it for flame-graph concepts but recommend node --prof / 0x for profiling and k6 for load testing as the live toolchain.