performance-profiling - SKILL.md Agent Skill

name: performance-profiling
description: Find and fix backend bottlenecks (connection pooling, p95/p99 latency, k6 load testing with SLO-aligned thresholds, and CPU profiling with flame graphs). Use when latency is high, throughput plateaus, or before scaling traffic. Not for DB query specifics (use query-optimization) or read-load shedding (use caching-strategy).
license: MIT

Performance Profiling

Purpose

Locate the actual backend bottleneck with data (not intuition), fix it, and prove the fix with load tests gated on tail latency — so the system scales predictably.

Universal: measure-first profiling, p95/p99 over averages, connection pooling, and SLO-gated load testing are backend-performance principles on any stack; only the profiler and the load-testing tool differ.

Procedure

1. Measure before optimizing: find the real bottleneck
- Use observability metrics (observability-setup) to see WHERE time goes: DB? external call? CPU? lock contention?
- The bottleneck is usually DB query time (→ query-optimization), not app CPU; confirm before profiling app code
2. Right-size connection pooling
- DB connections are scarce; an unbounded or oversized pool exhausts the DB, an undersized one starves the app
- Use a transaction-mode connection pooler for serverless/high-concurrency workloads (specific poolers in Implementation)
- Pool size per instance ≈ DB max connections ÷ app instances; bigger is not better
3. Load test with SLO-aligned thresholds
- Define p(95) / p(99) latency + error-rate thresholds IN the test
- Tail latency (p99), not the average, defines API health; averages hide the slow requests users feel
- Test realistic scenarios (ramp-up, sustained, spike), not just a flat hammer
4. CPU profiling when app code is the bottleneck
- Capture a flame graph; find the hot path (sync work blocking the event loop in Node, GIL contention in Python)
- Move CPU-heavy work off the request path (worker / background-jobs) or optimize the hot function
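
The pool-sizing rule from the connection-pooling step (DB max connections ÷ instances) can be written down as a small calculation. This is a minimal sketch, not a library API: the function name, its parameters, and the idea of reserving a few connections for migrations or admin sessions are assumptions made for illustration.

```typescript
// Hypothetical helper: the name, the parameters, and the reserved-headroom
// idea are illustrative assumptions, not part of any library.
function poolSizePerInstance(
  dbMaxConnections: number,
  appInstances: number,
  reservedConnections = 0, // assumption: slots kept free for migrations/admin
): number {
  if (appInstances < 1) {
    throw new RangeError("appInstances must be at least 1");
  }
  const usable = dbMaxConnections - reservedConnections;
  if (usable < appInstances) {
    throw new RangeError("fewer usable connections than instances; add a pooler or reduce instances");
  }
  // Round down so the total across all instances never exceeds the budget.
  return Math.floor(usable / appInstances);
}

// Example: 100 max connections, 10 reserved, 4 app instances.
const perInstance = poolSizePerInstance(100, 4, 10);
console.log(perInstance); // 22
```

Rounding down keeps the sum of all pools within the database limit; when the result is too small to be useful, that is usually the signal to put a transaction-mode pooler in front of the database.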

4b. Memory profiling: leaks and unbounded growth

"Slow over time" is often a leak, not a CPU issue: take two heap snapshots at different points in the process's uptime and diff them (Comparison view and Retainers pane in Chrome DevTools; capture with node --heapsnapshot-signal=SIGUSR2)
Watch RSS / heapUsed trends in observability; a steady climb between deploys indicates a leak
Common Node leaks: event-listener accumulation, closures over large objects in a module-level cache, unbounded Map/array
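
One of the leak sources listed above, an unbounded module-level Map, can be contained by capping its size. The sketch below is one possible shape for that fix, assuming oldest-first eviction is acceptable; `BoundedCache` and `maxEntries` are hypothetical names, not part of Node or any library.

```typescript
// Hypothetical BoundedCache: caps a Map so it cannot grow without limit.
class BoundedCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    if (maxEntries < 1) throw new RangeError("maxEntries must be at least 1");
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    return this.entries.get(key);
  }

  set(key: K, value: V): void {
    // Delete first so a re-inserted key moves to the end of insertion order.
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new BoundedCache<string, number>(2);
cache.set("a", 1);
cache.set("b", 2);
cache.set("c", 3); // evicts "a", the oldest entry
console.log(cache.size, cache.get("a")); // 2 undefined
```

With a cap in place, heapUsed should plateau instead of climbing; if it still climbs, the heap snapshot diff will point at a different retainer.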

4c. Event-loop lag as a first-class signal (Node-specific, conceptually general)

When the event loop blocks, every concurrent request stalls, not just the slow one. Monitor performance.eventLoopUtilization() and monitorEventLoopDelay() (both from perf_hooks) and alert when they cross a threshold
Equivalent signals in any runtime: GC pause percentage, or garbage-collection time per minute
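
To make the event-loop signal concrete, here is a minimal sketch built on Node's `perf_hooks` module. `monitorEventLoopDelay` and `performance.eventLoopUtilization` are real Node APIs; the `LoopLimits` and `LoopSample` shapes, the function names, the sampling interval, and the `onBreach` callback are assumptions for illustration, and the limits should be tuned against your own SLOs.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical shapes; not part of Node.
interface LoopLimits {
  maxUtilization: number; // fraction of the window the loop was busy, 0..1
  maxP99DelayMs: number; // p99 event-loop delay in milliseconds
}

interface LoopSample {
  utilization: number;
  p99DelayMs: number;
}

// Pure check, so the alert rule can be unit-tested without timers.
function isLoopUnhealthy(sample: LoopSample, limits: LoopLimits): boolean {
  return (
    sample.utilization > limits.maxUtilization ||
    sample.p99DelayMs > limits.maxP99DelayMs
  );
}

// Samples the loop every intervalMs and reports breaches through onBreach.
// Returns a function that stops the monitoring.
function watchEventLoop(
  limits: LoopLimits,
  onBreach: (sample: LoopSample) => void,
  intervalMs = 10_000,
): () => void {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  const timer = setInterval(() => {
    const current = performance.eventLoopUtilization();
    // Passing both readings yields the utilization for this window only.
    const delta = performance.eventLoopUtilization(current, previous);
    previous = current;
    const sample: LoopSample = {
      utilization: delta.utilization,
      p99DelayMs: histogram.percentile(99) / 1e6, // histogram is in nanoseconds
    };
    histogram.reset();
    if (isLoopUnhealthy(sample, limits)) onBreach(sample);
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring

  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

A caller might start it once at boot, for example `watchEventLoop({ maxUtilization: 0.9, maxP99DelayMs: 100 }, (s) => console.warn("event loop lag", s))`, and forward breaches to whatever alerting the observability setup already uses.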

5. Address tail latency causes
- GC pauses, cold caches, connection acquisition waits, N+1 (→ query-optimization), lock contention (→ transaction-management)
6. Validate (validation loop)
- Re-run the load test after each fix; if p99 still breaches the SLO threshold → the bottleneck moved; re-profile and fix the new top contributor
- Gate the load test in CI (fail on threshold breach) so regressions are caught
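
The reason the procedure gates on p95/p99 rather than the average is easiest to see with numbers. The sketch below uses a nearest-rank percentile, which is an assumption chosen for simplicity (a load tool may compute percentiles slightly differently); the function names and the sample data are hypothetical.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical SLO shape: a limit passes only when the percentile is below it.
interface SloThresholds {
  p95Ms: number;
  p99Ms: number;
}

function breachesSlo(samplesMs: number[], slo: SloThresholds): boolean {
  return (
    percentile(samplesMs, 95) >= slo.p95Ms ||
    percentile(samplesMs, 99) >= slo.p99Ms
  );
}

// 98 fast requests and 2 very slow ones (made-up data).
const samples = [
  ...new Array<number>(98).fill(50),
  ...new Array<number>(2).fill(2000),
];
const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;

console.log(mean); // 89
console.log(percentile(samples, 95)); // 50
console.log(percentile(samples, 99)); // 2000
console.log(breachesSlo(samples, { p95Ms: 300, p99Ms: 800 })); // true
```

An 89 ms average looks healthy, yet two requests in a hundred took two seconds. Only the p99 check catches that, which is why the validation loop re-runs the load test against tail thresholds after every fix.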

Anti-patterns

❌ Anti-pattern	✅ Correct
Optimizing by guessing	Profile/measure first; fix the proven bottleneck
Tuning on average latency	Gate on p95/p99 tail latency
Unbounded / "bigger" connection pool	Right-sized pool + transaction-mode pooler
CPU-heavy work on the request thread (blocks event loop)	Offload to a worker / background job
Load test = flat hammer	Realistic ramp/sustained/spike scenarios
"Slow over time" treated as CPU when it's a leak	Heap snapshot diff between two timepoints; track RSS trend
Event-loop lag unmonitored (silent stalls)	Track `eventLoopUtilization` / GC pause % with alert thresholds

Severity tiers

Tier	Examples	Action SLA
Critical	Connection pool exhaustion causing outages; p99 > SLO by multiples; event loop blocked by sync CPU work	Block release; fix immediately
Major	No load test before a traffic event; pool size guessed; tail latency unmeasured	Fix this sprint
Minor	Load test not in CI; minor GC tuning opportunity	Schedule within 2 sprints

Completion Criteria

Bottleneck identified with metrics/profiler (not guessed)
Connection pool right-sized + pooler in place
Load test defines + passes p95/p99 + error-rate thresholds
CPU hot paths offloaded or optimized
Load test gated in CI

Output

Load test scripts: k6 with SLO thresholds (exit-code gating)
Profiling report: docs/perf-profile-YYYY-MM-DD.md — bottleneck, fix, before/after p95/p99
Commit format: perf(backend): right-size connection pool / perf(backend): offload <work> to worker

Implementation

TypeScript + Node + Postgres (default)

Pooling: PgBouncer or Supabase pooler (transaction mode); Prisma connection_limit tuned to the pool size
Load test: Grafana k6 with thresholds: { http_req_duration: ['p(95)<300','p(99)<800'] }; k6 exits with code 99 when a threshold is breached, which fails the CI job
Profiling: node --prof (V8 tick profile, summarized with node --prof-process) or 0x (flame graphs). Note: clinic.js is no longer actively maintained, so prefer node --prof / 0x
Watch: event-loop lag (don't block it with sync CPU)
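
As a sketch of how the thresholds above combine with ramp-up, sustained, and spike phases, a k6 options object could look like the following. The latency limits are the ones quoted in this section; the stage durations, virtual-user targets, and the 1% error budget are illustrative assumptions, and the request logic (the script's default function) is left out.

```typescript
// Sketch of a k6 options object. In the actual k6 script it would be
// exported as `options`, e.g. `export const options = sloOptions;`.
const sloOptions = {
  stages: [
    { duration: "2m", target: 50 }, // ramp-up (assumed values)
    { duration: "5m", target: 50 }, // sustained load
    { duration: "30s", target: 200 }, // spike
    { duration: "1m", target: 0 }, // ramp-down
  ],
  thresholds: {
    // Tail-latency gates from this section, in milliseconds.
    http_req_duration: ["p(95)<300", "p(99)<800"],
    // Assumed error budget: fewer than 1% failed requests.
    http_req_failed: ["rate<0.01"],
  },
};

console.log(Object.keys(sloOptions.thresholds).length); // 2
```

Because a breached threshold makes k6 exit with a non-zero code, running the script as a CI step is enough to gate merges; no extra result parsing is needed.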

Other stacks

Python / FastAPI: py-spy (flame graphs) / cProfile; gunicorn/uvicorn worker tuning; PgBouncer
Go: pprof (built-in, excellent); goroutine/heap profiles; database/sql SetMaxOpenConns
Universal: k6 (Go engine, JS scripts) is stack-agnostic for load testing; p95/p99 + pooling are universal; flame graphs exist for every runtime

Related skills

query-optimization — DB query time is usually the top bottleneck
caching-strategy — cache after optimizing, to shed read load
observability-setup — metrics tell you WHERE to profile

Reference

Key insight encoded: Define SLO-aligned p(95)/p(99) thresholds in the load test so a breach fails CI automatically — tail latency, not averages, defines API health.
Caveat: clinic.js is no longer actively maintained; cite it for flame-graph concepts, but recommend node --prof / 0x for profiling and k6 for load testing as the live toolchain.