performance-profiling

Find and fix backend bottlenecks — connection pooling, p95/p99 latency, load testing with SLO-aligned thresholds (k6), and CPU profiling (flame graphs). Use when latency is high, throughput plateaus, or before scaling traffic. Not for DB query specifics (use query-optimization) or read-load shedding (use caching-strategy).

By JayKim88. Updated 6/8/2026

name: performance-profiling
description: Find and fix backend bottlenecks: connection pooling, p95/p99 latency, load testing with SLO-aligned thresholds (k6), and CPU profiling (flame graphs). Use when latency is high, throughput plateaus, or before scaling traffic. Not for DB query specifics (use query-optimization) or read-load shedding (use caching-strategy).
license: MIT

Performance Profiling

Purpose

Locate the actual backend bottleneck with data (not intuition), fix it, and prove the fix with load tests gated on tail latency — so the system scales predictably.

Universal: measure-first profiling, p95/p99 over averages, connection pooling, and SLO-gated load testing are backend performance principles that hold on any stack; only the profiler and load-testing tool differ.

Procedure

  1. Measure before optimizing — find the real bottleneck

    • Use observability metrics (observability-setup) to see WHERE time goes: DB? external call? CPU? lock contention?
    • The bottleneck is usually DB query time (→ query-optimization), not app CPU — confirm before profiling app code
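As a rough illustration of "see WHERE time goes", the sketch below aggregates per-phase timings and ranks them by total time. The `PhaseSample` shape and the phase names (`db`, `external`, `cpu`) are assumptions made for this example; in practice the numbers would come from whatever the observability stack already records.

```typescript
// Hypothetical shape: one timing sample per phase of a request.
interface PhaseSample {
  phase: string; // e.g. "db", "external", "cpu"
  ms: number;
}

interface PhaseShare {
  phase: string;
  totalMs: number;
  share: number; // fraction of all measured time, 0..1
}

// Sum time per phase and sort so the largest contributor comes first.
function rankPhases(samples: PhaseSample[]): PhaseShare[] {
  const totals = new Map<string, number>();
  for (const sample of samples) {
    totals.set(sample.phase, (totals.get(sample.phase) ?? 0) + sample.ms);
  }
  const grandTotal = Array.from(totals.values()).reduce((a, b) => a + b, 0);
  return Array.from(totals.entries())
    .map(([phase, totalMs]) => ({
      phase,
      totalMs,
      share: grandTotal === 0 ? 0 : totalMs / grandTotal,
    }))
    .sort((a, b) => b.totalMs - a.totalMs);
}

// Illustrative numbers only, not measurements.
const ranked = rankPhases([
  { phase: "db", ms: 120 },
  { phase: "cpu", ms: 15 },
  { phase: "external", ms: 40 },
  { phase: "db", ms: 25 },
]);
console.log(ranked[0].phase, ranked[0].totalMs); // db 145
```

If the top entry is `db`, the next step is query-optimization rather than profiling app code.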
  2. Right-size connection pooling

    • DB connections are scarce; an unbounded/oversized pool exhausts the DB, an undersized one starves the app
    • Use a transaction-mode connection pooler for serverless/high-concurrency (specific poolers in Implementation)
    • Pool size per instance ≈ DB max connections ÷ number of app instances, not "bigger is better"
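The sizing rule can be written down as a small calculation. This is a minimal sketch: the `reserved` parameter (connections held back for migrations, admin sessions, and monitoring) is an assumption added here, and the numbers in the usage line are illustrative.

```typescript
// Split the database's connection budget evenly across app instances.
// `reserved` is an assumption of this sketch: connections kept free for
// migrations, admin sessions, and monitoring.
function poolSizePerInstance(
  dbMaxConnections: number,
  instances: number,
  reserved: number = 0,
): number {
  if (instances < 1) {
    throw new RangeError("instances must be at least 1");
  }
  const perInstance = Math.floor((dbMaxConnections - reserved) / instances);
  if (perInstance < 1) {
    // More instances than connections: a bigger pool cannot fix this,
    // a transaction-mode pooler in front of the database can.
    throw new RangeError("connection budget too small for this instance count");
  }
  return perInstance;
}

console.log(poolSizePerInstance(100, 4, 10)); // 22
```

When the function throws, the instance count has outgrown the database's connection limit, which is the case the transaction-mode pooler addresses.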
  3. Load test with SLO-aligned thresholds

    • Define p(95) / p(99) latency + error-rate thresholds IN the test
    • Tail latency (p99), not average, defines API health — averages hide the slow requests users feel
    • Test realistic scenarios (ramp-up, sustained, spike), not just a flat hammer
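To make the "averages hide the slow requests" point concrete, here is a small sketch that compares the mean with nearest-rank percentiles. The latency values are invented for illustration, and load tools may use a slightly different percentile interpolation.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new RangeError("no samples");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((a, b) => a + b, 0) / samples.length;
}

// Invented data: 98 fast requests at 50 ms, 2 slow requests at 2000 ms.
const latenciesMs: number[] = new Array<number>(98)
  .fill(50)
  .concat(new Array<number>(2).fill(2000));

console.log(mean(latenciesMs)); // 89
console.log(percentile(latenciesMs, 95)); // 50
console.log(percentile(latenciesMs, 99)); // 2000
```

The mean of 89 ms looks healthy, while p99 shows that one request in fifty takes two seconds.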
  4. CPU profiling when app code is the bottleneck

    • Capture a flame graph; find the hot path (sync work blocking the event loop in Node, GIL contention in Python)
    • Move CPU-heavy work off the request path (worker / background-jobs) or optimize the hot function
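One way to move CPU-heavy work off the request path in Node is a worker thread. The sketch below assumes Node's built-in `worker_threads` module; the naive `fib` function is only a stand-in for real CPU-bound work, and a production setup would more likely reuse a worker pool or hand the job to a background queue than spawn a worker per call.

```typescript
import { Worker } from "node:worker_threads";

// Worker body as a plain CommonJS script, run via the `eval: true` option.
// `fib` is a deliberately slow stand-in for real CPU-bound work.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  function fib(n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }
  parentPort.postMessage(fib(workerData));
`;

// Runs the computation on a separate thread so the main event loop stays
// free to serve other requests while it runs.
function fibInWorker(n: number): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: n });
    worker.once("message", (result: number) => resolve(result));
    worker.once("error", reject);
    worker.once("exit", (code: number) => {
      if (code !== 0) reject(new Error("worker exited with code " + code));
    });
  });
}

fibInWorker(10).then((result) => console.log(result)); // 55
```

Spawning a worker has its own cost, so this only pays off for work that is long enough to show up as a hot path in the flame graph.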

4b. Memory profiling — leaks and unbounded growth

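  • "Slow over time" is often a leak, not a CPU issue: take two heap snapshots at different points in the process's lifetime and diff them (Comparison view and Retainers pane in the Chrome DevTools Memory panel; capture snapshots with e.g. node --heapsnapshot-signal=SIGUSR2)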
  • Watch RSS / heapUsed trends in observability; a steady climb between deploys = leak
  • Common Node leaks: event-listener accumulation, closures over large objects in a module-level cache, unbounded Map/array
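For the "unbounded Map" case, a common remedy is to cap the cache size. The class below is a minimal sketch of one such cap (evict the oldest entry once a limit is reached); the class name and the eviction policy are choices made for this example, not something the skill prescribes.

```typescript
// A module-level cache with a hard size limit. When the limit is exceeded,
// the entry that was written longest ago is evicted.
class BoundedCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    if (maxEntries < 1) {
      throw new RangeError("maxEntries must be at least 1");
    }
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    return this.entries.get(key);
  }

  set(key: K, value: V): void {
    // Delete first so a rewritten key moves to the end of insertion order.
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new BoundedCache<string, number>(2);
cache.set("a", 1);
cache.set("b", 2);
cache.set("c", 3); // evicts "a"
console.log(cache.size, cache.get("a")); // 2 undefined
```

With a cap in place, heapUsed for this cache plateaus instead of climbing between deploys, which makes any remaining leak easier to spot in the snapshot diff.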

4c. Event-loop lag as a first-class signal (Node-specific, conceptually general)

  • When the event loop blocks, every concurrent request stalls (not just the slow one). Monitor performance.eventLoopUtilization() / monitorEventLoopDelay() from perf_hooks and alert when they cross a threshold
  • Equivalents in other runtimes: GC pause percentage, or garbage-collection time per minute
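A minimal sketch of sampling both signals with Node's built-in `perf_hooks` module follows. The function name `sampleEventLoop`, the 20 ms resolution, and the 100 ms threshold in the usage line are assumptions for illustration; a real setup would export the values to the metrics system and alert there.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Histogram of event-loop delay, sampled every 20 ms (values in nanoseconds).
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

let lastElu = performance.eventLoopUtilization();

interface EventLoopSample {
  utilization: number; // share of time the loop was busy since the last sample
  p99DelayMs: number; // p99 event-loop delay since the last sample
  breached: boolean; // true when p99 delay exceeds the threshold
}

// Call on an interval (for example from a metrics reporter) to read both
// signals for the window since the previous call.
function sampleEventLoop(thresholdMs: number): EventLoopSample {
  const current = performance.eventLoopUtilization();
  const delta = performance.eventLoopUtilization(current, lastElu);
  lastElu = current;

  const p99DelayMs = histogram.percentile(99) / 1e6; // ns to ms
  histogram.reset();

  return {
    utilization: delta.utilization,
    p99DelayMs,
    breached: p99DelayMs > thresholdMs,
  };
}

console.log(sampleEventLoop(100));
```

Calling `histogram.disable()` stops the sampling when the monitor is no longer needed.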
  5. Address tail latency causes

    • GC pauses, cold caches, connection acquisition waits, N+1 (→ query-optimization), lock contention (→ transaction-management)
  6. Validate (validation loop)

    • Re-run the load test after each fix; if p99 still breaches the SLO threshold → the bottleneck moved; re-profile and fix the new top contributor
    • Gate the load test in CI (fail on threshold breach) so regressions are caught

Anti-patterns

| ❌ Anti-pattern | ✅ Correct |
| --- | --- |
| Optimizing by guessing | Profile/measure first; fix the proven bottleneck |
| Tuning on average latency | Gate on p95/p99 tail latency |
| Unbounded / "bigger" connection pool | Right-sized pool + transaction-mode pooler |
| CPU-heavy work on the request thread (blocks event loop) | Offload to a worker / background job |
| Load test = flat hammer | Realistic ramp/sustained/spike scenarios |
| "Slow over time" treated as CPU when it's a leak | Heap snapshot diff between two timepoints; track RSS trend |
| Event-loop lag unmonitored (silent stalls) | Track eventLoopUtilization / GC pause % with alert thresholds |

Severity tiers

| Tier | Examples | Action SLA |
| --- | --- | --- |
| Critical | Connection pool exhaustion causing outages; p99 > SLO by multiples; event loop blocked by sync CPU work | Block release; fix immediately |
| Major | No load test before a traffic event; pool size guessed; tail latency unmeasured | Fix this sprint |
| Minor | Load test not in CI; minor GC tuning opportunity | Schedule within 2 sprints |

Completion Criteria

  • Bottleneck identified with metrics/profiler (not guessed)
  • Connection pool right-sized + pooler in place
  • Load test defines + passes p95/p99 + error-rate thresholds
  • CPU hot paths offloaded or optimized
  • Load test gated in CI

Output

  • Load test scripts: k6 with SLO thresholds (exit-code gating)
  • Profiling report: docs/perf-profile-YYYY-MM-DD.md — bottleneck, fix, before/after p95/p99
  • Commit format: perf(backend): right-size connection pool / perf(backend): offload <work> to worker

Implementation

TypeScript + Node + Postgres (default)

  • Pooling: PgBouncer or Supabase pooler (transaction mode); Prisma connection limit tuned to pool
  • Load test: Grafana k6 with thresholds: { http_req_duration: ['p(95)<300','p(99)<800'] }; CI fails on breach (k6 exits with code 99)
  • Profiling: 0x for flame graphs; node --prof (post-processed with node --prof-process) for V8 tick profiles (note: clinic.js is no longer actively maintained)
  • Watch: event-loop lag (don't block it with sync CPU)
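The k6 thresholds above, combined with the ramp/sustained/spike scenarios from the Procedure, might look like the options object below. This is a sketch under assumptions: the interfaces are minimal local types rather than k6's own typings, the stage durations, VU counts, and the 1% error-rate limit are placeholders, and the object is shown standalone so it can be read without k6 installed.

```typescript
// Minimal local types for the subset of k6 options used here.
interface Stage {
  duration: string;
  target: number; // target number of virtual users
}

interface Scenario {
  executor: "ramping-vus" | "constant-vus";
  startTime?: string;
  startVUs?: number;
  vus?: number;
  duration?: string;
  stages?: Stage[];
}

interface LoadTestOptions {
  scenarios: Record<string, Scenario>;
  thresholds: Record<string, string[]>;
}

// In a real k6 script this object is exported as `options`, next to a
// default function that issues the requests, for example:
//   import http from "k6/http";
//   export default function () { http.get(TARGET_URL); } // TARGET_URL is hypothetical
const options: LoadTestOptions = {
  scenarios: {
    ramp_up: {
      executor: "ramping-vus",
      startVUs: 0,
      stages: [{ duration: "2m", target: 50 }],
    },
    sustained: {
      executor: "constant-vus",
      startTime: "2m",
      vus: 50,
      duration: "5m",
    },
    spike: {
      executor: "ramping-vus",
      startTime: "7m",
      startVUs: 50,
      stages: [
        { duration: "30s", target: 200 },
        { duration: "1m", target: 200 },
        { duration: "30s", target: 0 },
      ],
    },
  },
  thresholds: {
    // Tail-latency limits from this section, in milliseconds.
    http_req_duration: ["p(95)<300", "p(99)<800"],
    // Error-rate limit; 1% is a placeholder, not a recommendation.
    http_req_failed: ["rate<0.01"],
  },
};

console.log(Object.keys(options.scenarios).join(", ")); // ramp_up, sustained, spike
```

Because a threshold breach makes k6 exit with a non-zero code, a CI step that runs the script fails without extra glue.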

Other stacks

  • Python / FastAPI: py-spy / cProfile flame graphs; gunicorn/uvicorn worker tuning; pgbouncer
  • Go: pprof (built-in, excellent); goroutine/heap profiles; database/sql SetMaxOpenConns
  • Universal: k6 (Go engine, JS scripts) is stack-agnostic for load testing; p95/p99 + pooling are universal; flame graphs exist for every runtime

Related skills

  • query-optimization — DB query time is usually the top bottleneck
  • caching-strategy — cache after optimizing, to shed read load
  • observability-setup — metrics tell you WHERE to profile

Reference

  • Key insight encoded: Define SLO-aligned p(95)/p(99) thresholds in the load test so a breach fails CI automatically — tail latency, not averages, defines API health.
  • Caveat: clinic.js is no longer actively maintained; cite it for flame-graph concepts but recommend node --prof / 0x for profiling and k6 for load testing as the live toolchain.
Install via CLI
npx skills add https://github.com/JayKim88/claude-ai-engineering --skill performance-profiling