name: dev:tool-perf description: > Performance engineering workflow grounded in MIT 6.172 (Leiserson/Shun, fully available on MIT OCW) and "Systems Performance" (Brendan Gregg, Addison-Wesley 2020). Covers USE Method (Utilization/Saturation/Errors), flamegraph analysis, bottleneck identification, Bentley Rules, and measurable before/after benchmarking. Use this skill whenever the user has a performance problem, slow code, high latency, CPU/memory spikes, or wants to optimize a system. Trigger: "performance problem", "slow API", "high latency", "CPU spike", "memory leak", "analyze flamegraph", "find bottleneck", "optimize code", "profiling", "p99 too high", "increase throughput", "USE Method", "why is this so slow", "load test shows problems". Covers: USE Method, flamegraph analysis, CPU/memory/I-O/network profiling, Bentley Rules, measurement and benchmarking.
Performance Engineering (tool-perf)
Structured performance analysis workflow — from symptom description to verified fix. Grounded in MIT 6.172 and Brendan Gregg "Systems Performance".
Core Philosophy (MIT 6.172 + Brendan Gregg)
"Measure, don't guess." — Core principle of performance engineering
"Never tune for performance without first having a performance target." — Brendan Gregg, Systems Performance
Always measure before optimizing. Without profiling data, optimization is guessing. USE Method provides the systematic framework — flamegraphs show WHERE time is spent.
Step 0 — Clarify Scope and Symptom
Questions:
- What is the concrete symptom? (High latency / high CPU / memory growth / low throughput)
- Stack: web service / CLI / database / low-level C/C++ / JVM / Python?
- Are there metrics? (Prometheus, Datadog, APM tool, when did it get worse?)
- Is there a performance target? (p99 < 200ms, throughput > 1000 req/s)
- Production problem or pre-release optimization?
→ Without concrete numbers and a target, performance work cannot be completed.
Step 1 — USE Method: Locate the Bottleneck
(→ references/use-method.md for complete resource checklist)
Measure for each relevant resource:
- Utilization: How many % of capacity is being used?
- Saturation: Are queues forming?
- Errors: Are there error events?
Quick diagnosis by symptom:
| Symptom | Suspicion | Next step |
|---|---|---|
| High latency, low CPU | I/O-bound | Disk queue, network drops, DB latency |
| High CPU, scales with traffic | CPU-bound | CPU profiling, flamegraph |
| Requests queueing | Concurrency limit | Thread/worker pool saturation |
| Memory growing over time | Memory leak | Heap dump, allocation profiling |
| Sporadic spikes | Lock contention or GC | Mutex saturation, GC logs |
→ Identify the most likely bottleneck, then analyze deeper.
Step 2 — Profiling and Flamegraph
2a — Choose profiling tool (by stack)
| Stack | CPU Profiler | Memory Profiler |
|---|---|---|
| Linux (C/C++/Go) | perf record + perf report |
Valgrind, heaptrack |
| JVM (Java/Kotlin/Scala) | async-profiler, JFR | JVM Heap Dump + Eclipse MAT |
| Python | py-spy, cProfile |
memory-profiler, Tracemalloc |
| Node.js | --prof, Clinic.js |
--inspect + Chrome DevTools |
| Go | pprof |
pprof Heap Profile |
| Browser (JS) | Chrome DevTools Performance | Chrome DevTools Memory |
2b — Create and read flamegraph
perf record -F 99 -g -- <command>
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
Reading a flamegraph:
- X-axis = share of total sampling time (not time progression)
- Y-axis = call stack (bottom = entry point, top = where time is spent)
- Width of a block = % time in that function + callees
- Plateau (wide flat block at top) = hotspot — optimize there
2c — Identify top hotspot
Locate the widest block at the very top of the flamegraph. That is the function consuming the most CPU time.
Step 3 — Bottleneck Analysis
CPU-bound:
- Algorithm complexity (O(n²) where O(n log n) possible?)
- Apply Bentley Rules (→
references/bentley-rules.md) - Cache efficiency: check memory access pattern (row-major vs column-major)
- Parallelization: single-threaded bottleneck → multi-threading
Memory-bound:
- Heap profiling: which allocations dominate?
- Reduce GC pressure: object pooling, fewer short-lived objects
- Memory leak: which objects grow over time?
I/O-bound:
- DB queries: EXPLAIN ANALYZE, missing indexes, N+1 queries
- Disk: async I/O, batching, sequential instead of random access
- Network: connection pooling, keep-alive, compression
Lock contention:
- Minimize critical section
- Lock-free data structures where possible
- Read-write lock instead of exclusive mutex
Step 4 — Optimization with Bentley Rules
(→ references/bentley-rules.md for complete rules)
Before optimizing: Record baseline benchmark (exact numbers).
Priority by impact:
- Algorithm/data structure — greatest impact (O(n²) → O(n log n))
- Caching — expensive computations done once
- Loop optimizations — hoisting, fusion, early exit
- Memory access patterns — cache friendliness
Never optimize without measurement (premature optimization).
Step 5 — Measurement and Verification
5a — Record benchmark
Always: measure before optimization, measure after optimization. Same conditions, sufficient runs (separate warmup + measurement phase).
5b — Benchmark tools
| Type | Tool |
|---|---|
| Microbenchmark | JMH (Java), criterion (Rust), benchmark (Go), pytest-benchmark |
| HTTP load test | k6, wrk, hey, Apache Bench |
| Profiler-based | Compare flamegraph before/after |
| System level | perf stat (CPU cycles, cache misses, instructions) |
5c — Document results
| Metric | Before | After | Improvement |
|---|---|---|---|
| p50 latency | X ms | Y ms | Z% |
| p99 latency | X ms | Y ms | Z% |
| Throughput (req/s) | X | Y | Z% |
| CPU (%) | X | Y | Z% |
Output — perf-findings.md
# Performance Findings — [Service/Component]
## Symptom
[Concrete description: p99 latency 800ms, target < 200ms]
## USE Method Results
| Resource | U | S | E | Finding |
|---|---|---|---|---|
| CPU | 85% | Load 4 (2 Cores) | 0 | CPU-bound |
| Memory | 60% | 0 | 0 | OK |
## Root Cause
[Flamegraph shows: 73% time in `parseJson()` — inefficient parser]
## Action
[Switch to faster JSON parser, cache result]
## Measurement Results
| Metric | Before | After |
|---|---|---|
| p99 latency | 800ms | 120ms |
| CPU | 85% | 35% |
Reference Files
references/curriculum-mapping.md— Concept → MIT 6.172 lecture + Brendan Gregg chapterreferences/use-method.md— Resource checklist (U/S/E), quick diagnosis, p99 vs. averagereferences/bentley-rules.md— Data structure / logic / loop / memory optimization rules