perf-engineering - SKILL.md Agent Skill

name: perf-engineering description: > Performance engineering guidance for CPU and memory optimization across languages (Rust, C/C++, TypeScript/JavaScript, Go, Python). Use this skill whenever the user asks about optimizing code performance, reducing memory allocations, improving cache locality, SIMD/vectorization, profiling, benchmarking, or any question about making code faster or more memory-efficient. Also trigger when the user mentions: hot loops, allocation pressure, cache misses, false sharing, memory pools, arena allocation, string interning, branch prediction, auto-vectorization, zero-copy, AoS vs SoA, data-oriented design, or profiling tools (perf, flamegraph, Instruments, VTune, cachegrind). Trigger even for indirect performance questions like "why is this slow", "this function is a bottleneck", "how to reduce memory usage", or "should I optimize this".

Performance Engineering: CPU & Memory Optimization

This skill provides structured guidance for performance optimization decisions. It covers the ten core areas below and helps you apply them to real code.

Philosophy: Measure First, Optimize Second

The single most important rule: never optimize without profiling data. Intuition about performance is notoriously unreliable. A function that "looks slow" might account for 0.1% of runtime. Always start with a profiler, identify the actual bottleneck, then apply the appropriate technique from this skill.

The goal is not to make every line of code fast — it's to make the right code fast, in the right way, with the minimum complexity cost.

How to Use This Skill

Identify the bottleneck category — is it CPU-bound (computation, branch mispredicts, poor vectorization) or memory-bound (allocations, cache misses, layout issues)?
Read the relevant reference file(s) from references/ for deep guidance
Apply the technique with the language-specific patterns provided
Verify the improvement with benchmarks before and after

Reference Files

Load the appropriate reference based on the optimization area:

references/memory-optimization.md — Stack vs heap, memory pooling, arena allocation, string interning, zero-copy techniques. Read this for anything related to reducing allocations, memory layout, or memory bandwidth.
references/cpu-optimization.md — Cache locality (AoS vs SoA), false sharing, branch prediction, SIMD/vectorization, loop optimization, lazy evaluation. Read this for anything related to making computation faster.
references/profiling-guide.md — When NOT to optimize, profiling tools by language/platform, benchmarking methodology, interpreting results, common pitfalls. Read this FIRST if the user hasn't profiled yet.

Decision Flowchart

When someone asks "how do I make this faster":

Have they profiled? → If no, load references/profiling-guide.md and help them profile first
Is the bottleneck allocation-heavy? → Load references/memory-optimization.md
Is the bottleneck compute-heavy? → Load references/cpu-optimization.md
Is it both? → Load both references
Are they designing a new system? → Load both references for upfront architecture guidance

Quick Reference: When Each Technique Matters

Technique	When It Helps	When It Doesn't
Stack allocation	Tight loops, small fixed-size data	Large/dynamic-sized data
Arena allocation	Many short-lived objects with shared lifetime	Long-lived objects with varied lifetimes
String interning	Repeated string comparisons, deduplication	Unique strings, write-heavy workloads
AoS → SoA	Iterating one field across many entities	Accessing all fields of one entity
Memory pooling	Frequent alloc/dealloc of same-sized objects	Varied-size allocations
SIMD	Uniform operations on contiguous data	Branchy, data-dependent logic
Branch prediction	Hot loops with predictable patterns	Cold code, unpredictable data
Zero-copy	Large buffers passed between stages	Small data, ownership semantics needed
Lazy evaluation	Expensive computation that's often skipped	Always-needed values (adds overhead)
False sharing fixes	Multithreaded counters/accumulators	Single-threaded or read-only shared data

Keywords

performance, optimization, profiling, benchmark, cache, SIMD, vectorization, allocation, memory pool, arena, false sharing, AoS, SoA, data-oriented design, zero-copy, branch prediction, string interning, flamegraph, perf, hot loop, bottleneck, latency, throughput, memory bandwidth