name: linux-perf description: > Linux perf performance analysis expert covering the complete workflow from environment setup through data collection, analysis, and bottleneck diagnosis. Expert in CPU profiling, call graph analysis, syscall tracing, and performance optimization. Use this skill whenever profiling C/C++/Rust/Go applications, diagnosing CPU bottlenecks, analyzing syscall overhead, or optimizing hot paths. IMPORTANT: Check permissions first. User-space profiling works without sudo, but kernel symbols require sudo to modify kptr_restrict and perf_event_paranoid. Focus on user-space optimization rather than kernel.
Linux Perf Performance Analysis Skill
This skill provides expert guidance for using Linux perf tool to profile and
optimize application performance. Focuses on practical, non-sudo workflows that
work in development and CI/CD environments.
Quick Reference Commands
Environment Setup
User-space profiling works without sudo (after one-time capability setup):
perf record -p <PID>- Record user-space samplesperf report -i perf.data --dsos <app>- Analyze user-space code
Kernel symbols require sudo (security mechanism, cannot bypass):
- Need sudo to modify:
kptr_restrictandperf_event_paranoid - Without sudo: Kernel shows as
[unknown]or hex addresses
# Check current permissions
cat /proc/sys/kernel/perf_event_paranoid # -1=unrestricted, 0-2=restricted, 3+=no kernel
cat /proc/sys/kernel/kptr_restrict # 0=full access, 1=restricted, 2=no access
# One-time capability setup (requires sudo once)
sudo setcap "cap_perfmon,cap_sys_admin,cap_sys_ptrace+ep" /usr/bin/perf
# Enable full kernel access (requires sudo OR run setup script)
sudo sysctl -w kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0
# Make permanent (requires sudo once)
echo "kernel.perf_event_paranoid = -1" | sudo tee -a /etc/sysctl.d/99-perf.conf
echo "kernel.kptr_restrict = 0" | sudo tee -a /etc/sysctl.d/99-perf.conf
# No external script needed - commands are included in this skill file!
Data Collection
# Record CPU samples with call graphs
perf record -p <PID> -g -e cycles -o perf.data -- sleep 30
# With custom sampling frequency (default is 4000Hz)
perf record -p <PID> -g -F 99 -o perf.data -- sleep 30
# Record specific functions
perf record -p <PID> -g -e 'cpu-cycles' -e 'instructions' -o perf.data -- sleep 30
Data Analysis
# Interactive report (best for exploration)
perf report -i perf.data -g graph
# Flat profile (top hotspots)
perf report --stdio -i perf.data -g none --sort symbol | head -50
# Call graph with depth limit
perf report --stdio -i perf.data -g graph --max-stack 20 | head -200
# Filter by module/library
perf report --stdio -i perf.data --dsos <app_name> -g graph
# Filter by specific function
perf report --stdio -i perf.data --symbol-filter 'tcp_sendmsg' -g graph
Data Export
# Export as text for analysis
perf script -i perf.data > perf_script.txt
# Export with statistics
perf report --stdio -i perf.data --show-total-period --show-nr-samples
Complete Workflow Phases
Phase 1: Environment Preparation
1.1 Permission Check
# Check current settings
cat /proc/sys/kernel/perf_event_paranoid
# -1 = unrestricted (best)
# 0-1 = partial kernel access
# 2 = kernel profiling blocked
# 3+ = very restricted
cat /proc/sys/kernel/kptr_restrict
# 0 = full kernel symbols (best)
# 1 = addresses hidden
# 2 = all hidden
1.2 Full Kernel Access Setup (Requires sudo - one time only)
Run this once to enable full kernel symbol support:
# One-time setup (you can copy-paste these commands yourself):
sudo setcap "cap_perfmon,cap_sys_admin,cap_sys_ptrace+ep" /usr/bin/perf
sudo sysctl -w kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0
echo "kernel.perf_event_paranoid = -1" | sudo tee /etc/sysctl.d/99-perf.conf > /dev/null
echo "kernel.kptr_restrict = 0" | sudo tee -a /etc/sysctl.d/99-perf.conf > /dev/null
# After this, perf works WITHOUT sudo for all future runs!
1.3 Without sudo - User-space Only Mode
If you cannot use sudo, you can still profile user-space code:
# Record
perf record -p <PID> -g -o perf.data -- sleep 30
# Analyze user-space only (filter to your app)
perf report --stdio -i perf.data --dsos yourapp -g graph
Trade-off: Kernel shows as [unknown] - but user-space profiling still works!
1.4 Compile with Debug Symbols
# CMake: RelWithDebInfo
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
# GCC/Clang: -g
gcc -g -O2 -o app app.c
# Rust: debug info
cargo build --release
# Edit Cargo.toml: [profile.release] debug = true
1.4 Performance Build Rules
- Always test without ASAN (2-5x overhead)
- Use RelWithDebInfo or -g (not -O0, not stripped)
- Verify symbols:
nm <app> | grep <function>orobjdump -t <app> | grep <function>
Phase 2: Data Collection
2.1 Basic Recording
# Find process PID
pidof <app_name>
# or
ps aux | grep <app_name>
# Start recording
perf record -p <PID> -g -o perf.data -- sleep 30
2.2 Recording During Load Test
# Terminal 1: Start perf recording
perf record -p <PID> -g -o perf.data -- sleep 30 &
# Terminal 2: Run load test
redis-benchmark -h 127.0.0.1 -p 6379 -t set,get -n 500000 -c 500 -q
# or wrk, ab, k6, etc.
# Wait for both to complete
wait
2.3 Advanced Recording Options
# Multiple events
perf record -p <PID> -g -e cycles,instructions,cache-misses -o perf.data -- sleep 30
# Custom frequency (99Hz is good for production, 997Hz for dev)
perf record -p <PID> -g -F 99 -o perf.data -- sleep 30
# With process name instead of PID
perf record -p $(pidof <app_name>) -g -o perf.data -- sleep 30
2.4 Key Parameters
| Parameter | Meaning | Typical Value |
|---|---|---|
-p <PID> |
Target process ID | $(pidof app) |
-g |
Record call graph | Always use |
-e cycles |
Sampling event | cycles, instructions, cache-misses |
-F 99 |
Sampling frequency (Hz) | 99 (prod), 997 (dev) |
-o perf.data |
Output file | perf.data |
-- sleep N |
Duration | 10-30 seconds |
Phase 3: Data Analysis
3.1 Interactive Analysis (Best First Step)
perf report -i perf.data -g graph
# Navigate: ↑/↓ (move), Enter (expand), + (collapse), q (quit)
3.2 Flat Profile (Top Hotspots)
# Top 50 functions by self time
perf report --stdio -i perf.data -g none --sort symbol | head -50
# With percentage and samples
perf report --stdio -i perf.data -g none -n --sort symbol | head -50
3.3 Call Graph Analysis
# Full call graph
perf report --stdio -i perf.data -g graph --max-stack 20 | head -200
# Trace specific function
perf report --stdio -i perf.data -g graph --symbol-filter 'tcp_sendmsg'
3.4 Module-Level Analysis
# By library/DSO
perf report --stdio -i perf.data --sort dso -g none
# User-space only (replace <app_name>)
perf report --stdio -i perf.data --dsos <app_name>,libc.so.6 -g graph
# Kernel-space only
perf report --stdio -i perf.data --dsos '[kernel]' -g graph
3.5 Statistical Reports
# With total period and sample counts
perf report --stdio -i perf.data --show-total-period --show-nr-samples
# Sorted by library then function
perf report --stdio -i perf.data --sort dso,symbol -g none
Phase 4: Bottleneck Diagnosis
4.1 Common Bottleneck Patterns
| Symptom | Likely Cause | Optimization Focus |
|---|---|---|
| High syscall % / entry_SYSCALL_64 | Frequent system calls | Reduce syscall count, batch operations |
| High futex_wait / futex_wake | Lock contention | Reduce lock scope, use lock-free algorithms |
| High tcp_sendmsg / __send | Small packets, frequent sends | Batching, buffering, zero-copy |
| High epoll_wait | I/O event processing | Event aggregation, async I/O |
| High memmove / memcpy | Memory copies | Reduce copies, use views/references |
| High operator new / malloc | Frequent allocation | Object pools, allocators, reuse |
| High schedule / __schedule | Context switches | Reduce thread count, affinity pinning |
| High strcmp / strlen | String operations | Use string views, pre-hash |
4.2 User vs Kernel Analysis
Rule: User space is the "cause", kernel space is the "effect"
# Check user vs kernel split
perf report --stdio -i perf.data --sort dso -g none | head -20
# If kernel is high (>50%), trace back to user-space calls
perf report --stdio -i perf.data -g graph --max-stack 30 | grep -A 5 -B 5 'kernel'
4.3 Hotspot Investigation Workflow
# 1. Identify top hotspot
perf report --stdio -i perf.data -g none --sort symbol | head -10
# 2. Check if it's your code (look for your namespace/prefix)
# 3. Get call stack for the hotspot
perf report --stdio -i perf.data -g graph --symbol-filter '<hotspot>'
# 4. Trace back to find the caller (your code)
# 5. Optimize the caller (reduce calls, batch, cache, etc.)
Phase 5: Comparative Analysis
5.1 ASAN vs Non-ASAN Comparison
# Record with ASAN
perf record -p <ASAN_PID> -g -o perf-asan.data -- sleep 15
# Record without ASAN
perf record -p <NORMAL_PID> -g -o perf-normal.data -- sleep 15
# Compare
perf diff perf-asan.data perf-normal.data
5.2 Load Comparison
# Record at different concurrency levels
for c in 100 500 1000; do
perf record -p $PID -g -o perf-$c.data -- sleep 10 &
redis-benchmark -c $c -t set,get -n 100000 -q
wait
done
# Compare
perf diff perf-100.data perf-500.data perf-1000.data
5.3 Optimization Validation
# Before optimization
perf record -p $PID_BEFORE -g -o before.data -- sleep 15
perf report --stdio -i before.data -g none > before.txt
# After optimization
perf record -p $PID_AFTER -g -o after.data -- sleep 15
perf report --stdio -i after.data -g none > after.txt
# Compare output manually or use diff
diff before.txt after.txt
Common Mistakes
Permission-Related Mistakes
- Running perf without setup → Always check
/proc/sys/kernel/perf_event_paranoidfirst - Using sudo for every command → Use
setcapone-time setup instead - Forgetting to chown perf.data → If you must use sudo, run
sudo chown $(whoami):$(whoami) perf.data
Data Collection Mistakes
- Sampling with ASAN enabled → ASAN adds 2-5x overhead, distorts results
- Using -O0 builds → Optimization changes hotspots, profile optimized code
- Stripped binaries → No symbols = no function names, useless data
- Too short recordings (< 5s) → Insufficient samples, statistical noise
- Recording idle processes → Record during actual load, not idle time
Analysis Mistakes
- Looking at Children% instead of Self% → Self% = actual CPU, Children% = callee cost
- Blaming kernel for slowness → Kernel overhead is symptom, optimize user-space callers
- Optimizing without measuring → Profile first, then optimize, then verify
- Focusing on micro-optimizations → Look for algorithmic/structural changes first
Analysis Best Practices
1. Always Use Self% for Prioritization
# Show self percentage
perf report --stdio -i perf.data -g none -n --sort symbol | head -20
# Prioritize functions with high Self% (direct CPU time)
# Ignore Children% (includes callees - misleading)
2. Follow the Call Chain
# Find a hotspot, then trace up the stack
perf report --stdio -i perf.data -g graph --max-stack 20 | less
# Use keyboard navigation:
# ↑/↓ : move cursor
# Enter : expand/collapse node
# + : expand all
# - : collapse all
# a : annotate source (if symbols available)
# q : quit
3. Focus on Your Code
# Filter to only show your application
perf report --stdio -i perf.data --dsos <your_app> -g graph
# Or filter out kernel/system
perf report --stdio -i perf.data --dsos='![kernel],[vdso]' -g graph
4. Validate with Multiple Runs
# Record 3 times to ensure consistency
for i in 1 2 3; do
perf record -p $PID -g -o run-$i.data -- sleep 10 &
load_test
wait
done
# Compare to ensure results are stable
Performance Tuning Checklist
Use this checklist when optimizing based on perf results:
- Profiled without ASAN/sanitizers
- Profiled optimized build (-O2/-O3/RelWithDebInfo)
- Debug symbols available (nm/objdump can find functions)
- Recording during actual load (not idle)
- Analyzed Self% (not Children%)
- Traced back to user-space caller
- Verified hotspot is reproducible
- Made code change
- Re-profiled to confirm improvement
- No regression in other metrics
Advanced Analysis Scenarios
CPU Cache Analysis
# Record cache events
perf record -p $PID -g -e cache-misses,cache-references -o cache.data -- sleep 15
# Analyze cache efficiency
perf report --stdio -i cache.data -g none --sort symbol | head -20
# Calculate cache miss rate
perf stat -e cache-misses,cache-references -p $PID -- sleep 10
Instruction Efficiency
# Record cycles vs instructions (IPC)
perf record -p $PID -g -e cycles,instructions -o ipc.data -- sleep 15
# Check IPC (Instructions Per Cycle)
perf report --stdio -i ipc.data -g none
# Live IPC measurement
perf stat -e cycles,instructions -p $PID -- sleep 10
# IPC > 2.0 is good, IPC < 1.0 indicates stalls
Context Switch Analysis
# Record context switches
perf record -p $PID -g -e context-switches,cs -o ctx.data -- sleep 15
# Analyze
perf report --stdio -i ctx.data -g none
Integration with Other Tools
Flame Graphs
# Generate flame graph
perf script -i perf.data | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
# Install tools
sudo apt-get install linux-tools-common linux-tools-generic
git clone https://github.com/brendangregg/FlameGraph
perf + GDB
# Find hotspot line
perf report --stdio -i perf.data -g graph | grep <function>
# Annotate source (needs debug symbols)
perf annotate -i perf.data <function>
# Or use perf report with 'a' key in interactive mode
perf + pprof
# Convert perf data to pprof format
perf script -i perf.data > perf.script
# Use with Go's pprof tool if available
FAQ
Q: Why do I need sudo for perf? A: Linux has two security mechanisms:
perf_event_paranoid: Controls event access (default: 2, restricted)kptr_restrict: Controls kernel symbol visibility (default: 1, restricted)setcaplets perf run without sudo for capability checks, but kernel symbols still need these settings modified.
Q: Can I profile without debug symbols?
A: Yes, but you'll only see addresses, not function names. Always compile with -g or RelWithDebInfo.
Q: Why does ASAN slow down my app so much? A: ASAN adds runtime checks that add 2-5x overhead. Profile non-ASAN builds for accurate performance data.
Q: Should I look at Self% or Children%? A: Always Self% - it's the actual CPU time spent in that function. Children% includes callee costs.
Q: Kernel shows as [unknown] or hex addresses, why? A: This is the expected behavior without sudo. You have two options:
- No sudo: Filter to user-space only (
--dsos <app>) - good for app-specific optimization - With sudo:
sudo sysctl -w kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0- shows kernel symbols
Q: My app is 80% in kernel, what do I do?
A: Don't optimize kernel directly. Filter user-space (--dsos <app>) and trace back to find which user-space function is making all the syscalls, optimize that caller instead.
Q: How long should I record? A: 10-30 seconds for development, 60+ seconds for production-like loads. Longer is better for statistical significance.
Q: Can I record multiple processes?
A: Yes: perf record -p <PID1>,<PID2>,<PID3> -g -o perf.data -- sleep 30
Q: What if perf is not installed?
A: Install with: sudo apt-get install linux-tools-common linux-tools-$(uname -r) (Ubuntu/Debian)
Troubleshooting
perf: Permission denied
# Check permissions
cat /proc/sys/kernel/perf_event_paranoid
# If 1 or higher, use setcap
sudo setcap "cap_perfmon,cap_sys_admin,cap_sys_ptrace+ep" /usr/bin/perf
perf: No symbols found
# Check if binary is stripped
file <app>
# If "stripped", rebuild without strip
# Check for debug symbols
nm <app> | grep <function>
# If empty, compile with -g
perf: Operation not permitted
# Check capabilities
getcap /usr/bin/perf
# If empty, set capabilities
sudo setcap "cap_perfmon,cap_sys_admin,cap_sys_ptrace+ep" /usr/bin/perf
perf.data: Permission denied
# If you used sudo to record
sudo chown $(whoami):$(whoami) perf.data
Summary: The 5-Step Workflow
- Setup: Check permissions, use
setcap, compile with-g - Record:
perf record -p $PID -g -o perf.data -- sleep 30 - Analyze:
perf report -i perf.data -g graph(interactive first) - Diagnose: Focus on Self%, trace to user-space caller
- Validate: Re-profile after optimization to confirm improvement