name: analyze-nsys-profile
description: Query a captured PithTrain Nsight Systems profile to measure compute/communication overlap, locate exposed comm by DualPipeV stage, and inspect per-rank stream behavior. Use when the user asks to "analyze an nsys profile", "check overlap quality", "find exposed comm", "which stage is the bottleneck", or any question that starts from an existing .nsys-rep file. Assumes the trace was already captured (see capture-nsys-profile); provides query primitives the agent composes for the specific question being asked.
Analyze Nsys Profile
A passive query toolkit for PithTrain nsys traces. The agent asks a specific question; the skill provides primitives that answer it fast and correctly. The skill does not produce an unsolicited full report. It expects the agent to compose the right query for the question being asked.
Prerequisites
- A captured
.nsys-repexists (default location:workspace/capture-nsys-profile/pithtrain_node*.nsys-rep). - The repo venv is active:
source .venv/bin/activate. nsysCLI onPATH(for the one-time SQLite export).
Step 1 — Export the trace to SQLite
nsys export --type=sqlite --force-overwrite=true --output=workspace/capture-nsys-profile/pithtrain_node0.sqlite workspace/capture-nsys-profile/pithtrain_node0.nsys-rep
All subsequent queries hit the SQLite, not the raw .nsys-rep.
Step 2 — Shared preparation (always run first)
Three primitives establish who, when, and what — every downstream analysis depends on the data they surface. Run (or at least understand the output of) all three before reaching for the analysis scripts below.
| Question | Primitive |
|---|---|
| What ranks are in this trace? What's the per-rank setup? | show_setup.py |
| What's the steady-state analysis window for each rank? | find_window.py |
| Which streams are compute / comm, and what's each comm stream's purpose? | classify_streams.py |
Pipeline: show_setup → find_window → classify_streams. show_setup gives you the mapping pid ↔ rank ↔ mesh coordinates; find_window picks the median DualPipeV chunk per rank (deterministic across re-runs, so before/after comparisons are valid); classify_streams identifies which CUDA streams in that window are compute vs comm, and labels the comm streams' purpose (ep_a2a, cp_ring, pp_p2p).
Step 3 — Measure overlap per DualPipeV stage
python .agents/skills/analyze-nsys-profile/scripts/compute_overlap.py workspace/capture-nsys-profile/pithtrain_node0.sqlite
Emits one row per (rank, stage) with columns: pid | stage | exposed_ns | overlap | overlap_min | overlap_max. The overlap column is the time-weighted hidden fraction across the stage's comm kernels; overlap_min / overlap_max are the extremes of the per-kernel overlap percentage and surface whether the stage is uniformly bad or bimodal.
See references/examples.md for recipes that compose this with the Step 2 primitives.
Critical conventions
Before composing a custom SQL query, read references/conventions.md. Highlights:
pid(Linux PID) is the per-rank join key, extracted exactly as the nsys docs prescribe:globalPid / 0x1000000 % 0x1000000(kernel rows) ==globalTid / 0x1000000 % 0x1000000(NVTX rows). Single SQLite per node → PIDs unique within a trace.- Always filter
start >= 0— pre-cudaProfilerStartNCCL init ranges have negative timestamps. - Per-rank setup label is the first in-window NVTX event per rank:
rank=N; pp=R/S dp=R/S cp=R/S ep=R/S; mbs=M seq=Q. - Chunk anchor for steady state: the median-indexed
forward chunk X (phaseY) backward chunk Z (phaseW)NVTX range emitted by DualPipeV. Match withLIKE 'forward chunk%backward chunk%'to disambiguate from per-stage forward markers. - Compute-vs-comm classification: a kernel is communication if its short name starts with
nccl, otherwise compute. A stream is a comm stream iff every one of its kernels is NCCL. - Comm-stream purpose is discovered from the PithTrain stage-NVTX (
layer*.stageN_*) enclosing each kernel at its CPU-side launch time — every kernel must agree on the label (unanimity), otherwisemixed.
Non-fragile classification rules
Avoid these heuristics — they break across configs:
- "Stream with > N kernels of type X is comm" (N depends on layer count, chunks, seq length).
- "Kernel duration > T µs means data movement" (long duration can also be a straggler wait).
- "Stream with avg µs < threshold is EP" (depends on token volume per rank).
Use these instead:
- One-sided purity check for compute-vs-comm streams.
- NVTX-context labeling for stream purpose: look up the innermost PithTrain stage range enclosing each kernel and require unanimous agreement. Implemented in
classify_streams.py.
Worked examples
See references/examples.md for recipe-style answers to:
- How well is each EP phase overlapped with compute?
- Which EP phase has the worst overlap?
- Are the PP stages balanced?
- Which (rank, stage) carries the most exposed comm?
Output guidance
- Scripts emit a fixed-width table to stdout. One column per record field; agents read it directly.
- Cross-script joins are by
pid— every script's table includes pid as the per-rank identifier; downstream rows compose againstshow_setup's mappingpid ↔ rank ↔ setup. - When reporting to the human user, summarize as plain prose with a small table extracted from the relevant columns.
- Always cite the analysis window. An overlap percentage with no window is meaningless.
Gotchas (surfaced by prior agent runs)
classify_streams.pyonly reports streams active in the analysis window, not every stream that exists in the trace. A rank typically has 6-8 streams overall but only 2-3 inside a single steady-state chunk. This is intentional — analyzing a small window does not need the inactive streams.- PP P2P kernels rarely appear in a single chunk window — they fire between chunks. Widen the window (
--start NS --end NSon the analysis script) if you specifically want to see the PP P2P comm stream. compute_overlap.py's percent cells include a trailing%(58.4%, not0.584). Sort/compare numerically by stripping the%first. Absolute time columns (exposed_ns) are bare integer nanoseconds.- CPU launch time vs GPU execution time — for any NVTX-context lookup on a kernel, use
kernel["launch_start"](CPU-sidecudaLaunchKerneltime) rather thankernel["start"](GPU-side execution time). All scripts already do this; if you write an ad-hoc query, callcommon.innermost_nvtxon launch_start values. - Comm-stream purpose uses unanimity, not majority — every kernel on the stream must agree on its enclosing-NVTX category, otherwise the label is
mixed. A single mis-categorized kernel surfaces asmixedinstead of silently being out-voted.
Common Issues
no such table: NVTX_EVENTS
The .nsys-rep has not been exported yet. Run the nsys export command from Step 1.
PP P2P comm is missing from the overlap output
By design. compute_overlap.py buckets kernels by their enclosing PithTrain stage NVTX (stage1_* through stage5_*); PP P2P kernels live inside the pipeline send/recv wrapper, which is not a stage marker, so they are filtered out. Widen the window with --start NS --end NS if you need to investigate them — they typically fire between chunks, not inside.
Negative timestamps on NVTX events
NCCL init opened these ranges before cudaProfilerStart. Filter WHERE start >= 0 to scope to the profiled window.