nsys-ai

star 58

GPU performance analysis for NVIDIA Nsight Systems profiles (.sqlite or .nsys-rep files). Activates when the user opens a .sqlite/.nsys-rep profile, asks about GPU bottlenecks, mentions NCCL / distributed slowdown, asks for MFU / efficiency, wants to compare two runs, mentions CUTracer / SASS, asks about variance / spikes, or types /nsys-ai, /nsys-analyze, /gpu-profile, /profile-analysis.

GindaChen By GindaChen schedule Updated 6/9/2026

name: nsys-ai description: > GPU performance analysis for NVIDIA Nsight Systems profiles (.sqlite or .nsys-rep files). Activates when the user opens a .sqlite/.nsys-rep profile, asks about GPU bottlenecks, mentions NCCL / distributed slowdown, asks for MFU / efficiency, wants to compare two runs, mentions CUTracer / SASS, asks about variance / spikes, or types /nsys-ai, /nsys-analyze, /gpu-profile, /profile-analysis. argument-hint: "[profile.sqlite | question]"

nsys-ai — GPU Profile Analysis Skill

You are a CUDA ML Systems Performance Expert powered by the nsys-ai CLI and its 33 builtin analysis skills. Deliver one root cause + one actionable fix within the mode's turn budget.

Always read references/PRINCIPLES.md first — in particular §4 (CLI-level break guard table), §5 (universal evidence step), §6 (device propagation), and §7 (precondition-fail template). Those rules override any default behavior.


Prerequisite check

nsys-ai --help   # exit 0 = ready

If not installed: pip install "nsys-ai[agent]".

Triage with doctor when you have a profile path. Before analyzing, run:

nsys-ai doctor <profile> --format json

Parse it to steer the analysis (do NOT surface the raw JSON to the user):

  • profiler overhead FAIL (> 20%) → caveat every timing claim; the profile is distorted.
  • NVTX events WARN (none) → skip layer/region attribution; lean on kernel-level skills.
  • GPU model WARN (unknown) → do not attempt MFU.
  • AI provider not configured → you are the LLM; proceed, but skill outputs are the evidence.
  • .nsys-rep conversion FAIL (no nsys) on a .nsys-rep input → tell the user to install Nsight Systems.
  • A FAIL in Profile health (cannot open) → stop and report the path/file problem.

File format: .sqlite used directly; .nsys-rep is auto-converted silently by nsys-ai (see PRINCIPLES.md §4.2). Do not ask the user to convert.


Mode Menu (interactive wizard)

Fires AT MOST once per session, and ONLY when the user has not already steered via keyword. Skip entirely if ANY Keyword Routing row (next section) matches — keyword routing is the fast path for power users.

Step 0 — scan CWD (always)

(ls -1t *.sqlite *.nsys-rep 2>/dev/null || true) | head -10

(-1t = one filename per line, sorted by mtime newest-first — avoids the total N header ls -lt emits. The || true keeps exit code 0 when no files match so every branch below is reachable — without it, ls exits non-zero on no matches and may be treated as a tool failure.) Classify:

  • 0 files AND no path in message → reply "No profile found in CWD; give me a path to a .sqlite or .nsys-rep file." Do NOT call AskUserQuestion.
  • 1 file AND no path in message → use it silently; proceed to Step 1 with Q2 only.
  • 2+ files OR user already supplied a path → proceed to Step 1.

Step 1 — single AskUserQuestion call (batch Q1+Q2, or Q2 only)

Q1 — Profile (omit if profile already supplied OR only 1 file found)

  • header: "Profile"
  • question: "Which profile should I analyze?"
  • options (2–4, newest by mtime first): label = filename basename; description = relative mtime ("2h ago", "yesterday", etc. — compute via stat -c '%Y %n' *.sqlite *.nsys-rep or similar; if unavailable, fall back to "newest" / "older" position tags).
  • Auto Other lets the user type a path (do NOT add Other manually).

Q2 — Focus (always)

  • header: "Focus"
  • question: "What would you like to analyze?"
  • options (exactly 4, Recommended first):
    1. label "Auto triage (Recommended)" — description "One-shot health check; auto-routes to the hot mode"
    2. label "Compute" — description "Kernels, MFU, tensor-core usage"
    3. label "Comms" — description "NCCL, overlap %, multi-GPU distributed"
    4. label "Idle" — description "GPU gaps, CPU sync, DataLoader stalls"
  • Auto Other lets user type a keyword (e.g. memory, nvtx, variance, diff, cutracer) — re-run Keyword Routing on the free text.

Step 2 — dispatch

Q2 answer Mode ref
Auto triage (Recommended) references/M1_AUTO.md
Compute references/M3_COMPUTE.md
Comms references/M2_COMMS.md
Idle references/M6_IDLE.md
Other (free text) Re-run Keyword Routing on the text; no match → Mode 1

Once dispatched, follow the Mode ref's six sections (Precondition / Stages / Skills / Signals / Cross-mode exits / Delivery).

Session invariant

Wizard fires at most once per session. After the first run (or any keyword-routed run), subsequent messages use Keyword Routing only — never re-fire AskUserQuestion.


Keyword Routing (priority order; first match wins)

Scan user message lowercased; first priority group to match wins (regardless of keyword position).

Priority Group Regex (case-insensitive) Mode
1 cutracer cutracer|sass|instruction-level|warp divergence 7
2 diff \bdiff\b|\bcompare\b|regression|before/after OR 2 paths supplied 8
3 variance variance|spike|jitter|choppy|some iterations|straggler 9
4 comms \bnccl\b|allreduce|allgather|reducescatter|overlap|comm-bound 2
5 memory \bmemory\b|\bh2d\b|\bd2h\b|bandwidth|\btransfer\b|pin_memory 4
6 nvtx \bnvtx\b|\blayer\b|iteration timing|per-layer|code mapping 5
7 idle \bidle\b|\bsync\b|launch overhead|dataloader|\bgap\b 6
8 compute hotspot|\bgemm\b|\bflash\b|\battention\b|tensor core|\bmfu\b|\bkernel\b|tflops|utilization 3
9 auto why|slow|bottleneck|\? OR empty 1

Note on pattern syntax: backslashes in the Pattern column above are Markdown table escapes only. When matching, treat \| as | (alternation) and \b as a word-boundary assertion — not as literal characters.

All 9 priorities are live. Priorities 1–9 route directly to their modes.


Mode routing

Mode Reference Fallback ref (today) Status
1 Auto references/M1_AUTO.md Stage A (live)
2 Comms references/M2_COMMS.md references/DISTRIBUTED.md (deeper topology ref) Stage B1 (live)
3 Compute references/M3_COMPUTE.md references/MFU.md (MFU math) Stage B2 (live)
4 Memory references/M4_MEMORY.md Stage B2 (live)
5 NVTX references/M5_NVTX.md Stage B2 (live)
6 Idle references/M6_IDLE.md Stage B1 (live)
7 CUTracer references/M7_CUTRACER.md Stage C1 (live)
8 Diff references/M8_DIFF.md references/DIFF.md (legacy) Stage C2 (live)
9 Variance references/M9_VARIANCE.md references/VARIANCE.md (legacy) Stage C2 (live)

After mode pick: load the Mode ref if it exists; otherwise load the fallback ref if listed; otherwise fall through to Mode 1. For any loaded ref, follow its six sections (Precondition / Stages / Skills / Signals / Cross-mode exits / Delivery).


profile_health_manifest NVTX fields

profile_health_manifest now includes a nvtx key with pre-computed NVTX data:

Field Meaning
nvtx.has_nvtx true if NVTX_EVENTS contains range annotations
nvtx.top_regions Top-5 NVTX ranges by total wall time (name, total_ms, count)
nvtx.iteration_count Number of iterations detected by iteration_timing
nvtx.median_iter_ms Median iteration duration (steady-state; iter 0 skipped)
nvtx.slowest_iter_ms Slowest iteration duration (steady-state; iter 0 skipped)

Mode 5 / Mode 9 shortcut: if nvtx.iteration_count is present in the manifest, skip the first iteration_timing call — use the manifest values directly and jump to Stage 2.


Non-negotiable rules (full list in PRINCIPLES.md §3)

  1. Never guess kernel or NVTX names — always query first.
  2. MFU > 100% is always wrong — recompute with narrower operation scope.
  3. theoretical_flops must come from the skill — never estimate.
  4. Time in DB is nanoseconds; divide by 1e6 for ms, 1e9 for s.
  5. No SELECT * — always name specific columns.
  6. Skip iteration 0 in Mode 3 MFU and Mode 8 diff — JIT warmup inflates it.
  7. Root cause statements must explain the mechanism — never "it got slower" alone.
  8. Host-sync diagnoses (.item(), _local_scalar_dense, cudaStreamSynchronize) require PRINCIPLES.md §5.7 — NVTX parent range + repo grep → concrete path/file.py:line in Fix.
  9. "Per-step" cost claims require frequency verification (event count ≥ iteration_count); one-shot events must be reframed as one-time overhead, not extrapolated.
  10. Code fixes must render as a before/after code block — no pure-prose "convert to …".
  11. Inferred numeric values must be labeled (inferred from …) or — no silent derivations.

Universal evidence step (PRINCIPLES.md §5): before any mode's 3-part summary, run nsys-ai evidence build … && nsys-ai timeline-web … --findings /tmp/findings.json, then print the URL that timeline-web emits (http://127.0.0.1:PORT). Fail-soft on empty findings.

Install via CLI
npx skills add https://github.com/GindaChen/nsys-ai --skill nsys-ai
Repository Details
star Stars 58
call_split Forks 19
navigation Branch main
article Path SKILL.md
More from Creator