ncu-report-skill - SKILL.md Agent Skill

name: ncu-report-skill description: Profile CUDA kernels with Nsight Compute on B200 / sm_100. Use when the user asks to profile a kernel, analyze its performance, diagnose bottlenecks, read an ncu report, or write an optimization plan — including variants in Chinese ("profile 一下", "为什么慢", "ncu 报告").

Skill: CUDA Kernel Profiling (B200 / Nsight Compute)

When to use: user asks to profile a CUDA kernel, analyze its performance, find its bottlenecks, or write an optimization plan based on Nsight Compute data. Triggers include: "profile X", "为什么这个 kernel 慢", "ncu report 说...", "下一步怎么优化", "帮我看一下这份 ncu 报告".

Target hardware (this repo): NVIDIA B200 (sm_100, CC 10.0, 148 SMs, 192 GB HBM3e). Most advice below is generic; B200-specific notes are explicitly marked.

Golden rule

Profile → Diagnose → Plan, in that order. Never guess.

Most under-performing CUDA kernels are under-performing for exactly one reason that ncu can tell you in 10 seconds. Don't invent hypotheses before you have the report. Don't start coding a fix before you've matched the observed pattern to a known diagnosis. Don't write a wall of suggestions — rank them by evidence and expected impact.

Quickstart (what to do when someone says "profile this kernel")

Create a new run directory first under profile/<run_name>/ at the repo root — one directory per run, never reuse an existing one. Each run contains its own harness/, reports/, analysis/, and REPORT.md. This rule is mandatory in this repo. See reference/00-directory-layout.md.
Decide what you're profiling. What inputs? Which dispatch path? What question do you want answered? If the kernel takes variable-sized inputs (variable seq lengths, variable batch sizes), you must pick specific representative shapes from the user's workload — don't profile with arbitrary inputs.
Build a standalone harness unless the user is profiling through their existing binary. Harnesses compile in seconds, run the kernel in isolation, and let you use -lineinfo cleanly so ncu can map SASS back to source. Compile into profile/<run_name>/harness/. See reference/02-harness-guide.md and the template in helpers/harness_template.cu.
Run two profiles: --set full (with PmSampling sections) for the overview, and --set source --section SourceCounters for per-line stall attribution. Write outputs to profile/<run_name>/reports/. See reference/03-collection.md.
Parse with ncu_report Python module — not by eye-balling the CLI. Write analysis outputs to profile/<run_name>/analysis/. Use the helpers in helpers/. See reference/04-python-api.md.
Work through the six analysis dimensions. See reference/05-analysis-dimensions.md. Every one matters, but on any given kernel only 1–2 will dominate.
Match patterns to the diagnosis playbook. See reference/06-diagnosis-playbook.md. It maps NCU signal → likely cause → concrete fix, with example counts for "how big is this".
Write the report at profile/<run_name>/REPORT.md with evidence-backed recommendations, ranked by expected impact. See reference/07-report-template.md.

File index

Reference docs (read these when you need details)

File	Purpose
`reference/00-directory-layout.md`	Read first. Directory / naming conventions — one run = one subdirectory, no cross-contamination
`reference/01-workflow.md`	End-to-end checklist from "user request" to "final report"
`reference/02-harness-guide.md`	When and how to build a standalone harness (mandatory for TVM-FFI, PyTorch kernels, JIT-compiled code)
`reference/03-collection.md`	ncu command recipes: full, source-level, PM sampling, custom sections
`reference/04-python-api.md`	`ncu_report` Python API patterns with copy-pasteable code
`reference/05-analysis-dimensions.md`	Six analysis dimensions: occupancy, balance, stalls, tensor core, timeline, memory
`reference/06-diagnosis-playbook.md`	Pattern → diagnosis → fix. Merges Blackwell programming principles with NCU signals
`reference/07-report-template.md`	How to structure the final report
`reference/08-b200-metric-names.md`	sm_100 metric names vs older GPUs — many common names are different
`reference/09-common-issues.md`	Permissions, PM sampling gaps, TVM-FFI / PyTorch gotchas

Helpers (reusable code)

File	Purpose
`helpers/harness_template.cu`	Standalone harness template — paste your kernel, fill in input allocation, done
`helpers/safetensors_loader.h`	Header-only safetensors reader (no external deps) for loading real workload tensors
`helpers/analyze_reports.py`	Extract key metrics, produce side-by-side comparisons
`helpers/extract_stall_hotspots.py`	Per-line stall aggregation via `action.source_info(pc)`
`helpers/plot_timeline.py`	ASCII PM-sampling timeline plotter — makes tail effect visible
`helpers/list_flashinfer_workloads.py`	Browse a flashinfer-trace dataset — shape histograms, filter by axis, resolve safetensors paths for specific UUIDs
`helpers/ncu_utils.py`	Shared Python helpers: safe metric access, per-instance extraction, report loading

Critical lessons (don't skip)

The stock ncu_profile_skill.md metric names don't all work on B200. Names like smsp__inst_executed_op_global_ld.sum, dram__bytes.sum, l1tex__average_t_sectors_per_request*.ratio return None on sm_100. Use the sm_100 names in reference/08-b200-metric-names.md or enumerate via action.metric_names().
Always compile with -lineinfo. Without it, ncu's source view is blank and you cannot do per-line stall analysis. If you can't add -lineinfo to the build system (TVM-FFI, PyTorch inline, JIT), build a standalone harness — that's the whole point.
PM sampling is the only way to see tail effects. Static metrics average over the whole kernel; only the time-series (either pmsampling: metrics or the ASCII plotter in helpers/) shows the shape of utilization over time.
Load-imbalance on variable-length inputs is often the #1 bottleneck. If the user's workload has sequences of varying length, per-SM active-cycle variance will often dwarf every other effect. Always check the input distribution.
NCU's rule engine (--page details) already does half the work. Each rule comes with Est. Speedup: X%. Read them first — they often point straight at the answer.
Don't delegate understanding. Run the profiles yourself, open the reports, cite specific metric values. Never write "the profile shows it's memory-bound" — instead, name the two or three metric values that back your conclusion (e.g., "dram__bytes_read.sum.pct_of_peak_sustained_elapsed well under 10%, and long_scoreboard stalls dominate the pcsamp histogram, so the kernel is latency-bound on L1, not DRAM-bandwidth-bound"). Fill in the actual numbers from your report. Specificity is the deliverable.

Related skills

blackwell-cuda-programming.md — Blackwell-specific programming principles and checklists, preserved as a companion reference. Use it when proposing new kernel designs; use this skill when diagnosing existing kernels.