name: ncu-report-skill
description: Profile CUDA kernels with Nsight Compute on B200 / sm_100. Use when the user asks to profile a kernel, analyze its performance, diagnose bottlenecks, read an ncu report, or write an optimization plan, including variants in Chinese ("profile 一下", "为什么慢", "ncu 报告").
Skill: CUDA Kernel Profiling (B200 / Nsight Compute)
When to use: user asks to profile a CUDA kernel, analyze its performance, find its bottlenecks, or write an optimization plan based on Nsight Compute data. Triggers include: "profile X", "为什么这个 kernel 慢" ("why is this kernel slow"), "ncu report 说..." ("the ncu report says..."), "下一步怎么优化" ("how do I optimize next"), "帮我看一下这份 ncu 报告" ("take a look at this ncu report for me").
Target hardware (this repo): NVIDIA B200 (sm_100, CC 10.0, 148 SMs, 192 GB HBM3e). Most advice below is generic; B200-specific notes are explicitly marked.
Golden rule
Profile → Diagnose → Plan, in that order. Never guess.
Most under-performing CUDA kernels are under-performing for exactly one reason that ncu can tell you in 10 seconds. Don't invent hypotheses before you have the report. Don't start coding a fix before you've matched the observed pattern to a known diagnosis. Don't write a wall of suggestions — rank them by evidence and expected impact.
Quickstart (what to do when someone says "profile this kernel")
1. Create a new run directory first under `profile/<run_name>/` at the repo root: one directory per run, never reuse an existing one. Each run contains its own `harness/`, `reports/`, `analysis/`, and `REPORT.md`. This rule is mandatory in this repo. See `reference/00-directory-layout.md`.
2. Decide what you're profiling. What inputs? Which dispatch path? What question do you want answered? If the kernel takes variable-sized inputs (variable seq lengths, variable batch sizes), you must pick specific representative shapes from the user's workload. Don't profile with arbitrary inputs.
3. Build a standalone harness unless the user is profiling through their existing binary. Harnesses compile in seconds, run the kernel in isolation, and let you use `-lineinfo` cleanly so ncu can map SASS back to source. Compile into `profile/<run_name>/harness/`. See `reference/02-harness-guide.md` and the template in `helpers/harness_template.cu`.
4. Run two profiles: `--set full` (with `PmSampling` sections) for the overview, and `--set source --section SourceCounters` for per-line stall attribution. Write outputs to `profile/<run_name>/reports/`. See `reference/03-collection.md`.
5. Parse with the `ncu_report` Python module, not by eyeballing the CLI. Write analysis outputs to `profile/<run_name>/analysis/`. Use the helpers in `helpers/`. See `reference/04-python-api.md`.
6. Work through the six analysis dimensions. See `reference/05-analysis-dimensions.md`. Every one matters, but on any given kernel only 1–2 will dominate.
7. Match patterns to the diagnosis playbook. See `reference/06-diagnosis-playbook.md`. It maps NCU signal → likely cause → concrete fix, with example counts for "how big is this".
8. Write the report at `profile/<run_name>/REPORT.md` with evidence-backed recommendations, ranked by expected impact. See `reference/07-report-template.md`.
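Steps 1 and 4 can be scripted. The following is a minimal sketch, not the repo's actual tooling: the function names, the harness binary path, and the report base names `full` and `source` are hypothetical. The ncu flags are the ones named in the steps above, plus `-o` for the output file.

```python
from pathlib import Path
from typing import List

# Subdirectories every run owns, per the directory rule above.
SUBDIRS = ("harness", "reports", "analysis")


def create_run_dir(repo_root, run_name: str) -> Path:
    """Create profile/<run_name>/ with its subdirectories; never reuse a run."""
    run_dir = Path(repo_root) / "profile" / run_name
    # exist_ok=False makes mkdir raise FileExistsError if the run already exists.
    run_dir.mkdir(parents=True, exist_ok=False)
    for sub in SUBDIRS:
        (run_dir / sub).mkdir()
    (run_dir / "REPORT.md").touch()
    return run_dir


def ncu_commands(run_dir: Path, harness_bin: str) -> List[List[str]]:
    """Build the two ncu invocations as argv lists (hypothetical output names)."""
    reports = Path(run_dir) / "reports"
    overview = ["ncu", "--set", "full",
                "-o", str(reports / "full"), harness_bin]
    per_line = ["ncu", "--set", "source", "--section", "SourceCounters",
                "-o", str(reports / "source"), harness_bin]
    return [overview, per_line]
```

The argv lists can be passed to `subprocess.run(cmd, check=True)` one at a time. Letting `mkdir` fail on an existing directory turns the "never reuse" rule into a hard error rather than a convention.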
File index
Reference docs (read these when you need details)
| File | Purpose |
|---|---|
| `reference/00-directory-layout.md` | Read first. Directory / naming conventions: one run = one subdirectory, no cross-contamination |
| `reference/01-workflow.md` | End-to-end checklist from "user request" to "final report" |
| `reference/02-harness-guide.md` | When and how to build a standalone harness (mandatory for TVM-FFI, PyTorch kernels, JIT-compiled code) |
| `reference/03-collection.md` | ncu command recipes: full, source-level, PM sampling, custom sections |
| `reference/04-python-api.md` | `ncu_report` Python API patterns with copy-pasteable code |
| `reference/05-analysis-dimensions.md` | Six analysis dimensions: occupancy, balance, stalls, tensor core, timeline, memory |
| `reference/06-diagnosis-playbook.md` | Pattern → diagnosis → fix. Merges Blackwell programming principles with NCU signals |
| `reference/07-report-template.md` | How to structure the final report |
| `reference/08-b200-metric-names.md` | sm_100 metric names vs older GPUs; many common names are different |
| `reference/09-common-issues.md` | Permissions, PM sampling gaps, TVM-FFI / PyTorch gotchas |
Helpers (reusable code)
| File | Purpose |
|---|---|
| `helpers/harness_template.cu` | Standalone harness template: paste your kernel, fill in input allocation, done |
| `helpers/safetensors_loader.h` | Header-only safetensors reader (no external deps) for loading real workload tensors |
| `helpers/analyze_reports.py` | Extract key metrics, produce side-by-side comparisons |
| `helpers/extract_stall_hotspots.py` | Per-line stall aggregation via `action.source_info(pc)` |
| `helpers/plot_timeline.py` | ASCII PM-sampling timeline plotter; makes the tail effect visible |
| `helpers/list_flashinfer_workloads.py` | Browse a flashinfer-trace dataset: shape histograms, filter by axis, resolve safetensors paths for specific UUIDs |
| `helpers/ncu_utils.py` | Shared Python helpers: safe metric access, per-instance extraction, report loading |
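To illustrate what "safe metric access" can mean in practice, here is a minimal sketch. It is not the contents of `helpers/ncu_utils.py`; the function names `safe_metric` and `find_metrics` are hypothetical. It relies only on the `ncu_report` action methods `metric_by_name()` and `metric_names()` and the metric method `as_double()`, and is written against any object exposing those methods.

```python
def safe_metric(action, name, default=None):
    """Return a metric's value as a float, or `default` if the name is unavailable."""
    metric = action.metric_by_name(name)
    if metric is None:
        # Unknown or uncollected metric names come back as None.
        return default
    return metric.as_double()


def find_metrics(action, substring):
    """List collected metric names containing `substring`.

    Useful for finding the right spelling when a familiar name returns None.
    """
    return sorted(n for n in action.metric_names() if substring in n)
```

With a real report, the action would typically come from `ncu_report.load_report(path).range_by_idx(0).action_by_idx(0)`. When `safe_metric` returns the default, calling `find_metrics(action, "dram__bytes")` shows which related names the report actually contains.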
Critical lessons (don't skip)
1. The stock `ncu_profile_skill.md` metric names don't all work on B200. Names like `smsp__inst_executed_op_global_ld.sum`, `dram__bytes.sum`, `l1tex__average_t_sectors_per_request*.ratio` return `None` on sm_100. Use the sm_100 names in `reference/08-b200-metric-names.md` or enumerate via `action.metric_names()`.
2. Always compile with `-lineinfo`. Without it, ncu's source view is blank and you cannot do per-line stall analysis. If you can't add `-lineinfo` to the build system (TVM-FFI, PyTorch inline, JIT), build a standalone harness. That's the whole point.
3. PM sampling is the only way to see tail effects. Static metrics average over the whole kernel; only the time-series (either `pmsampling:` metrics or the ASCII plotter in `helpers/`) shows the shape of utilization over time.
4. Load-imbalance on variable-length inputs is often the #1 bottleneck. If the user's workload has sequences of varying length, per-SM active-cycle variance will often dwarf every other effect. Always check the input distribution.
5. NCU's rule engine (`--page details`) already does half the work. Each rule comes with `Est. Speedup: X%`. Read them first; they often point straight at the answer.
6. Don't delegate understanding. Run the profiles yourself, open the reports, cite specific metric values. Never write "the profile shows it's memory-bound". Instead, name the two or three metric values that back your conclusion (e.g., "`dram__bytes_read.sum.pct_of_peak_sustained_elapsed` well under 10%, and `long_scoreboard` stalls dominate the pcsamp histogram, so the kernel is latency-bound on L1, not DRAM-bandwidth-bound"). Fill in the actual numbers from your report. Specificity is the deliverable.
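The load-imbalance lesson can be turned into a number you can cite. Below is a minimal sketch that summarizes per-SM active cycles once you have them as a plain list. The function name and the two derived ratios are illustrative choices, not ncu-defined metrics, and the idle estimate assumes the kernel's duration is set by its busiest SM.

```python
from statistics import mean


def imbalance_summary(per_sm_active_cycles):
    """Summarize how unevenly work was spread across SMs."""
    cycles = [float(c) for c in per_sm_active_cycles]
    if not cycles:
        raise ValueError("need at least one per-SM value")
    avg = mean(cycles)
    peak = max(cycles)
    return {
        "min": min(cycles),
        "mean": avg,
        "max": peak,
        # 1.0 means perfectly balanced; larger means one SM carries far more work.
        "max_over_mean": peak / avg if avg else float("inf"),
        # Share of SM-time spent idle while waiting for the busiest SM
        # (assumption: the busiest SM determines kernel duration).
        "idle_fraction": 1.0 - avg / peak if peak else 0.0,
    }
```

For example, three SMs at 100 cycles and one at 400 give a `max_over_mean` of about 2.29 and an `idle_fraction` of 0.5625, the kind of specific figure the report should quote alongside the input length distribution.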
Related skills
`blackwell-cuda-programming.md`: Blackwell-specific programming principles and checklists, preserved as a companion reference. Use it when proposing new kernel designs; use this skill when diagnosing existing kernels.