name: ROCmKernelWiki
description: Use when the user asks about optimizing AMD Instinct / Radeon GPU kernels — MI300 (gfx942/CDNA3), MI350/MI355X (gfx950/CDNA4), or RDNA4 (gfx1201) — MFMA/matrix-core programming, LDS bank conflicts, direct-to-LDS async copy, s_waitcnt pipelining, FP8/FP6/FP4/MXFP block scaling, wave64 reductions, GEMM/FlashAttention/fused-MoE/paged-attention kernels, or Composable-Kernel/CK-tile/FlyDSL/Triton-AMD/rocWMMA/HIP/GCN-assembly. Also for concrete merged-PR references from ROCm/composable_kernel, aiter, hipBLASLt, Tensile, rocBLAS, flash-attention, FlyDSL, and triton. Do NOT use for generic CUDA/NVIDIA-only questions, host-side framework integration, or ROCm install/driver troubleshooting.
argument-hint: "[natural-language-question] | [--tag mfma --type kernel] | [page-id]"
allowed-tools: "Bash Read Grep Glob"
ROCmKernelWiki — AMD CDNA / RDNA Kernel Optimization Wiki
Knowledge cutoff: 2026-05-15. All upstream PR data, doc summaries, and blog summaries reflect upstream state on or before this date (per
data/refresh-cutoff.yaml). Re-run the harvest tooling to advance the cutoff.
Query a structured, cross-referenced knowledge base of AMD GPU kernel optimization for CDNA3 (gfx942 / MI300), CDNA4 (gfx950 / MI350-MI355X), and RDNA4 (gfx1201) — 7,400+ merged-PR references, ~53 wiki synthesis pages, 20 doc/blog summaries, and 8 reference-repository studies.
Inspired by, and modeled on, MIT Han Lab's KernelWiki (the Blackwell/Hopper kernel knowledge base) — see the citation in the README.
When To Use This Skill
Trigger this skill when the user asks about:
- AMD matrix-core programming — v_mfma_* shapes/dtypes, AGPR accumulators, the gfx950 f8f6f4 unified low-precision path, MX (E8M0) block scaling
- CDNA memory model — LDS bank conflicts (32-bank gfx942 vs 64-bank gfx950), buffer vs global vs flat, OOB branchless guards, direct-to-LDS async copy, s_waitcnt (vmcnt/lgkmcnt) pipelining
- Kernel implementations — CK/hipBLASLt GEMM, FP8 block-scaled GEMM, FlashAttention-2 (CK-tile), paged attention, fused MoE, MLA decode, RMSNorm, bandwidth microbenchmarks
- Performance problems — bank conflicts, low occupancy, VGPR/AGPR pressure, memory-bound, idle matrix cores, tail effects, XCD/L2 locality
- DSLs & languages — HIP, GCN/CDNA assembly, Composable Kernel / CK-tile, FlyDSL, Triton (AMD backend), rocWMMA
- Migration — CUDA→HIP (cp.async→direct-LDS, mbarrier→s_waitcnt, wgmma→mfma), gfx942→gfx950 (FNUZ→OCP FP8!), RDNA WMMA vs CDNA MFMA
- PR references — "how did CK/AITER/hipBLASLt/Triton implement X on MI300/MI350?"
Do NOT use this skill for:
- Generic CUDA / NVIDIA-only kernel questions (use KernelWiki for Blackwell/Hopper)
- Host-side framework integration (model loading, request routing, scheduling)
- ROCm installation, driver, or environment troubleshooting
How To Query
First, cd into the skill directory (the clone root — where this SKILL.md
lives), then run the tools. The scripts auto-resolve the wiki root from their own
location, so once you cd there no environment variable is needed:
cd "$(dirname "$(find ~/.claude/skills -name SKILL.md -path '*ROCmKernelWiki*' | head -1)")"
# ...or just: cd ~/.claude/skills/ROCmKernelWiki
If you cannot cd (e.g. running from another working directory), call the
scripts by absolute path — they still resolve the wiki root correctly:
python3 ~/.claude/skills/ROCmKernelWiki/scripts/query.py --tag mfma --type kernel
All example commands below assume you have cd'd into the skill directory.
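As a rough illustration of how a script can find the wiki root from its own location, here is a minimal sketch. It assumes the scripts sit in a scripts/ directory directly under the clone root; `resolve_wiki_root` is a hypothetical helper, not necessarily what the shipped scripts do.

```python
from pathlib import Path


def resolve_wiki_root(script_file: str) -> Path:
    """Return the directory two levels above <root>/scripts/<name>.py.

    Hypothetical helper: assumes the layout <wiki root>/scripts/<script>.
    """
    return Path(script_file).resolve().parent.parent


# Inside a real script you would pass __file__; a literal path is used here
# only so the sketch can run anywhere.
root = resolve_wiki_root("/opt/skills/ROCmKernelWiki/scripts/query.py")
print(root.name)
```

Because the root is derived from the script path rather than the working directory, the same call works whether or not you have changed into the skill directory first.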
Path 1: Unified search (preferred for natural language)
python3 scripts/query.py "how to pipeline MFMA on MI300"
python3 scripts/query.py --tag mfma --type kernel
python3 scripts/query.py --repo composable_kernel --architecture gfx950 --limit 20
python3 scripts/query.py --symptom bank-conflicts --compact
Filters: --type, --tag, --repo, --language, --architecture,
--symptom, --confidence, --synthesis, --limit, --compact, --paths-only.
Results are ranked by an IDF-weighted score, with priors that surface curated wiki pages and
runnable examples above raw PR noise. Each hit shows a matched-text snippet.
Add --synthesis to restrict to curated wiki pages (skip the 7,400+ PR sources).
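To make the ranking idea concrete, the sketch below scores a few toy pages with an IDF-weighted term match multiplied by a per-type prior. The prior values, the `rank` function, and the `pr-example-*` ids are all illustrative assumptions; the actual weights and scoring in scripts/query.py may differ.

```python
import math
from collections import Counter

# Hypothetical priors: curated page types outrank raw PR pages.
TYPE_PRIOR = {"wiki": 2.0, "example": 1.5, "pr": 1.0}


def idf_table(docs):
    """Map each term to log(N / df), df = number of docs containing the term."""
    n = len(docs)
    df = Counter(term for d in docs for term in set(d["text"].lower().split()))
    return {term: math.log(n / count) for term, count in df.items()}


def rank(query, docs):
    """Return ids of matching docs, best first."""
    idf = idf_table(docs)
    terms = query.lower().split()
    scored = []
    for d in docs:
        words = set(d["text"].lower().split())
        # Sum the IDF of every query term present, then apply the type prior.
        score = sum(idf.get(t, 0.0) for t in terms if t in words)
        scored.append((score * TYPE_PRIOR[d["type"]], d["id"]))
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


docs = [
    {"id": "hw-mfma", "type": "wiki", "text": "mfma pipelining on gfx942"},
    {"id": "pr-example-1", "type": "pr", "text": "mfma pipelining fix"},
    {"id": "pr-example-2", "type": "pr", "text": "rmsnorm kernel cleanup"},
]
print(rank("mfma pipelining", docs))
```

With equal term matches, the wiki page sorts ahead of the PR page purely because of its prior, which is the effect the description above attributes to the real ranking.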
--tag and --architecture accept aliases — --tag XDLOP matches mfma,
--tag cp.async matches async-copy, --architecture MI300 matches gfx942,
--architecture MI355X matches gfx950.
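Alias resolution of this kind can be pictured as a simple lookup before filtering. The sketch below uses only the four aliases named above; the table names, the `canonical` helper, and the case-insensitive matching are assumptions made for illustration.

```python
# Alias tables built from the examples given above; the real vocabulary
# (see references/schema.md) is larger.
TAG_ALIASES = {"xdlop": "mfma", "cp.async": "async-copy"}
ARCH_ALIASES = {"mi300": "gfx942", "mi355x": "gfx950"}


def canonical(value, aliases):
    """Map an alias to its canonical name; unknown values pass through lowercased.

    Case-insensitive matching is an assumption of this sketch.
    """
    key = value.strip().lower()
    return aliases.get(key, key)


print(canonical("XDLOP", TAG_ALIASES))
print(canonical("MI355X", ARCH_ALIASES))
```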
Path 2: Fetch a specific page by id or path
python3 scripts/get_page.py kernel-fp8-gemm # wiki page lists "Implementing PRs"
python3 scripts/get_page.py pr-composable_kernel-1234 # PR page lists "Synthesized in" wiki pages
python3 scripts/get_page.py kernel-fp8-gemm --follow-sources
python3 scripts/get_page.py pr-composable_kernel-1234 --include-code --summary # compact diff
python3 scripts/get_page.py hw-mfma --body-only
Every wiki page now carries implemented_by: (the real PRs that built it) and
every linked PR carries related: back to the synthesis page — so you can hop
between "what it is" and "how it was actually implemented". Use --include-code --summary to read a PR's diff_summary.md (files + key changed lines) instead of
the full diff.
Path 3: Regex text search across wiki bodies and PR pages
python3 scripts/grep_wiki.py "v_mfma_f32_16x16x16"
python3 scripts/grep_wiki.py "global_load_lds" --only sources
python3 scripts/grep_wiki.py "ds_bpermute|mov_dpp" --any
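If it is unclear what an alternation pattern such as the last one will match, it can be checked with Python's re module first. This is a standalone sketch over made-up sample lines; it does not reproduce grep_wiki.py or the semantics of its --only and --any flags.

```python
import re

# Same alternation as in the last command above.
pattern = re.compile(r"ds_bpermute|mov_dpp")

# Made-up sample lines, used only to exercise the pattern.
lines = [
    "sample line mentioning v_mfma_f32_16x16x16",
    "sample line mentioning ds_bpermute",
    "sample line mentioning mov_dpp",
]

# Keep every line where either alternative occurs anywhere in the text.
hits = [ln for ln in lines if pattern.search(ln)]
for ln in hits:
    print(ln)
```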
Path 4: Pre-built cross-reference indices
Auto-generated under queries/:
- queries/by-problem.md — symptom → pattern page → candidate techniques
- queries/by-technique.md — techniques → every page that uses them
- queries/by-hardware-feature.md — mfma/lds/async-copy/mxfp/… → pages
- queries/by-kernel-type.md — gemm/attention/moe/… → pages
- queries/by-language.md — hip/gcn-asm/composable-kernel/flydsl/triton → pages
- queries/by-repo.md — all 7,400+ PRs across the tracked ROCm repos
Path 5: Primer, schema, examples
- references/primer.md — topic map; read first when the question is broad.
- references/schema.md — frontmatter schema, confidence/reproducibility ladders, controlled vocabulary, canonical aliases.
- references/examples.md — 7 worked query patterns.
Output Pattern
When answering from this KB:
- State the architecture. A fact true on gfx942 may differ on gfx950 — most importantly, gfx942 FP8 is FNUZ and gfx950 FP8 is OCP (not bit-compatible), and LDS is 64 kB/32-bank on gfx942 vs 160 kB/64-bank on gfx950.
- Cite specific pages with paths and IDs (e.g. wiki/hardware/mfma.md, hw-mfma).
- Follow sources: to trace claims to PRs/docs/blogs/refs.
- Respect confidence — verified > source-reported > inferred > experimental.
- Report performance claims with all fields — gpu, dtype, shape, metric, value, source_id.
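The last two rules lend themselves to a small check before an answer is written. The sketch below encodes the confidence ladder and the six performance-claim fields listed above; `PerfClaim`, `strongest`, and `missing_fields` are hypothetical names, and the record is a placeholder rather than real data.

```python
from dataclasses import dataclass, fields

# Confidence ladder from the list above, strongest first.
CONFIDENCE_ORDER = ["verified", "source-reported", "inferred", "experimental"]


def strongest(levels):
    """Return the highest-confidence level among those given."""
    return min(levels, key=CONFIDENCE_ORDER.index)


@dataclass
class PerfClaim:
    """The six fields a performance claim should carry (hypothetical container)."""
    gpu: str
    dtype: str
    shape: str
    metric: str
    value: float
    source_id: str


def missing_fields(record):
    """List the claim fields that are absent or empty in a dict record."""
    return [f.name for f in fields(PerfClaim) if record.get(f.name) in (None, "")]


# Placeholder record: incomplete on purpose, so it should not be reported as-is.
record = {"gpu": "MI300", "dtype": "fp8", "shape": "4096x4096x4096"}
print(missing_fields(record))
print(strongest(["inferred", "verified"]))
```

A claim with a non-empty `missing_fields` result is better stated qualitatively, or traced back through sources: until the gaps are filled.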
Knowledge Base Contents (cutoff 2026-05-15)
- 7,400+ PR reference pages across ROCm/composable_kernel, aiter, hipBLASLt, Tensile, rocBLAS, flash-attention, FlyDSL, triton, plus ROCm-filtered vLLM/SGLang
- ~53 wiki synthesis pages — hardware, techniques, kernels, patterns, languages, migration
- 20 doc/blog summaries + 8 reference-repository studies
- 9 candidate ledgers classifying every scanned PR (include/defer/exclude)
- 6 auto-generated query indices
- 959 real upstream PR diffs in artifacts/ + 12 runnable, hipcc-compiled kernel examples in examples/
- Validator scripts/validate.py — schema, vocabulary, link-integrity (0 errors)
Quality Guarantees
- Every hardware fact traces to an official AMD ISA doc / whitepaper.
- Every technique/kernel/language page carries a real code snippet.
- Every PR page has inclusion_reason and status: merged.
- verified pages carry evidence_basis (official-doc + upstream-code/paper).
- 0 dangling internal references (enforced by the validator).
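As an illustration of what a dangling-reference check involves, the sketch below scans page metadata for link targets that do not exist. It assumes pages are already loaded as a dict of id to frontmatter, and that implemented_by, related, and sources hold lists of page ids; `dangling_refs` is a hypothetical function, not the code in scripts/validate.py.

```python
def dangling_refs(pages):
    """Return (page_id, field, target) for every link to an unknown page id.

    pages: dict mapping page id -> frontmatter dict (assumed already parsed).
    """
    known = set(pages)
    missing = []
    for page_id, meta in sorted(pages.items()):
        for key in ("implemented_by", "related", "sources"):
            for target in meta.get(key, []):
                if target not in known:
                    missing.append((page_id, key, target))
    return missing


# Toy metadata using the example ids from the query section; hw-mfma is
# deliberately left out so the check has something to report.
pages = {
    "kernel-fp8-gemm": {"implemented_by": ["pr-composable_kernel-1234"]},
    "pr-composable_kernel-1234": {"related": ["kernel-fp8-gemm", "hw-mfma"]},
}
print(dangling_refs(pages))
```

An empty result corresponds to the "0 dangling internal references" guarantee above.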