hipfire-kernel-atlas

name: hipfire-kernel-atlas
description: Use Kernel Atlas to collect phase-aware hipfire measurements and render ISA Fit View visualizations for AMD GPU kernels, quant formats, and architectures. Use when a user asks how MQ/HFQ/HFP/Q8 quants occupy hardware, asks for an ASCII ISA visualization, wants to compare gfx1010/gfx1030/gfx11/gfx12 kernel fit, or wants an agent-readable "left on table" summary from Atlas rows.

Use this skill when the task is to explain or visualize how a hipfire quant format and kernel use an AMD GPU ISA target. The primary tool is scripts/kernel_atlas.py; this skill is a thin agent wrapper around that CLI.

Core Workflow

Collect or locate Atlas rows
- Prefer existing JSONL under .codeinsight+research/kernel-atlas/runs/.
- For autoregressive (AR) prefill/decode, collect with collect-ar.
- Use --profile-prefill / --profile-decode for AR rows when the user wants the ISA view scoped to runtime-hot kernels and tagged by op role.
- For speculative decode, collect with collect-dflash.
- Keep raw run data in .codeinsight+research/; the directory is ignored by version control and may contain private data.
Attach ISA metadata
- Use --isa-file for one known HSACO/code object.
- Use --isa-dir .hipfire_kernels/<arch> plus --isa-filter for a bounded set.
- Prefer --isa-output <path>.json so multiple rows reference one manifest.
Attach dispatch/source provenance
- Use --dispatch-provenance when rows have profiled kernel names.
- Prefer --dispatch-output <path>.json so multiple rows reference one manifest.
- Treat dispatch references as evidence to inspect, not proof of a unique runtime branch.
- Prefer rows with a known arch; source ranking is target-arch-aware when arch-specific kernel files exist.
Render the ISA Fit View
- Use .agents/skills/hipfire-kernel-atlas/render-fit.sh.
- If a row has artifacts.profile_kernels, the view joins profiled kernel names to ISA object kernel names/symbols and summarizes only matched objects.
- If a row has dispatch provenance, the view prints hot-kernel op/source/dispatch attribution.
- Report the rendered view plus a short readout of the likely limit and what is "left on table".
Ask Atlas for candidate experiments
- Use python3 scripts/kernel_atlas.py suggest --row ... --isa ... --dispatch ....
- Prefer --format markdown for humans and JSON for automation.
- Let suggest auto-load default history from .codeinsight+research/kernel-atlas/tasks/; use --history only for extra history paths.
- Treat suggestions as an experiment queue, not as predicted wins.
- Each suggestion should name the lever type, hot kernel, files, risk, rationale, and eval contract.
Create an optimization task
- Use python3 scripts/kernel_atlas.py task to turn a row into task.json and TASK.md.
- Include --allowed-file for every path an agent may edit.
- Include correctness commands for DFlash or risky runtime changes.
- Generated tasks strip known profiling/instrumentation env from eval and preserve the original row env as baseline.row_env.
Evaluate a candidate
- Use python3 scripts/kernel_atlas.py eval --task ... --runs 5 --warmup-runs 1 --output-dir ....
- Use --refresh-baseline first to write baseline.json; use --baseline <baseline.json> for candidate comparisons.
- Report result.json status, selected metric median, speedup, stability, and any failed command output tail.
- Treat the local ledger.jsonl as experiment lineage, not a public benchmark.
- If status is needs_baseline, do not claim a speedup; refresh or provide a clean baseline first.
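
The reporting rules in the evaluate step can be written as a small guard. This is a minimal sketch, not Atlas code: it assumes result.json is a JSON object with a `status` key, and the `stability` and `speedup` key names are hypothetical stand-ins for whatever the real file uses.

```python
import json
from pathlib import Path


def claim_guard(result: dict) -> str:
    """Turn an eval result into a readout that never overstates it."""
    status = result.get("status", "missing")
    if status == "needs_baseline":
        return "no claim: refresh or provide a clean baseline first"
    # "unstable" may surface as the status or as a separate field (assumed).
    if "unstable" in (status, result.get("stability")):
        return "no claim: unstable, rerun after DPM/thermal state settles"
    speedup = result.get("speedup")  # hypothetical key name
    if speedup is None:
        return f"status={status}: no speedup recorded, report medians only"
    return f"status={status}: {speedup:.3f}x vs baseline"


def read_result(path: str) -> dict:
    """Load result.json from an eval --output-dir."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `claim_guard(read_result(".../result.json"))` before writing the summary keeps the "do not claim a speedup" rule mechanical rather than left to judgment.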

Commands

Render an existing row:

.agents/skills/hipfire-kernel-atlas/render-fit.sh \
  --row .codeinsight+research/kernel-atlas/runs/atlas.jsonl \
  --row-index 0 \
  --isa .codeinsight+research/kernel-atlas/runs/isa.json
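
When a JSONL file holds several rows, it can help to list them before choosing --row-index. The following is a minimal sketch under two assumptions: each non-blank line is one JSON object, and the index is zero-based (the command above uses --row-index 0, but the CLI's exact indexing is not stated in this skill). Only `artifacts.profile_kernels` comes from this document; the helper names are illustrative.

```python
import json


def index_rows(jsonl_path):
    """Return [(index, row_dict), ...] for each non-blank JSONL line."""
    rows = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                rows.append(json.loads(line))
    return list(enumerate(rows))


def has_profile(row):
    """True when the row carries a non-empty artifacts.profile_kernels."""
    artifacts = row.get("artifacts") or {}
    return bool(artifacts.get("profile_kernels"))
```

Rows for which `has_profile` is true are the ones where the fit view can be scoped to profiled kernels.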

Collect a small AR smoke run with ISA metadata and dispatch provenance:

python3 scripts/kernel_atlas.py collect-ar \
  --model ~/.hipfire/models/qwen3.5-0.8b.mq4 \
  --workload qwen3.5-0.8b \
  --model-size 0.8b \
  --quant mq4 \
  --prefill 32 \
  --gen 5 \
  --kv-mode asym3 \
  --profile-prefill \
  --profile-decode \
  --isa-dir .hipfire_kernels/gfx1030 \
  --isa-filter 'gemm_hfq4g256|gemv_hfq4g256' \
  --isa-output .codeinsight+research/kernel-atlas/runs/isa-gfx1030.json \
  --dispatch-provenance \
  --dispatch-output .codeinsight+research/kernel-atlas/runs/dispatch-gfx1030.json \
  --output .codeinsight+research/kernel-atlas/runs/atlas-gfx1030.jsonl

Suggest candidate experiments from a profiled row:

python3 scripts/kernel_atlas.py suggest \
  --row .codeinsight+research/kernel-atlas/runs/atlas-gfx1201.jsonl \
  --row-index 1 \
  --isa .codeinsight+research/kernel-atlas/runs/isa-gfx1201.json \
  --dispatch .codeinsight+research/kernel-atlas/runs/dispatch-gfx1201.json \
  --format markdown

Create a bounded task from a profiled row:

python3 scripts/kernel_atlas.py task \
  --row .codeinsight+research/kernel-atlas/runs/atlas-gfx1201.jsonl \
  --row-index 1 \
  --isa .codeinsight+research/kernel-atlas/runs/isa-gfx1201.json \
  --dispatch .codeinsight+research/kernel-atlas/runs/dispatch-gfx1201.json \
  --allowed-file kernels/src/gemv_hfq4g256_multirow.hip \
  --output-dir .codeinsight+research/kernel-atlas/tasks/gfx1201-gemv-r4
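
An agent that assembles this command programmatically can make the "one --allowed-file per editable path" rule hard to forget. A minimal sketch, assuming --allowed-file may be repeated (the workflow text implies this but does not state it); `build_task_argv` is a hypothetical helper, and only flags shown in the command above are used.

```python
import shlex


def build_task_argv(row, row_index, isa, dispatch, allowed_files, output_dir):
    """Build the argv for `kernel_atlas.py task` with one flag per editable path."""
    if not allowed_files:
        raise ValueError("list at least one --allowed-file")
    argv = [
        "python3", "scripts/kernel_atlas.py", "task",
        "--row", row,
        "--row-index", str(row_index),
        "--isa", isa,
        "--dispatch", dispatch,
    ]
    for path in allowed_files:
        argv += ["--allowed-file", path]  # repeated flag (assumed supported)
    argv += ["--output-dir", output_dir]
    return argv


def as_shell(argv):
    """Quote the argv so it can be pasted into a shell."""
    return shlex.join(argv)
```

Passing the list to `subprocess.run(argv, check=True)` avoids shell quoting entirely; `as_shell` is only for logging the command that was run.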

Create a PyTorch-shape task for non-Qwen work:

python3 scripts/kernel_atlas.py task-pytorch \
  --name llama-rmsnorm-shape \
  --op rmsnorm \
  --input-shape 1,2048,4096 \
  --dtype float16 \
  --eval-command 'python3 bench_rmsnorm.py' \
  --allowed-file kernels/src/rmsnorm_candidate.hip \
  --output-dir .codeinsight+research/kernel-atlas/tasks/llama-rmsnorm-shape

Refresh a stable baseline and then evaluate a candidate:

python3 scripts/kernel_atlas.py eval \
  --task .codeinsight+research/kernel-atlas/tasks/gfx1201-gemv-r4/task.json \
  --runs 5 \
  --warmup-runs 1 \
  --refresh-baseline \
  --output-dir .codeinsight+research/kernel-atlas/tasks/gfx1201-gemv-r4/eval-baseline

python3 scripts/kernel_atlas.py eval \
  --task .codeinsight+research/kernel-atlas/tasks/gfx1201-gemv-r4/task.json \
  --baseline .codeinsight+research/kernel-atlas/tasks/gfx1201-gemv-r4/eval-baseline/baseline.json \
  --runs 5 \
  --warmup-runs 1 \
  --output-dir .codeinsight+research/kernel-atlas/tasks/gfx1201-gemv-r4/eval-001
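
How median, speedup, and stability relate can be illustrated with a worked example. This is a sketch with made-up numbers, not the Atlas implementation: the skill does not say how eval defines stability, so the relative spread below is a stand-in, and `summarize` is a hypothetical helper.

```python
from statistics import median


def summarize(baseline_runs, candidate_runs, higher_is_better=True):
    """Compare two sets of per-run metric values by their medians."""
    base = median(baseline_runs)
    cand = median(candidate_runs)
    # For throughput (tok/s) higher is better; for latency invert the ratio.
    speedup = cand / base if higher_is_better else base / cand
    # Stand-in stability measure: worst (max - min) / median across both sets.
    rel_spread = max(
        (max(runs) - min(runs)) / median(runs)
        for runs in (baseline_runs, candidate_runs)
    )
    return {
        "baseline_median": base,
        "candidate_median": cand,
        "speedup": speedup,
        "rel_spread": rel_spread,
    }


# Illustrative values only, shaped like --runs 5 output.
example = summarize([100, 102, 98, 101, 99], [110, 111, 109, 112, 108])
```

A speedup that is smaller than the run-to-run spread is not evidence of a win, which is why the stability field is reported next to the median.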

Interpretation Rules

- Treat the view as ISA fit, not full hardware occupancy. Establishing true occupancy also needs hardware counters, wave residency, clocks, cache behavior, and launch overlap.
- If matrix units are available but observed matrix ops are zero, ask whether the workload phase should route through WMMA/MFMA or whether it is a decode GEMV path where memory/launch dominates.
- If VGPR/SGPR usage or spill counts are high, prioritize register pressure and spill removal before claiming a bandwidth win.
- If the row is DFlash, do not treat tok/s alone as correctness evidence. Run the DFlash coherence gate before claiming a spec-decode improvement.
- If eval reports an unstable result, do not claim a win or a regression; tighten the run shape or rerun after DPM/thermal state settles.
- For PyTorch-shape tasks, treat the eval command as the source of truth until Atlas has a real PyTorch profiler/extractor producer.
- If the worktree is dirty, cite the row's provenance.diff_md5 and avoid comparing it as a shipped baseline.
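
Several of these rules can be applied as an ordered checklist. The sketch below is illustrative only: every key in `summary` is a hypothetical name rather than an Atlas field, and the skill does not define what counts as "high", so the thresholds are supplied by the caller for the target arch.

```python
def interpretation_notes(summary, *, vgpr_high, sgpr_high):
    """Return the caveats that apply to one row, in rule order."""
    notes = []
    if summary.get("matrix_units") and summary.get("matrix_ops", 0) == 0:
        notes.append(
            "matrix units idle: should this phase use WMMA/MFMA, "
            "or is it a decode GEMV path where memory/launch dominates?"
        )
    pressured = (
        summary.get("spills", 0) > 0
        or summary.get("vgprs", 0) >= vgpr_high
        or summary.get("sgprs", 0) >= sgpr_high
    )
    if pressured:
        notes.append("register pressure: fix spills before claiming a bandwidth win")
    if summary.get("dflash"):
        notes.append("DFlash: run the coherence gate before claiming an improvement")
    if summary.get("dirty"):
        md5 = summary.get("diff_md5", "unknown")
        notes.append(f"dirty worktree: cite provenance.diff_md5={md5}")
    return notes
```

An empty list means none of these caveats fired; it does not mean the kernel is well fitted, since the view is still not an occupancy measurement.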

Good Agent Output

Include:

- the rendered ASCII fit view, or the most relevant section of it
- the row path and ISA manifest path
- arch, quant, phase, and shape bucket
- the runtime metric used for the readout
- one concise interpretation of the likely limit and what is "left on table"

Avoid:

- calling the heuristic a roofline model
- claiming a perf win from smoke runs
- mixing rows from different prompts or dirty binaries without saying so
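
The checklist above maps naturally onto a small report structure. This is a sketch of one possible layout, not a format that Atlas or render-fit.sh defines; `FitReport` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FitReport:
    fit_view: str          # rendered ASCII view, or the relevant section
    row_path: str
    isa_manifest: str
    arch: str
    quant: str
    phase: str
    shape_bucket: str
    metric: str            # runtime metric used for the readout
    interpretation: str    # likely limit and what is left on table
    smoke_run: bool = False

    def render(self) -> str:
        lines = [
            self.fit_view.rstrip(),
            f"row: {self.row_path}",
            f"isa manifest: {self.isa_manifest}",
            f"arch={self.arch} quant={self.quant} "
            f"phase={self.phase} shape={self.shape_bucket}",
            f"metric: {self.metric}",
            f"readout: {self.interpretation}",
        ]
        if self.smoke_run:
            # Smoke runs must not be presented as perf evidence.
            lines.append("note: smoke run, no perf claim")
        return "\n".join(lines)
```

Making every field required means a report cannot be produced without the row path and ISA manifest, which keeps the readout traceable.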