name: html-metal-explainer description: Generate educational HTML reports that explain Metal GPU kernel internals, optimizations, and parallelism for the vllm-metal project. Use when the user asks to explain a kernel concept, compare kernel implementations, document an optimization proposal, or create onboarding material about Metal shaders, simdgroups, threadgroups, online softmax, paged attention, or any Apple Silicon GPU topic in vllm-metal. Also trigger when someone asks for a "tutorial", "explainer", "guide", or "report" about Metal kernels, or when they want to hand off kernel analysis to someone less familiar with GPU programming. user_invocable: true
HTML Metal Explainer
Generate polished, dark-themed HTML educational reports about Metal GPU kernel topics in vllm-metal. The output is a single self-contained .html file that opens in any browser — no build step, no dependencies.
When to Use
- Explaining a kernel optimization (proposed or existing)
- Comparing two kernel implementations (e.g., MLA vs GQA paged attention)
- Creating onboarding docs for someone unfamiliar with GPU/Metal programming
- Documenting a PR's kernel changes for review or handoff
- Any request for a "tutorial", "guide", or "explainer" about Metal kernel internals
Output Location
Save the HTML file to research/ in the project root (create the directory if needed). Use a descriptive kebab-case filename:
research/mla-block-batching-explainer.htmlresearch/simdgroup-parallelism-guide.htmlresearch/gqa-vs-mla-comparison.html
Open in the browser after writing: open <path>
Research Before Writing
Before generating the HTML, read the relevant source files to ground the explanation in real code. Key locations in vllm-metal:
vllm_metal/metal/kernels_v2/pagedattention.metal— GQA paged attention kernel (production, mature)vllm_metal/metal/kernels_v2/mla.metal— MLA paged attention kernel (if it exists)vllm_metal/metal/kernels_v2/utils.metal— shared utilitiesvllm_metal/metal/paged_ops.cpp— C++ nanobind dispatchervllm_metal/metal/__init__.py— Python entry pointsvllm_metal/paged_attention_backend/mla.py— MLA wrapper / routing- MLX reference kernels at
.venv-vllm-metal/lib/python3.12/site-packages/mlx/include/mlx/backend/metal/kernels/
Also read CLAUDE.md for project architecture context.
Document Structure
Every report follows this skeleton. Adapt section count and titles to the topic, but keep the progression from "background concepts" → "the specific topic" → "decision / summary":
Background sections (1-3) — build up prerequisite knowledge. Start from what the reader needs to know, not what you want to teach. For kernel topics, this usually means: GPU thread hierarchy, memory hierarchy, and the specific algorithm (attention, softmax, etc.).
Core analysis sections (1-3) — the actual topic. Show current code, explain the problem or comparison, propose or evaluate alternatives. Use concrete code from the repo, not hypothetical examples.
Decision / Summary section (1) — a clear verdict table and actionable next steps. Include a validation protocol (what tests to run, what to benchmark).
Design System
Use this exact CSS and HTML structure. The dark theme with GitHub-inspired colors is deliberate — it matches the terminal-centric workflow of kernel developers. See references/design-system.md for the full CSS and component catalog.
Core Components
Numbered section headers:
<h2><span class="num">1</span> Section Title</h2>
Concept callout boxes (4 variants):
<div class="concept"> <!-- blue: neutral info -->
<div class="concept warn"> <!-- yellow: gotcha or caveat -->
<div class="concept good"> <!-- green: key insight or win -->
<div class="concept danger"> <!-- red: pitfall or risk -->
<div class="label">Label Text</div>
<p>Content...</p>
</div>
ASCII diagrams (for parallelism layouts, memory hierarchies, data flow):
<div class="diagram">
Threadgroup (1024 threads)
├── Simdgroup 0 (32 lanes)
│ ├── Lane 0 → dims [0..15]
│ └── Lane 31 → dims [496..511]
└── Simdgroup 31
</div>
Decision boxes (for recommendations):
<div class="decision">
<h4>Decision: Should We Do X?</h4>
<p><span class="verdict yes">Yes</span> Explanation...</p>
<!-- or: <span class="verdict maybe">Defer</span> -->
<!-- or: <span class="verdict no">No</span> -->
</div>
Comparison tables:
<table>
<tr><th>Feature</th><th>Kernel A</th><th>Kernel B</th></tr>
<tr><td>exp function</td><td>fast::exp</td><td>fast::exp2</td></tr>
</table>
Inline highlights for key terms:
<span class="hl">blue highlight</span>
<span class="hl2">green highlight</span>
<span class="hl3">purple highlight</span>
<span class="hl4">orange highlight</span>
Writing Style
- Progressive disclosure: start from what the reader knows, build up. Don't dump all concepts at once.
- Concrete over abstract: always show real code from the repo, with file paths and line numbers. Never invent hypothetical kernels.
- Why before what: explain why a design choice was made before showing the code. GPU kernel decisions are driven by hardware constraints — surface those constraints.
- Quantify claims: "16 FMAs per token" not "many operations". "32 KB shmem limit on M1" not "limited memory".
- Use diagrams liberally: ASCII art in
<div class="diagram">blocks. GPU parallelism is inherently spatial — draw it. - One concept per callout box: don't overload
.conceptboxes. Each should teach exactly one thing.
Content Guidelines by Topic Type
Optimization Proposals
Structure: Problem → Current Code → The Idea (with diagram) → Concrete Fix (code sketch) → Expected Impact (table by regime) → Risks → Decision Box
Kernel Comparisons
Structure: What Each Kernel Does → Shared Concepts → Feature-by-Feature Table → Unique Contributions of Each → Summary
Concept Explainers (onboarding)
Structure: Analogy/Intuition → Formal Definition → How It Appears in This Codebase → Common Pitfalls → Exercises/Questions
PR Kernel Reviews
Structure: What Changed → Why It Matters → Algorithm Deep-Dive → Correctness Analysis → Performance Analysis → Decision
Regime Awareness
When discussing performance, always distinguish between:
- Bandwidth-bound (decode, G=1, small batch): memory load latency dominates. Compute savings are invisible.
- Compute-bound (prefill, G=4, large batch): arithmetic throughput dominates. Instruction count matters.
- Transitional (G=2, medium batch): both contribute. Measure to know which dominates.
Never claim an optimization "helps" without specifying the regime. Use tables with regime columns.