name: dev_kf_distill description: Distilled, self-contained algorithm for an LLM-driven, MAP-Elites-based GPU kernel optimizer — closed-loop generation guided by a static behavior descriptor, optimization-aware prompting, optional QD-gradient tracking, and a tiered evaluator with a continuous combined score. Use this skill when you want to (a) understand what makes such a loop work, (b) re-implement it in a smaller harness or new project, or (c) port it to a new GPU backend.
Rules
- This skill is research / documentation. Do not commit, push, or open PRs from it.
What this skill is
This skill documents one closed-loop algorithm for generating high-performance GPU kernels from a reference implementation, using an LLM as the only "mutation operator." It is a fully self-contained description: every file in this folder reads on its own, with no required reading outside the skill. There are no required cross-references to other skills, source trees, or experiment logs.
The algorithm and its parts:
┌─────────────────────────────────────────────────────────────────────────┐
│ sample_parent → build_prompt → LLM → extract_code → evaluate │
│ ↑ │ │
│ └────────────── update_archive ←────── score ───────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Six things make the loop work as well as it does. Each is documented in
its own file in this folder; this SKILL.md is the executive summary.
1. The behavior descriptor (4D, static, deterministic)
Every candidate is mapped to a 4-tuple of integer levels in [0..3]:
(memory_opt, compute_opt, parallelism_opt, esimd_opt)
The mapping is done by regex pattern matching on the source code — no execution, no LLM, no normalization. This is the single most important design choice: behavioral coordinates are a property of the text, not of the runtime, so they are reproducible across runs and cheap to compute.
The full grid is 4 × 4 × 4 × 4 = 256 cells. Each cell stores at most
one elite — the highest-scoring candidate that ever landed there.
2. MAP-Elites with island model
The archive is a MAP-Elites grid over the descriptor above, optionally sharded into N islands that evolve in parallel and migrate elites every K generations. Sampling from the archive is a blend of:
- exploitation (sample fitness-proportionally from current island top-K),
- exploration (sample from underexplored cells, including empty ones),
- uniform (random elite — break stagnation).
Mix ratios (default 0.7 / 0.2 / 0.1) and per-island migration are the
only "evolutionary" hyperparameters — there is no crossover, no mutation
operator in the GA sense. The mutation is the LLM call.
See: MAP_ELITES_LOOP.md
3. Optimization-aware prompting
Before every LLM call, a target profile is selected — a 4D coordinate the loop wants the next candidate to land on. The selection is one of:
mutate— small delta from parent's profile (most common),diversify— sample from underexplored cells,esimd_upgrade— push toward explicit-SIMD / Tensor-Core if the parent already has memory+compute+parallelism but no ESIMD.
The selected profile is rendered as concrete, actionable instructions
(e.g. "use intel_sub_group_block_read_us for B loads, add restrict")
and appended to the prompt. The instructions are keyed by
(dimension, backend, level).
This is the "WHAT TO DO, not WHY" half of the prompt — it tells the model which optimization to apply, while the descriptor classifier later confirms whether the model actually did it.
See: PROMPT_PIPELINE.md
4. QD-gradient transition tracking (optional)
For every parent → child transition the loop records:
(parent_coords, child_coords, parent_fitness, child_fitness, outcome)
outcome ∈ {improvement, neutral, regression, cell_discovery, elite_replacement}.
From the transition history a coarse "gradient" is estimated: which delta in descriptor space, applied from a given cell, historically yielded improvements. This gradient feeds back into:
- parent sampling (prefer parents whose neighborhood has positive gradient),
- prompt mutation hints ("based on history, try increasing parallelism_opt"),
- exploration (avoid directions with consistent regression).
Optional. The loop works without it; with it, it reduces wasted trials once the archive has any history.
See: QD_GRADIENT.md
5. Tiered evaluator + combined score
A candidate passes through up to four stages:
| Tier | What | Cost |
|---|---|---|
| T0 | extract code, syntax pre-check (compile in process) | seconds |
| T1 | full build (out-of-process) | seconds |
| T2 | correctness pytest (reference vs custom outputs) | ~10s |
| T3 | performance pytest (timed runs, min over N) | ~10–20s |
| T4* | profiler (vendor profiler) — only on promising | expensive |
The fitness score combines all of them:
combined_score = perf_score # 0..5 ladder
+ 3 * 1[correct AND speedup > 0] * speedup
perf_score is a coarse 0..5 ladder (0 = syntax error, 5 = correct), so
the score is monotone in "how far did this candidate get." The
+3 * speedup term makes correct-and-faster candidates dominate
correct-but-not-faster ones.
See: EVALUATOR.md
6. Compile-error + log feedback loop
Whatever the evaluator produces — compile error text, pytest traceback, profiler summary — is summarized and injected into the next trial's prompt as:
[COMPILE ERROR FEEDBACK FROM PREVIOUS TRIAL]block (top-3 most recent),- the parent's stdout/eval log, optionally rewritten by a "feedback LLM",
- structured profiler hints from a vendor profiler.
This is the closed-loop part. Without it, the LLM rediscovers the same build errors every iteration; with it, the loop converges roughly 2–3× faster on real tasks.
See: PROMPT_PIPELINE.md
When to use this skill
Pick this skill when you want to:
Understand the loop — read SKILL.md (here) plus the six files it links. Together they describe the algorithm completely.
Re-implement the loop somewhere else — a smaller agent, a different orchestration framework, a custom Python harness. Read the six files and follow PSEUDOCODE.py as a template. The pseudocode includes representative starter pattern dictionaries and instruction text so it is runnable end-to-end once you wire in an LLM client and a build/test harness.
Port to a new backend — adding TPU / ROCm / Metal / etc.: only two parts of the design are backend-specific — BEHAVIOR_DESCRIPTOR.md (the regex patterns) and PROMPT_PIPELINE.md (the per-
(dim, level)instruction text). The rest is backend-agnostic.Reason about a failure mode — if the loop is generating slow kernels, the bug is almost always in one of: descriptor mis-classification (item 1), bad target-profile selection (item 3), or score not rewarding what you want (item 5).
When NOT to use this skill
- You want a single, one-shot kernel from an LLM (no archive, no loop) — this skill is overkill. Just call the LLM with the reference + a good prompt.
- You only have one sample per generation budget (e.g. one LLM call total) — there is no archive to update; the algorithm degenerates.
- The reference kernel cannot be evaluated reliably (no build, no test) — the score channel collapses and elite selection becomes random.
Pseudocode summary
The whole algorithm in one page:
archive = MapElitesArchive(grid_shape=(4,4,4,4), num_islands=N)
archive.add(initial_program(reference_code))
gradient_tracker = TransitionTracker() # optional
for trial in range(max_iters):
branches = []
for _ in range(branches_per_iteration):
# 1. SAMPLE PARENT
if archive.is_empty():
parent = initial_program(reference_code)
else:
parent = archive.sample(
exploit=0.7, explore=0.2, random=0.1,
gradient_weight=0.3, # reweight by gradient_tracker
)
# 2. SELECT TARGET PROFILE (in optimization space)
target = pick_target_profile(
parent_profile=classify(parent.code),
strategy=config.exploration_strategy, # mutate / diversify / esimd_upgrade
underexplored=archive.underexplored_cells(),
)
# 3. BUILD PROMPT
prompt = render_main_template(
reference=reference_code,
parent=parent,
inspirations=archive.top_k_in_island(parent.island, k=3),
rag_examples=rag.lookup(reference_code, target),
hardware_specs=hardware,
)
prompt += optimization_instructions_for(target, backend)
prompt += previous_compile_errors(top_n=3)
prompt += gradient_tracker.mutation_hints(parent.profile)
# 4. LLM
candidate_code = extract_code(llm.complete(prompt))
# 5. EVALUATE (tiered)
result = evaluate(candidate_code) # syntax → build → correct → perf [→ profile]
# 6. SCORE + UPDATE
score = result.perf_score + 3 * (result.correct and result.speedup > 0) * result.speedup
candidate = Program(code=candidate_code, profile=classify(candidate_code), score=score)
became_elite = archive.add(candidate)
gradient_tracker.record(parent, candidate, became_elite)
branches.append(candidate)
# 7. EARLY-STOP / MIGRATE
if early_stopping_satisfied(archive, patience=2): break
if trial % migration_interval == 0: archive.migrate_islands()
return archive.best()
That's the entire algorithm. Each step is documented in one of the six files in this folder.
How to read this skill
Recommended order:
- First: this
SKILL.md(you're here). - BEHAVIOR_DESCRIPTOR.md — what the cells are.
- MAP_ELITES_LOOP.md — how the archive moves.
- PROMPT_PIPELINE.md — what a prompt looks like.
- EVALUATOR.md — how scoring works.
- QD_GRADIENT.md — the optional gradient layer.
- PSEUDOCODE.py — a single-file runnable sketch with representative starter pattern dictionaries and instruction snippets. Wire in an LLM client, a build, and a test harness; the rest is complete.
Each supporting doc cross-links back here. None of them refer outside this folder.
Empirical anchor (Intel B580 / matmul DPAS, in-house run)
A representative deep run of this algorithm produced these numbers, which serve as a sanity check that the loop, configured as documented, actually converges:
| Setting | Value |
|---|---|
| Hardware | Intel B580 (Battlemage G21, 20 Xe2 cores) |
| Problem | C[2048, 2048] = A[2048, 2560] × B[2560, 2048], FP16 |
| Reference (naive scalar) | 34.0 ms |
| LLM | Claude Opus 4.6 (one-shot per call, no tools) |
| Iterations | 7 trials × 4 branches = 28 candidates |
| Total wall time | 1115.7 s (18.6 min) |
| Compile success rate | 96.4 % (27 / 28) |
| Correctness pass rate | 89.3 % (25 / 28) |
| Best runtime | 1.22 ms |
| Best speedup vs reference | 27.9 × |
| XMX peak utilization | 18.3 % (of 96 TFLOPS theoretical) |
| Estimated total tokens | |
| Critical breakthrough | trial 4 — model discovered intel_sub_group_block_read_us for B loads + restrict, after mutate from the prior best |
Not a benchmark — an existence proof. Read it as: "the algorithm described in the rest of this skill, configured at default settings, can take a 34 ms naive matmul to 1.22 ms in 7 LLM-driven trials." The loop will not always achieve this; randomness in the LLM and the search matters. But the breakthrough at trial 4 is a textbook example of the optimization-aware prompt landing the model on the right cell after two mutate steps from the prior best, exactly as item 3 above describes.
Re-implementation effort
Roughly:
| Scope | LoC |
|---|---|
| Minimum viable loop (single island, no QD gradient, T0–T3 only) | ~500 |
| Production-quality, all six layers, multi-backend | 1500–2500 |
Most of the size in a production port goes into (a) backend-specific
regex pattern dictionaries (each backend needs its own), (b) backend-
specific instruction text per (dim, level), and (c) the per-task build
and test harness. The control loop itself is small.