name: cpu-kernels
description: "Provides guidance for writing, optimizing, and benchmarking C++ CPU kernels with SIMD intrinsics (AVX2/AVX512) for the Hugging Face kernels ecosystem. Includes a two-phase workflow: Phase 1 correctness (generic → AVX2) and Phase 2 performance exploration (AVX512 with branching trial loop), runtime CPU dispatch, OpenMP threading, and brgemm integration for GEMM-heavy kernels."
disable-model-invocation: false
user-invocable: true
allowed-tools: "Read, Grep, Glob, Bash"
argument-hint: "kernel type: rmsnorm, flash-attention, quantized-gemm, activation, reduction, optimize, benchmark"
CPU C++ Kernels for x86 Processors
This skill provides patterns and guidance for developing optimized C++ kernels targeting x86 CPUs (Intel Xeon and compatible processors) with AVX2 and AVX512 intrinsics. Kernels are compiled via kernel-builder and distributed through the Hugging Face kernels ecosystem.
Who runs these commands? You, the agent, not a human. This is an autonomous loop: you write/edit the C++ kernel, build it, then run the scripts below as tools (via Bash) to check correctness, benchmark, and profile. You read each result, record it with trial_manager.py, decide the next change from the Phase 2 decision tree, and repeat until you hit early_stop_speedup or run all max_trials.
Key Concepts (read before the Quick Start)
The commands use a few names that mean different things. They are not interchangeable:
| Name (example) | What it is | Used by |
|---|---|---|
| baseline.py | The PyTorch reference implementation you optimize against. It is the ground truth for correctness and the speed reference for speedup. It must define get_inputs() and either get_reference_output() or a Model class (plus optional get_init_inputs()). You write this file (or it is given) before starting. | every script |
| my_rmsnorm | A trial-tree label: an arbitrary name you pick for this optimization task. trial_manager.py stores all attempts under trials/my_rmsnorm/. It is only a tracking ID. | trial_manager.py only |
| my_kernel | The installed Python package name: the build artifact produced by kernel-builder build + pip install. This is the importable module that contains your compiled kernel. | --kernel-package |
| my_kernel.rms_norm | A <package>.<function> path: the actual callable inside the installed package. Passed to --op to tell the benchmark/profiler which function to run. | --op |
⚠️ --op means two different things depending on the script. In analyze_op.py, --op is a plain operation name (e.g. "rms_norm") used to look up compute/memory characteristics. In benchmark_cpu.py and cpu_profiler.py, --op is a package.function path (e.g. my_kernel.rms_norm) used to import and call your kernel. Same flag, different meaning; read each command below carefully.
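
The baseline.py contract in the table above (get_inputs() plus a Model class, with optional get_init_inputs()) could look roughly like the sketch below for RMSNorm. The hidden size, eps default, and the way the scripts call these functions are assumptions made for illustration; check benchmark_cpu.py for the exact calling convention.

```python
# baseline.py: a hypothetical RMSNorm reference, not the file shipped with any kernel.
import torch


class Model(torch.nn.Module):
    """PyTorch RMSNorm reference: y = x / sqrt(mean(x^2) + eps) * weight."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in fp32 so the reference is not limited by the input dtype.
        x32 = x.float()
        inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x32 * inv_rms * self.weight.float()).to(x.dtype)


def get_init_inputs():
    # Constructor arguments for Model (assumed convention: Model(*get_init_inputs())).
    return [4096]


def get_inputs():
    # One of the Quick Start shapes: 1024 x 4096. Seeded for reproducible runs.
    torch.manual_seed(0)
    return [torch.randn(1024, 4096)]
```

With the weight left at ones, every output row has a mean square close to 1, which is a quick sanity check on the reference before benchmarking against it.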
Quick Start
Write a New CPU Kernel
The example below optimizes an RMSNorm kernel. The trial label is my_rmsnorm, the built package is my_kernel, and its function is my_kernel.rms_norm — keep these consistent across all six steps.
# 1. Analyze the target op. Here --op is an OPERATION NAME (looked up in the
# knowledge base), not a package path.
python scripts/analyze_op.py --op "rms_norm" --shapes "1024x4096,2048x8192"
# 2. Initialize trial tracking. Args: <trial-label> <baseline-file>.
# Creates trials/my_rmsnorm/ and records baseline.py as the reference.
python scripts/trial_manager.py init my_rmsnorm baseline.py
# 3. Build the kernel package (produces the installable 'my_kernel' wheel).
cd /path/to/my-kernel && kernel-builder build --release && pip install dist/*.whl --force-reinstall
# 4. Benchmark correctness + performance. Here --op is a PACKAGE.FUNCTION path.
# Compares my_kernel.rms_norm against baseline.py (correctness + speedup).
python scripts/benchmark_cpu.py baseline.py --kernel-package my_kernel --op my_kernel.rms_norm
# 5. Profile with perf stat (same package.function path as step 4).
python scripts/cpu_profiler.py --kernel-package my_kernel --op my_kernel.rms_norm
# 6. Finalize: promote the best trial in trials/my_rmsnorm/ into output/.
python scripts/trial_manager.py finalize my_rmsnorm output/
Supported Hardware
| ISA | Extensions | Key Instructions | Typical CPUs |
|---|---|---|---|
| AVX2 | FMA, F16C | _mm256_fmadd_ps, _mm256_cvtph_ps | Most x86 CPUs (2013+) |
| AVX512 | F, BF16, VL, DQ, BW, VBMI | _mm512_dpbf16_ps, _mm512_permutexvar_epi16 | Intel Xeon |
GEMM Acceleration: brgemm
For kernels that involve matrix multiplication (quantized GEMM, Flash Attention, MoE), large-M cases use at::native::cpublas::brgemm() — a PyTorch wrapper around oneDNN brgemm, which internally dispatches to AMX tile instructions on Intel Xeon (4th Gen+). Small-M cases (M ≤ 4 for bf16) fall back to hand-written tinygemm using AVX512 _mm512_dpbf16_ps. See brgemm_patterns.yaml for details.
Note: brgemm is NOT used in element-wise kernels (RMSNorm, activations, reductions). Those use AVX512 intrinsics directly.
When This Skill Applies
Use this skill when:
- Writing C++ CPU kernels with SIMD intrinsics for the HF kernels ecosystem
- Optimizing existing CPU kernels (e.g., adding AVX512 to a generic implementation)
- Implementing quantized GEMM kernels (INT4, NF4, FP4, FP8, MXFP4)
- Implementing Flash Attention or other attention kernels for CPU
- Building kernels with kernel-builder that target backend = "cpu"
Two-Phase Optimization Workflow
CPU kernel development has two distinct phases with different strategies.
Configuration — Read config.yaml first
At the start of every session, read scripts/config.yaml. It controls:
- max_trials: hard cap on Phase 2 optimization trials
- early_stop_speedup: speedup vs PyTorch baseline to trigger early stop (default: 3.0)
- perf_stat_enabled: if true, use perf stat for profiling (default)
- vtune_enabled: if true, use VTune for detailed microarchitecture analysis
- build_command: command to build the kernel package
Rules — Never Violate
- ONLY modify C++ kernel files (.cpp, .hpp), torch_binding.cpp, and build.toml. Do NOT create benchmark or test scripts.
- NEVER write custom timing code; ONLY use scripts/benchmark_cpu.py.
- If a tool fails, STOP and report the error. Do NOT work around it with custom scripts.
- Generated kernels must follow the runtime dispatch pattern with cpu_features.hpp; see references/runtime_dispatch.yaml.
- Every kernel should have a generic ATen fallback that works on any CPU. If a specific path cannot have a meaningful fallback, use TORCH_CHECK(false, ...) with a clear error message.
- Each SIMD tier (AVX2, AVX512) must be in a separate translation unit (.cpp file) with its own compiler flags in build.toml. Do NOT mix intrinsics from different ISA levels in the same file.
- All SIMD implementations must handle edge cases (hidden_size not divisible by vector width).
- AVX2 tier is optional: most CPU kernels go directly from generic fallback to AVX512. Only add AVX2 when it provides meaningful benefit for element-wise ops.
- You MUST run all max_trials trials in Phase 2. Do NOT stop early due to plateau; the only valid early stop is speedup > early_stop_speedup.
Mandatory Tools
| Tool | Command | Purpose |
|---|---|---|
| Analyze | python scripts/analyze_op.py --op <op_name> --shapes <shapes> | Analyze PyTorch op: compute/memory characteristics, SIMD strategy recommendations |
| Validate | python scripts/validate_cpu_kernel.py <kernel_dir> | Static checks: alignment, OpenMP usage, intrinsics correctness, build.toml validation |
| Build | kernel-builder build --release | Compile C++ kernel via build.toml into a wheel |
| Benchmark | python scripts/benchmark_cpu.py <baseline_file> --kernel-package <pkg> --op <func> | Correctness + performance via torch.utils.benchmark |
| Profile | python scripts/cpu_profiler.py --kernel-package <pkg> --op <func> | perf stat hardware counters + optimization recommendations |
| Trial Manager | python scripts/trial_manager.py <command> ... | Trial tree management (init/save/result/status/best/finalize) |
Benchmark discipline: Pin to a single NUMA node with numactl --cpunodebind=0 --membind=0 python scripts/benchmark_cpu.py .... See threading_patterns.yaml.
Phase 1: Correctness (Linear, No Branching)
Build the kernel tier by tier. Each tier must be correct before moving on.
Tier 0: Generic Fallback
- Implement using PyTorch ATen ops only (no intrinsics).
- This serves as the portable baseline that runs on any CPU.
- Must produce results matching the PyTorch reference within tolerance.
- File: <kernel>_cpu/<kernel>_cpu.cpp
# Validate + build
python scripts/validate_cpu_kernel.py .
kernel-builder build --release
pip install dist/*.whl --force-reinstall
# Benchmark (this also establishes the PyTorch baseline time)
python scripts/benchmark_cpu.py baseline.py --kernel-package my_kernel --op my_kernel.rms_norm
Tier 1: AVX2 (Optional)
- Add AVX2 implementation using _mm256_* intrinsics.
- Compile with -mavx2 -mfma -mf16c -fopenmp.
- Must be correct; performance improvement is a bonus.
- File: <kernel>_cpu/<kernel>_avx2.cpp
Tier 2: AVX512
- Add AVX512 implementation using _mm512_* intrinsics.
- Compile with -mavx512f -mavx512bf16 -mavx512vl -mavx512dq -mavx512bw -mavx512vbmi -mfma -mf16c -fopenmp.
- For GEMM kernels using brgemm, additionally add -mamx-tile -mamx-bf16 -mamx-int8.
- This is the entry point to Phase 2: performance optimization starts here.
- File: <kernel>_cpu/<kernel>_avx512.cpp
Phase 2: Performance Exploration (Branching, With Backtracking)
Once AVX512 is correct, optimize for peak performance. This phase uses the trial manager with branching.
# Initialize Phase 2 trials
python scripts/trial_manager.py init <kernel_name> baseline.py
# For each trial:
# 1. Modify kernel code (AVX512 intrinsics, or brgemm for GEMM kernels)
# 2. Build
kernel-builder build --release && pip install dist/*.whl --force-reinstall
# 3. Benchmark
python scripts/benchmark_cpu.py baseline.py --kernel-package <pkg> --op <func>
# 4. Save trial
python scripts/trial_manager.py save <kernel_name> <dir> --parent <parent_id> --strategy "description"
# 5. Record result
python scripts/trial_manager.py result <kernel_name> <trial_id> --correctness pass --speedup <float> --baseline_us <float> --kernel_us <float>
# 6. Profile (after t1, or when plateaued)
python scripts/cpu_profiler.py --kernel-package <pkg> --op <func>
Phase 2 Decision Tree
| Condition | Action |
|---|---|
| Speedup > early_stop_speedup | Stop: excellent result (the only valid early stop) |
| Speedup improved | Continue on this branch, try next optimization |
| Speedup regressed | Branch back to best trial, try different strategy |
| Correctness failed | Fix on same branch (usually alignment or SIMD boundary bug) |
| After t1 (if perf_stat_enabled) | Run cpu_profiler.py: mandatory first profile |
| IPC < 1.0 | Memory bound → add prefetch, change cache blocking |
| L1 miss rate high | Tile too large for L1 → reduce tile size |
| L3 miss rate high | Working set too large → add cache blocking |
| Plateau after 2+ trials | Do NOT keep tuning the same knobs. Change the approach: switch algorithm path (tinygemm ↔ brgemm), change the fusion/blocking/data-layout strategy, or reconsider the dispatch heuristic. A different structure beats endless parameter sweeps. |
| Max trials reached | Stop — must run all max_trials from config.yaml |
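
The result-driven rows of the table above can be read as a small state machine. The sketch below is one possible encoding, not part of trial_manager.py: the function name, the returned labels, the row priority, and the definition of speedup as baseline_us / kernel_us are all assumptions. The profiler-driven rows (IPC, miss rates) and the plateau row need judgment and are left out.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    correct: bool
    speedup: float  # assumed to be baseline_us / kernel_us


def next_action(result, best_speedup, trials_done, max_trials, early_stop_speedup=3.0):
    """Map one trial result to the next step (hypothetical helper)."""
    if not result.correct:
        return "fix-on-same-branch"       # usually an alignment or SIMD boundary bug
    if result.speedup > early_stop_speedup:
        return "stop-early"               # the only valid early stop
    if trials_done >= max_trials:
        return "stop-max-trials"
    if result.speedup > best_speedup:
        return "continue-branch"          # improved: try the next optimization here
    return "branch-back-to-best"          # regressed or flat: try a different strategy
```

For example, a correct trial at 2.5x when the best so far is 2.0x continues on the same branch, while 1.5x sends the loop back to the best trial.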
Optimization Search Space (Phase 2)
These tables are a starting menu of values seen in existing kernels, not an exhaustive recipe. Use them to seed trials, but when a branch plateaus, prefer a structurally different idea (algorithm, fusion, memory strategy) over sweeping these knobs further. See the try-harder tree in optimization_levels.yaml.
GEMM kernels (quantized GEMM, Flash Attention, MoE):
| Dimension | Actual Values in Existing Kernels | Notes |
|---|---|---|
| BLOCK_M (tinygemm path) | 4 | Small M, fused dequant+GEMM |
| BLOCK_M (brgemm path) | 32 (= 2×TILE_M) | Must be multiple of TILE_M=16 |
| BLOCK_N (tinygemm) | 32, 64 | Determines register tile COLS = BLOCK_N/16 |
| BLOCK_N (brgemm) | 32 (= 2×TILE_N) | Must be multiple of TILE_N=16 |
| BLOCK_K | 128 (= 4×TILE_K) | K-dimension blocking |
| BLOCK_M/N (flash-attn2) | 256 / 768 | Much larger — attention-specific |
| K-loop unroll | 4 (#pragma GCC unroll 4) | All GEMM kernels use 4 |
| Prefetch distance | 0 (disabled), 64 elements ahead | L1 prefetch via _MM_HINT_T0 |
| Algorithm path | use_brgemm threshold (e.g. M > 4) | Switch from tinygemm to brgemm |
| brgemm dequant policy | use_brgemm_dequant_out (e.g. M > 100) | True=pre-dequant all B upfront; False=dequant per K-block |
| L2 cache budget | 1 MB (50% of 2 MB L2) | Controls N-blocking in loop_2d |
| Thread decomposition | 2D factorization: nth_m × nth_n | Based on M/N aspect ratio |
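
Two rows of the table above are easier to reason about with numbers: the 2D thread factorization and the L2 cache budget. The sketch below shows one plausible way to compute each. Both helpers are hypothetical and are not the logic used by the existing kernels or by loop_2d; in particular, sizing the N block so that a K x n_block panel fits the budget is an assumption.

```python
import math
from typing import Tuple


def factor_threads(num_threads: int, m: int, n: int) -> Tuple[int, int]:
    """Pick (nth_m, nth_n) with nth_m * nth_n == num_threads whose ratio is
    closest, on a log scale, to the M/N aspect ratio."""
    target = math.log(m / n)
    best, best_err = (1, num_threads), float("inf")
    for nth_m in range(1, num_threads + 1):
        if num_threads % nth_m:
            continue  # only exact factorizations keep every thread busy
        nth_n = num_threads // nth_m
        err = abs(math.log(nth_m / nth_n) - target)
        if err < best_err:
            best, best_err = (nth_m, nth_n), err
    return best


def n_block_for_l2(k: int, elem_bytes: int = 2, budget_bytes: int = 1 << 20,
                   block_n: int = 32) -> int:
    """Largest multiple of block_n whose K x n_block panel fits the budget,
    but never less than one block."""
    cols = budget_bytes // (k * elem_bytes)
    return max(block_n, cols // block_n * block_n)
```

With 16 threads, a square 1024 x 1024 problem splits 4 x 4, while a skinny M=32, N=8192 problem puts all 16 threads along N. For bf16 with K=4096, a 1 MB budget holds a panel 128 columns wide.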
Element-wise kernels (RMSNorm, activations):
| Dimension | Actual Values | Notes |
|---|---|---|
| Vectorization width | 16 (fp32), 32 (bf16) | Per-type VEC_ELEM_NUM |
| Prefetch hint | _MM_HINT_T1 (L2) | Different from GEMM (L1) |
| OpenMP grain size | 1024 | at::parallel_for grain |
| Threading | #pragma omp parallel for over rows | Simple 1D parallelism |
Phase 2 Finalization
python scripts/trial_manager.py finalize <kernel_name> output/
# Re-run benchmark without cached baseline for final accurate comparison
python scripts/benchmark_cpu.py baseline.py --kernel-package <pkg> --op <func>
Reference Docs — Read During Phase 1
| Doc | Contents |
|---|---|
| references/runtime_dispatch.yaml | cpu_features.hpp pattern, dispatch tiers |
| references/build_system.yaml | build.toml multi-target CPU compilation |
| references/implementation_reference.md | C++ kernel templates, Unroll<N>, tinygemm, torch_binding.cpp |
| references/correctness.yaml | Critical constraints: alignment, FTZ/DAZ, denormals |
Reference Docs — Read During Phase 2
| Doc | Contents |
|---|---|
| references/simd_optimization_patterns.yaml | AVX2/AVX512 vector abstractions and patterns |
| references/quantized_gemm_patterns.yaml | LUT + tinygemm + Unroll template for 4-bit GEMM |
| references/brgemm_patterns.yaml | brgemm API usage, VNNI packing, tinygemm vs brgemm selection (GEMM kernels only) |
| references/memory_patterns.yaml | Prefetch, alignment, cache blocking |
| references/threading_patterns.yaml | OpenMP parallel patterns |
| references/dtype_optimizations.yaml | bf16/fp8/int8 handling and conversion on CPU |
| references/optimization_levels.yaml | Progressive L1→L5 optimization checklist + try-harder tree |
| references/optimization_strategies.md | Strategy reference, decision tree, checklist |
| references/workflow_details.md | Detailed trial loop workflow |
| references/huggingface-kernels-integration.md | Hub integration for CPU kernels |
Core CPU Kernel Patterns
Runtime Dispatch (Required for All Kernels)
Every CPU kernel has its own cpu_features.hpp (in its own namespace) and dispatches at runtime. Most kernels dispatch as AVX512 → fallback (no AVX2 tier):
// my_kernel_cpu/cpu_features.hpp: each kernel has its OWN copy
namespace my_kernel_cpu {
class CPUFeatures {
public:
    static bool hasAVX512BF16() { /* CPUID + XCR0 checks */ }
    static bool hasAVX2() { /* CPUID check */ }
    // GEMM kernels also check: static bool hasAMX() { ... }
};
}

// my_kernel_cpu/my_kernel_cpu.cpp: dispatcher
#include "cpu_features.hpp"
#include "my_kernel_avx512.hpp"
void my_kernel(torch::Tensor& out, const torch::Tensor& input, ...) {
    if (CPUFeatures::hasAVX512BF16()) {
        avx512::my_kernel_impl(out, input, ...);
    } else {
        // ATen fallback: inline or in a separate _fallback.cpp
        out = torch::some_aten_op(input, ...);
    }
}
Note: Only rmsnorm has a three-tier dispatch (AVX512 → AVX2 → ATen). GEMM kernels skip AVX2. Flash-attn2 additionally requires AMX via hasAllRequiredFeatures().
Full pattern: runtime_dispatch.yaml
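
Before benchmarking, it can help to confirm which tier the host would take. The sketch below mirrors the three-tier dispatch order in Python by reading the feature flags Linux exposes in /proc/cpuinfo. It is a host-side aid only, not a replacement for the CPUID + XCR0 checks in cpu_features.hpp. The helper names are hypothetical, and the flag spellings (for example avx512_bf16, avx512vbmi) are the Linux ones as far as I know.

```python
# Hypothetical host check; flag names follow Linux /proc/cpuinfo spellings.
AVX512_FLAGS = {"avx512f", "avx512_bf16", "avx512vl", "avx512dq", "avx512bw", "avx512vbmi"}
AVX2_FLAGS = {"avx2", "fma", "f16c"}


def parse_flags(cpuinfo_text):
    """Return the feature-flag set of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def select_tier(flags):
    """Mirror the dispatch order: AVX512 first, then AVX2, then the generic fallback."""
    if AVX512_FLAGS <= flags:
        return "avx512"
    if AVX2_FLAGS <= flags:
        return "avx2"
    return "generic"
```

On a Linux host, select_tier(parse_flags(open("/proc/cpuinfo").read())) reports the tier. Kernels without an AVX2 tier would treat "avx2" the same as "generic".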
build.toml Multi-Target Compilation
Each SIMD tier is a separate [kernel.*] section with its own compiler flags. The include directive is required for header resolution:
[kernel.my_kernel_cpu]
backend = "cpu"
depends = ["torch"]
include = ["my_kernel_cpu"]
src = [
"my_kernel_cpu/my_kernel_cpu.cpp",
"my_kernel_cpu/my_kernel_cpu_torch.cpp",
"my_kernel_cpu/my_kernel_cpu.hpp",
"my_kernel_cpu/cpu_features.hpp",
]
[kernel.my_kernel_cpu_avx512]
backend = "cpu"
# Note: For GEMM kernels (e.g., flash-attn2, megablocks), you must also include "-mamx-tile", "-mamx-bf16", "-mamx-int8"
cxx-flags = ["-mavx512f", "-mavx512bf16", "-mavx512vl", "-mavx512dq", "-mavx512bw", "-mavx512vbmi", "-mfma", "-mf16c", "-fopenmp"]
depends = ["torch"]
include = ["my_kernel_cpu"]
src = [
"my_kernel_cpu/my_kernel_avx512.cpp",
"my_kernel_cpu/my_kernel_avx512.hpp",
]
Note: Every section needs include = ["<kernel_dir>"] for header resolution. The _torch.cpp file bridges Python-facing declarations to the C++ dispatcher. AVX2 section is optional (only rmsnorm has one).
Full pattern: build_system.yaml
torch_binding.cpp Registration
All kernels use registration.h macros for op registration:
#include "registration.h"
// Forward declarations
#if defined(CPU_KERNEL)
torch::Tensor my_kernel_cpu_forward(torch::Tensor input, torch::Tensor weight, float eps);
#endif
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
    ops.def("forward(Tensor input, Tensor weight, float eps) -> Tensor");
    ops.impl("forward", torch::kCPU, &my_kernel_cpu_forward);
}
REGISTER_EXTENSION(TORCH_EXTENSION_NAME)
Note: registration.h is provided by kernel-builder. Multi-device kernels (rmsnorm, megablocks) use #if defined(CPU_KERNEL) / #elif defined(CUDA_KERNEL) guards.
Vector Type Abstractions (AVX512)
Wrap raw intrinsics in typed vector classes for readability:
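// cpu_types_avx512.hpp
#include <immintrin.h>  // AVX512 intrinsics and the __m512 type

// 16 packed fp32 lanes in one 512-bit register.
struct FP32Vec16 {
    __m512 reg;
    FP32Vec16(float v) : reg(_mm512_set1_ps(v)) {}  // broadcast v to all 16 lanes
    FP32Vec16(__m512 r) : reg(r) {}
    FP32Vec16 operator*(const FP32Vec16& other) const {
        return FP32Vec16(_mm512_mul_ps(reg, other.reg));
    }
    // Horizontal sum of all 16 lanes.
    float reduce_sum() const { return _mm512_reduce_add_ps(reg); }
};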
Full pattern: simd_optimization_patterns.yaml
Quantized GEMM Template (INT4/NF4/FP4)
All 4-bit quantized GEMM kernels share the same skeleton — only the LUT and zero-point handling differ:
nibble split → zero subtract → LUT lookup → _mm512_dpbf16_ps accumulate → scale fmadd (per group) → bf16 output
The parameterized components:
- LUT: GPTQ (linear INT4), BnB (NF4/FP4), MegaBlocks (FP8/MXFP4)
- Zero-point: per-group (GPTQ), none/encoded in LUT (BnB), per-block (FP8)
- Algorithm: tinygemm (small M, fused) vs brgemm (large M, unpack+BLAS)
- Weight conversion: The C++ kernel expects a specific block-interleaved format, NOT raw checkpoint format. Each framework converts in its own repo:
  - GPTQ: transform_cpu() unpacks int32→uint8, reorders by g_idx, transposes to [N,K]; then convert_weight_packed_zp() repacks to [N,K/2] block-interleaved (BLOCK_N=32). Zeros unpacked to [groups,N] uint8. Scales to bf16. Done at first forward in GPTQModel repo.
  - BnB: _convert_weight_packed_for_cpu() unpacks uint8 nibbles→[N,K], repacks to [N,K/2] block-interleaved (same algo as GPTQ). Denests nested absmax. Transposes scales to [K/blocksize,N] bf16. Done at first forward in bitsandbytes repo.
  - Megablocks MoE: ops.convert_weight_packed() does transpose+VNNI pack. ops.convert_scale_packed() reorders scales. Cached via packed_weight=True.
- VNNI Conversion (K/V Activations):
  - Flash Attention: pack_vnni() per tile per forward (K/V change every call, so caching is not possible).
- Element-wise (RMSNorm): No conversion needed.
Full pattern: quantized_gemm_patterns.yaml, weight conversion: brgemm_patterns.yaml
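
A scalar reference makes the arithmetic of the pipeline above easy to check against a SIMD kernel. The sketch below follows the same steps for one output element under loudly simplified assumptions: linear INT4 (so the LUT is the identity and is omitted), low nibble stored first, one zero point and one scale per group, and plain sequential packing rather than the block-interleaved layout the real kernels expect. The function names are hypothetical.

```python
def unpack_nibbles(packed):
    """Split each byte into two 4-bit values. Low-nibble-first order is an assumption."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def dot_int4(x, packed, zeros, scales, group_size):
    """Dot product of activations x with one packed 4-bit weight row."""
    q = unpack_nibbles(packed)
    assert len(q) == len(x) and len(x) % group_size == 0
    total = 0.0
    for g in range(len(x) // group_size):
        start = g * group_size
        acc = 0.0
        for i in range(start, start + group_size):
            acc += x[i] * (q[i] - zeros[g])   # zero subtract, then accumulate
        total += scales[g] * acc              # scale applied once per group
    return total
```

For example, the bytes 0x21 0x43 unpack to 1, 2, 3, 4. With group size 2, zeros [1, 3], scales [0.5, 2.0], and all-ones activations, the result is 0.5 * 1 + 2.0 * 1 = 2.5.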
Critical CPU Constraints
- Always use unaligned loads: All existing kernels use _mm512_loadu_* exclusively. Never use _mm512_load_*.
- Edge cases: When hidden_size % VEC_ELEM_NUM != 0, handle the tail with scalar or masked SIMD ops.
- FTZ/DAZ: Flush-to-zero and denormals-as-zero may be set by PyTorch. Do NOT assume IEEE 754 denormal behavior.
- OpenMP overhead: For small tensors, use adjust_num_threads(m) to reduce thread count. GEMM kernels use parallel_2d for 2D thread decomposition.
- bf16 precision: _mm512_dpbf16_ps accumulates in fp32 but inputs are bf16, so precision loss is expected. Use atol=1e-2 for correctness checks.
- Data alignment: Use alignas(64) for stack-allocated tile buffers to optimize cache-line access.
Full constraint list: correctness.yaml
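
To see why a tolerance around 1e-2 is reasonable for bf16 inputs, it helps to look at how much a single value moves when rounded to bf16. The sketch below emulates the rounding in pure Python by keeping the top 16 bits of the fp32 encoding with round-to-nearest-even. It is an illustration of the format, not the conversion code used by the kernels, and it does not handle NaN.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bf16 value (ties to even) and return it as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest even on the dropped 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent, and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rel_err(x: float) -> float:
    """Relative error introduced by rounding x to bf16."""
    return abs(to_bf16(x) - x) / abs(x)
```

bf16 keeps 8 significant bits, so the relative error per input is at most 2^-8 (about 0.4%). For example, to_bf16(3.14159) is 3.140625. Outputs of magnitude around 1 can therefore differ from an fp32 reference by a few 1e-3 even when the kernel is correct.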
Common Issues
| Issue | Symptom | Fix |
|---|---|---|
| Unaligned access | SEGFAULT or wrong results | Use _mm512_loadu_* instead of _mm512_load_* |
| Missing tail handling | Wrong results for non-aligned sizes | Add scalar loop for remainder elements |
| OpenMP on small tensor | Slower than baseline | Add if (num_tokens > threshold) guard |
| Wrong compiler flags | Intrinsics not recognized | Check build.toml cxx-flags matches code |
| Silent scalar at::vec | Kernel ~2x slow, no error; objdump shows 0 %zmm / nm shows expf@GLIBC | Define CPU_CAPABILITY_AVX512 for TUs using at::vec::Vectorized (see build_system.yaml) |
| CPUID detection wrong | Crashes on older CPU | Verify cpu_features.hpp checks OS support (XCR0) |
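
The tail-handling row above is the most common source of wrong results, so here is the index arithmetic spelled out. The sketch models a 16-lane fp32 loop in plain Python: full vectors first, then one masked step for the remainder. The helper names are hypothetical. In C++ the same mask would be a __mmask16 passed to a masked load such as _mm512_maskz_loadu_ps, which does not touch the masked-off lanes.

```python
VEC_ELEM_NUM = 16  # fp32 lanes in one 512-bit register


def split_row(hidden_size, vec=VEC_ELEM_NUM):
    """Return (full_vectors, tail_len, tail_mask) for one row."""
    full = hidden_size // vec
    tail = hidden_size % vec
    mask = (1 << tail) - 1  # low `tail` bits set: one bit per valid lane
    return full, tail, mask


def sum_squares(row, vec=VEC_ELEM_NUM):
    """Lane-wise sum of squares with a masked tail, as a SIMD loop would do it."""
    full, tail, mask = split_row(len(row), vec)
    acc = [0.0] * vec
    for i in range(full):
        chunk = row[i * vec:(i + 1) * vec]
        acc = [a + x * x for a, x in zip(acc, chunk)]
    if tail:
        base = full * vec
        for lane in range(vec):
            if (mask >> lane) & 1:  # only lanes inside the row contribute
                x = row[base + lane]
                acc[lane] += x * x
    return sum(acc)  # horizontal reduction across lanes
```

A row of 4100 elements gives 256 full vectors plus a 4-element tail with mask 0b1111. A row of 4096 has no tail and skips the masked step entirely.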
Project Structure
cpu-kernels/
├── SKILL.md # This file (skill definition + workflow)
├── manifest.txt # Files included in this skill
│
├── scripts/ # Standalone CLI tools
│ ├── analyze_op.py # PyTorch op → compute/memory analysis
│ ├── validate_cpu_kernel.py # Static checks on C++ kernel code
│ ├── benchmark_cpu.py # Correctness + performance via torch.utils.benchmark
│ ├── cpu_profiler.py # perf stat hardware counters + recommendations
│ ├── trial_manager.py # Tree-structured trial management
│ ├── config.yaml # Session config (max_trials, profiler, build)
│ └── config.py # Shared configuration loader
│
└── references/ # Knowledge base
    ├── correctness.yaml # Critical constraints for CPU kernels
    ├── runtime_dispatch.yaml # cpu_features.hpp + dispatch pattern
    ├── build_system.yaml # build.toml multi-target CPU compilation
    ├── simd_optimization_patterns.yaml # AVX2/AVX512 vector abstractions and patterns
    ├── quantized_gemm_patterns.yaml # LUT + tinygemm/brgemm template
    ├── brgemm_patterns.yaml # brgemm API, VNNI packing, tinygemm fallback (GEMM kernels only)
    ├── memory_patterns.yaml # Prefetch, alignment, cache blocking
    ├── threading_patterns.yaml # OpenMP parallel patterns
    ├── dtype_optimizations.yaml # bf16/fp8/int8 handling on CPU
    ├── optimization_levels.yaml # Progressive L1→L5 optimization checklist
    ├── implementation_reference.md # C++ kernel templates and examples
    ├── optimization_strategies.md # Strategy reference + decision tree
    ├── workflow_details.md # Detailed workflow reference
    └── huggingface-kernels-integration.md # HF kernels ecosystem integration guide
See Also
Tools
- analyze_op.py — Analyze PyTorch op characteristics
- validate_cpu_kernel.py — Static kernel validation
- benchmark_cpu.py — Correctness + performance measurement
- cpu_profiler.py — perf stat hardware counters
- trial_manager.py — Trial tree management
CPU Optimization References
- correctness.yaml — Critical constraints
- simd_optimization_patterns.yaml — SIMD patterns
- quantized_gemm_patterns.yaml — Quantized GEMM template
- optimization_levels.yaml — Progressive optimization
- implementation_reference.md — Code templates
External Resources
- Hugging Face Kernels — Kernel hub and builder CLI
- Intel Intrinsics Guide
- kernel-builder Documentation
- xpu-kernels skill — the Intel XPU Triton skill this workflow was adapted from
- Xe-Forge — the LLM-driven optimization framework the skill methodology originates from
Acknowledgments
The methodology of this skill — the YAML knowledge base, the benchmark/validation harnesses, and the branching trial-manager optimization loop — was adapted from the xpu-kernels skill built by a group of Intel AI researchers, the IntelLabs team behind Xe-Forge, where the methodology originates. Thanks to the original authors for a solid foundation to build on.