name: compute-llk-bringup
description: Specialization of /implement-mock for compute-kernel LLK shims (<op>_tile / <op>_tile_init under include/jit_hw/api/compute/). Use when bringing up a missing compute op or addressing its PCC failures.
Compute LLK bring-up
This skill is a specialization of /implement-mock: specifically,
Strategy A (stub in jit_hw/) applied to compute-kernel ops. The
mechanics here all assume that the silicon API lives at
tt_metal/hw/inc/api/compute/<path>.h and that the result is a header-only
shim under include/jit_hw/api/compute/.
For mock work outside compute (NOC, dataflow, dispatch, fabric, tensor),
use /implement-mock for the broader strategy-picking flow. For batch
bring-up (≥4 shims at once), see /parallel-mock-implementation.
When to invoke
User says one of:
- "bring up an LLK" / "bring up the X op"
- "write a shim for X" / "add an emule shim for X"
- "address the PCC failures on X"
- "implement the missing X SFPU op"
Or you encounter a JIT compile error of the form
jit_compile_kernel: compiler failed (exit 256) for kernel: ...
where the underlying clang error is a missing <name>_tile symbol or an
api/compute/<name>.h not-found.
The bring-up loop
For each target op:
- Triage: does the shim already exist?
- Identify the target path.
- Read upstream — lock signatures and semantics.
- Write the shim using the standard pattern.
- Wire it in (if needed).
- Rebuild + run targeted tests.
- Promote passes into scripts/run_ttnn_pytests.sh. Commit.
- PCC failures: investigate up to ~3 iterations, then defer.
Step 1 — Triage
Before writing anything, check whether the op is already covered. Many "missing" ops are in shared shims. The fastest check is the early-detect probe:
python3 scripts/find_symbol.py --supports <op>_tile
# <op>_tile tile layer1 include/jit_hw/api/compute/... → already shimmed, STOP
# <op>_tile tile needs_stub - → genuinely missing, bring it up
--supports reads the canonical .claude/references/structure.yaml index
(regenerated + check-gated by scripts/gen_structure.py — so it can't drift) and
tells you both whether the symbol exists and the verdict a kernel using it
lands at. Plain scripts/find_symbol.py <op>_tile (or a grep of the index) also
works to see which file defines it. For the wider "what should we bring up
next?" worklist, scripts/classify_kernels.py rolls every kernel up to
layer1 / needs_stub / ruled_out.
Common shared homes (not exhaustive):
- include/jit_hw/api/compute/eltwise_unary/activations.h: small, closely-related per-element ops grouped together. If the op is here, a separate per-op file is redundant and will cause an ODR conflict.
- include/jit_hw/api/compute/compute_kernel_api.h: single-tile fused ops that upstream's compute_kernel_api.h exposes as catch-alls (sign, sigmoid, silu, exp2, expm1, log, square, power, topk stub, etc.).
- include/jit_hw/api/compute/eltwise_unary/relu.h: the RELU_FAMILY (relu, relu_max, relu_min, leaky_relu) shares one file.
- include/jit_hw/api/compute/eltwise_unary/sfpu_split_includes.h: the wiring file; lists which SFPU_OP_<NAME>_INCLUDE branches are active.
If your op is already covered, stop. You'd be duplicating.
Step 2 — Target path
Shims mirror upstream paths under tt_metal/hw/inc/api/compute/:
| Upstream | Emule target |
|---|---|
| tt_metal/hw/inc/api/compute/eltwise_unary/<name>.h | include/jit_hw/api/compute/eltwise_unary/<name>.h |
| tt_metal/hw/inc/api/compute/<name>.h (fused / multi-tile) | include/jit_hw/api/compute/<name>.h |
| Functions in upstream activations.h | Add to emule eltwise_unary/activations.h in place |
| Functions in upstream compute_kernel_api.h | Add to emule compute_kernel_api.h in place |
Don't invent paths. If upstream has no header for the op (e.g. some
activations are composed in-kernel from primitives, no standalone <op>_tile),
no shim is needed. Verify the ttnn op handler decomposes it via SFPU_OP_CHAIN
rather than calling a standalone <name>_tile.
Step 3 — Read upstream
Three references per op, in order of priority:
- Upstream header at the matching path. Lock the signature(s) verbatim: parameter names, types, defaulted args.
- LLK implementation at tt_metal/tt-llk/tt_llk_wormhole_b0/llk_lib/llk_math_eltwise_unary_sfpu_<name>.h (or, for actual numerics, tt_metal/hw/ckernels/wormhole_b0/metal/llk_api/llk_sfpu/ckernel_sfpu_<name>.h). Source of truth for the math formula: polynomial coefficients, Cody-Waite reductions, saturation branches.
- A reference emule shim for the boilerplate. Pick by category if possible; see .claude/references/structure.yaml for the catalog. Useful representatives:
  - per-element stateless: eltwise_unary/relu.h
  - thread_local state: eltwise_unary/dropout.h
  - int32 DST: any eltwise_unary/bitwise_*.h
  - in-place temp-buffer: transpose_wh.h
Hard input cap: ~3 files. If you find yourself reading 4+ host-side files
or wandering into tt_metal/llrt/, tt_metal/impl/dispatch/, or
tt_metal/soc_descriptors/ — stop and reclassify the task. Host code is out
of shim scope.
Step 4 — The shim template
Don't write a shim from scratch — pick an existing one whose pattern matches
and copy its skeleton. The standard pieces (#pragma once, SPDX header,
#include "jit_hw/api/compute/common.h", namespace ckernel { ... },
ALWI void <name>_tile_init(...) {}, __emule_dst_check(idst, "<name>_tile"),
for (i = 0; i < __EMULE_TILE_ELEMS; i++) over __emule_dst[idst][i]) are
the same across all of them — they're already correct in the existing files
and stay correct as common.h evolves.
Pick by pattern where you can; an exact match will not always exist.
.claude/references/structure.yaml catalogs every shim under
include/jit_hw/api/compute/ with its top-level symbols. Grep there to find
the closest match to your target op's shape, then open that file as the
boilerplate template.
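As a rough illustration of the skeleton those standard pieces form, here is a minimal self-contained sketch of a per-element stateless shim. Everything emule-specific is replaced by a labeled stand-in: `kTileElems`, `dst`, `dst_check`, the `ALWI` definition, the slot count, and the op name `example_relu` are all hypothetical and only mimic the roles of `__EMULE_TILE_ELEMS`, `__emule_dst`, `__emule_dst_check`, and a real `<name>_tile`. A real shim gets these from common.h, so copy an existing file rather than this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for what common.h provides in a real shim.
#define ALWI inline
constexpr std::size_t kDstSlots = 16;        // assumed slot count
constexpr std::size_t kTileElems = 32 * 32;  // stand-in for __EMULE_TILE_ELEMS (assumed 32x32)
static float dst[kDstSlots][kTileElems];     // stand-in for __emule_dst

// Stand-in for __emule_dst_check: reject out-of-range DST slots.
inline void dst_check(std::uint32_t idst, const char* /*who*/) {
    assert(idst < kDstSlots);
}

namespace ckernel {

// Init is an empty function in the standard pattern.
ALWI void example_relu_tile_init() {}

// Per-element stateless op: clamp negatives in DST[idst] to zero, in place.
ALWI void example_relu_tile(std::uint32_t idst) {
    dst_check(idst, "example_relu_tile");
    for (std::size_t i = 0; i < kTileElems; i++) {
        float& v = dst[idst][i];
        v = v > 0.0f ? v : 0.0f;
    }
}

}  // namespace ckernel
```

The shape to notice is: empty init, slot check first, then one loop over the tile that reads and writes the same DST slot.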
Some common patterns and representative shims:
| Pattern | Representative shim(s) |
|---|---|
| Per-element stateless float math | eltwise_unary/relu.h, eltwise_unary/cbrt.h, eltwise_unary/mish.h |
| Per-element with fp32 bit-pattern params (decode via std::memcpy) | eltwise_unary/threshold.h, eltwise_unary/elu.h, eltwise_unary/hardtanh.h |
| Int32 DST (via __emule_dst_load_i32 / __emule_dst_store_i32) | eltwise_unary/bitwise_and.h, eltwise_unary/left_shift.h |
| Thread-local PRNG / per-call state | eltwise_unary/dropout.h (xorshift32) |
| Per-tile accumulator across calls | cumsum.h, cumprod.h |
| Multi-tile inputs (one call reads several DST slots) | mask.h, welford.h, logsigmoid.h |
| In-place mutation requiring a temp buffer | transpose_wh.h, reshuffle.h |
| Composed re-export (no own logic) | softmax.h |
| Ported upstream polynomial (region split + Horner form) | eltwise_unary/i1.h, eltwise_unary/lgamma.h |
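The "fp32 bit-pattern params" row deserves a concrete picture, since it is easy to get wrong with a cast. The sketch below shows the `std::memcpy` decode on its own. The helper names `decode_f32_bits` and `example_threshold`, and the threshold-style semantics, are hypothetical and not taken from any existing shim.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decode a uint32_t that carries the raw IEEE-754 bits of a float.
// std::memcpy avoids the undefined behaviour of a pointer-cast pun.
inline float decode_f32_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32-bit");
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Hypothetical per-element body for an op whose params arrive as encoded bits:
// keep x when it exceeds the threshold, otherwise substitute the given value.
inline float example_threshold(float x, std::uint32_t threshold_bits, std::uint32_t value_bits) {
    const float threshold = decode_f32_bits(threshold_bits);
    const float value = decode_f32_bits(value_bits);
    return x > threshold ? x : value;
}
```

For instance, `0x3F800000` decodes to `1.0f`. Decoding once before the element loop, rather than per element, keeps the loop body to plain float math.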
Keep comments brief. Per project convention (see CLAUDE.md): default to
no comments, write one only when the why is non-obvious. Concretely for
shims:
- The file path already tells the reader what's being shimmed, so there is no need to spell out "Intercepts the upstream include path which pulls in <llk_header>.h (an LLK-only header that references SFPU intrinsics)".
- Don't restate the math the loop body already does (no // p(t) = ((c2*t + c1)*t + c0) running commentary on a Horner expansion).
- Don't restate accessor semantics (no // Spill mean to DST[mean_dst_idx] next to a std::memcpy that obviously spills).
- DO keep: a one-line "what this op computes" or "what the encoded params mean" if the function signature alone is ambiguous; a single Real LLK: pointer to the silicon source-of-truth for math-heavy ports.
Target header docblock for a typical shim is 2–4 lines: one-line op
summary, one-line encoded-param note if any, one Real LLK: line.
Step 5 — Wire-up
If upstream has a corresponding SFPU_OP_<NAME>_INCLUDE guard in
tt_metal/hw/inc/api/compute/eltwise_unary/sfpu_split_includes.h, add the
same branch to emule's sfpu_split_includes.h:
#if SFPU_OP_<NAME>_INCLUDE
#include "api/compute/eltwise_unary/<name>.h" // or compute/<name>.h
#endif
If the upstream path is direct (no SFPU_OP gate) — e.g. api/compute/softmax.h,
api/compute/cumsum.h, api/compute/welford.h — no wiring needed; the JIT
include path resolves the emule version automatically when the kernel does
#include "api/compute/<name>.h".
Per the project rule in CLAUDE.md: when you add a source file or a
top-level symbol, refresh the index in the same change with
python3 scripts/gen_structure.py --write (don't hand-edit symbols — the
pre-commit hook / CI --check will fail if the index is stale). A brand-new
file also needs a one-line summary (the generator inserts a TODO: sentinel
until you write one).
Step 6 — Build + test
Build + base smoke test command lines live in BUILD_GUIDE.md.
Run the standard build, then a targeted pytest invocation for the op's test file.
Op → test-file mapping (typical):
| Op family | Test file under tests/ttnn/unit_tests/operations/ |
|---|---|
| activations (hardtanh, mish, threshold, swish, gelu, …) | eltwise/test_activation.py::test_<name> |
| math (cbrt, i0, i1, erfinv, digamma, polygamma, lgamma) | eltwise/test_math.py::test_<name> |
| elu, celu | eltwise/test_elu.py, eltwise/test_celu_21f.py |
| binary scalar (fmod, remainder, rsub) | eltwise/test_binary_composite.py, eltwise/test_binaryng_fp32.py |
| bitwise | eltwise/test_binary_int32.py |
| reduce / accumulation (cumsum, cumprod, mean, min, sum) | reduce/test_<name>.py |
| fused (softmax, layernorm) | fused/test_<name>.py |
Project conventions (from CLAUDE.md): wormhole N150, slow dispatch always.
Standard env vars and pytest invocation are in BUILD_GUIDE.md.
When debugging a failing kernel, set TT_EMULE_KEEP_JIT_SRC=1 and inspect
the kept patched_kernel.cpp / wrapper.cpp under /tmp/tt_emule_jit_*/.
Step 7 — Promote into the regression script
Edit scripts/run_ttnn_pytests.sh. Add one run_pytest line per
cleanly-passing function, placed near the relevant family of existing
entries:
run_pytest "elt_test_<name>" "$ELT_TEST_DIR/test_<group>.py::test_<name>"
Or with a parametrize filter:
run_pytest "elt_test_<name>" "$ELT_TEST_DIR/test_<group>.py::test_<name>" -k 'not sharded'
Caveats on pytest -k:
- Does NOT tokenize = or ::, so K=128 won't parse as a single identifier.
- DOES tokenize plain alphanumeric+underscore substrings.
- For --deselect, use rootdir-relative node IDs (tests/...), not absolute paths. Absolute paths silently fail to deselect.
Commit per bring-up with a concise message:
git add include/jit_hw/<files> .claude/references/structure.yaml scripts/run_ttnn_pytests.sh
git commit -m "<shim-name>: <one-line summary> (<n>/<n> pass)"
Step 8 — PCC failure triage
Hard cap: ~3 iterations per shim. Past that, defer with a written note.
When PCC fails, check in order:
- Verify upstream signatures literally before anything else. Open the upstream header and compare the <name>_tile signature byte-for-byte against the emule shim. Inline notes about parameter ordering can be wrong; the committed upstream code is the source of truth.
- Polynomial approximation gap. <cmath> ≠ upstream LLK. Read tt_metal/hw/ckernels/wormhole_b0/metal/llk_api/llk_sfpu/ckernel_sfpu_<name>.h for the actual coefficients. Port them; see "Porting upstream polynomials" below.
- Parameter ordering. Compare your shim's args against the upstream ALWI void <name>_tile(...) declaration in the upstream header.
- Stateful accumulator direction. For cumsum/cumprod/welford, check upstream's tile-row order convention (typically N×W×H). Flags like first may apply per (N, W) column-of-tiles rather than per-tile.
- Layout-specific issue. ROW_MAJOR vs TILE vs bf16 vs fp32 can hit different reader kernel paths. If only some dtypes fail, look at the reader kernel for that dtype's branch.
If none of those: defer with a one-paragraph note in the test entry's
comment in scripts/run_ttnn_pytests.sh and move on. PCC-fix work
deserves its own focused round.
Pass-through to upstream — delete the shim entirely
When the emule shim is a simplification (not silicon-specific), prefer to delete the shim and let JIT include resolution fall through to upstream's real header. Emule already provides the silicon-specific primitives (NOC encoding, bank tables, etc.); upstream's templated C++ adapts on top of them.
The pattern:
- Delete include/jit_hw/api/.../<thing>.h.
- Change one-line #include "jit_hw/..." references to #include "api/..." so JIT resolves through tt_metal/hw/inc/ instead.
- Define KERNEL_BUILD=1 in the JIT defines so upstream's #if defined(KERNEL_BUILD) branches pick the in-kernel codepath (forward decls of get_common_arg_addr, the InterleavedAddrGen-inheriting interleaved DRAM specialization, etc.).
- Make get_compile_time_arg_val<N> bounds-safe (return 0 for N >= size) so upstream's TensorAccessorArgs<CTA_OFFSET> constexpr parsing doesn't choke when the host emits fewer slots than the template scans.
- Add noc_traits_t<> specializations for the iterator-yielded types upstream produces (e.g. tensor_accessor::Page, ShardView<Accessor>): extract the NOC address, then route it through __emule_resolve_noc_addr for the host-pointer lookup.
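The bounds-safe lookup in the list above can be sketched as follows. This is only an illustration of the "return 0 past the end" idea under assumptions: the array `kCompileTimeArgs`, its contents, and the function name `example_get_compile_time_arg_val` are hypothetical, and the real emule mechanism for storing compile-time args may look quite different.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the compile-time args the host emitted.
constexpr std::array<std::uint32_t, 3> kCompileTimeArgs = {7, 11, 13};

// Bounds-safe lookup: indices past the emitted slots read as 0
// instead of failing constant evaluation.
template <std::size_t N>
constexpr std::uint32_t example_get_compile_time_arg_val() {
    if constexpr (N < kCompileTimeArgs.size()) {
        return kCompileTimeArgs[N];
    } else {
        return 0;
    }
}

// Both the in-range and the out-of-range case stay usable in constant expressions.
static_assert(example_get_compile_time_arg_val<1>() == 11);
static_assert(example_get_compile_time_arg_val<5>() == 0);
```

Using `if constexpr` keeps the out-of-range index from ever being evaluated, so a template that scans more slots than the host emitted still compiles.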
Watch out for non-power-of-2 bank counts. WH-N150 has 12 DRAM banks.
Upstream's interleaved_addr_gen::get_bank_offset_index<DRAM> uses
bit-shift when the count is a power of two (gated on
LOG_BASE_2_OF_NUM_DRAM_BANKS) and a fast-divide otherwise (gated on
IS_NOT_POW2_NUM_DRAM_BANKS). Emule must emit one of these defines —
see build_kernel_defines in
tt_metal/impl/emulation/emulated_program_runner.cpp.
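To see why the wrong define matters, the sketch below splits an interleaved page id into a bank index and an offset index for both kinds of bank count. It is a simplified model, not upstream's `get_bank_offset_index`: `split_page_id`, `BankSplit`, and the helper functions are hypothetical names, and a plain divide stands in for upstream's fast-divide.

```cpp
#include <cassert>
#include <cstdint>

// True when n is a non-zero power of two.
constexpr bool is_pow2(std::uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

// floor(log2(n)) for n >= 1; exact when n is a power of two.
constexpr std::uint32_t log2_floor(std::uint32_t n) {
    std::uint32_t r = 0;
    while (n > 1) {
        n >>= 1;
        ++r;
    }
    return r;
}

struct BankSplit {
    std::uint32_t offset_index;  // which stripe of pages across all banks
    std::uint32_t bank_index;    // which bank inside that stripe
};

// Hypothetical helper: split an interleaved page id by bank count.
template <std::uint32_t NumBanks>
constexpr BankSplit split_page_id(std::uint32_t page_id) {
    static_assert(NumBanks > 0, "need at least one bank");
    if constexpr (is_pow2(NumBanks)) {
        // Power-of-two count: shift and mask.
        return {page_id >> log2_floor(NumBanks), page_id & (NumBanks - 1)};
    } else {
        // Non-power-of-two count (e.g. 12): real divide and modulo.
        return {page_id / NumBanks, page_id % NumBanks};
    }
}
```

With 12 banks, page 25 lands at offset index 2, bank 1. Forcing the shift path on 12 banks would compute 25 >> 3 = 3 for the offset index, which is the kind of silent address error a missing or wrong define produces.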
Watch out for upstream kernels that include sharding_addrgen.hpp
(common via untilize/repeat_interleave/...). With KERNEL_BUILD
defined, an overload references InterleavedPow2AddrGenFast<DRAM> — add
it to include/jit_hw/internal/dataflow/dataflow_api_addrgen.h as a
pow2 variant of InterleavedAddrGenFast.
Separate kernel chain not shimmed
Some ttnn ops dispatch to alternate kernel implementations (moreh, fused,
deepseek, etc.) that live outside the standard eltwise_unary/ API surface.
These kernels often include headers under ttnn/cpp/ttnn/kernel/,
ttnn/cpp/ttnn/kernel_lib/, or arch-specific paths like
noc/noc_parameters.h (which on real silicon lives at
tt_metal/hw/inc/internal/tt-Nxx/<arch>/noc/).
Diagnostic pattern: the JIT compile fails not in eltwise_unary/<name>.h
but in some other kernel chain. Capture the failing kernel via
TT_EMULE_KEEP_JIT_SRC=1, look at the kept patched_kernel.cpp's
#include list, identify the missing header. Add a thin re-export shim
under include/jit_hw/ at the matching path.
Umbrella SFPU header (ckernel_sfpu.h)
Some kernels #include "ckernel_sfpu.h", the upstream catch-all of every SFPU
_calculate_* template; it's not on emule's include path. Do not add the
real LLK header — it double-declares symbols emule already provides (e.g.
topk_* from compute_kernel_api.h). Add a minimal include/jit_hw/ckernel_sfpu.h
with only the symbols the kernels actually consume (grep for sfpu:: uses), each
a no-op/forwarder per the reduce-jit_hw-surface rule.
Composed ops (no standalone upstream <name>_tile)
Some activation-style ops are not standalone SFPU primitives — they're
composed in-kernel from underlying primitives. If a sub-agent returns
STUCK because "upstream has no <name>_tile," that's correct behavior.
Diagnosing a composed-op failure: when the JIT compile fails on
<name>_kernel.cpp, check whether the kernel calls into primitives we
already have. If yes, the gap is in the primitive (or in
compute_kernel_api.h's coverage of it) — not a missing <name>_tile.
Typical fix: add the primitive name to compute_kernel_api.h so the
catch-all include surfaces it.
Porting upstream polynomials (PCC triage)
When <cmath> numerics drift past atol/ULP tolerance, port the exact
polynomial form from
tt_metal/hw/ckernels/wormhole_b0/metal/llk_api/llk_sfpu/ckernel_sfpu_<name>.h.
Process:
- Open the upstream ckernel_sfpu_<name>.h. Identify regions (range splits, saturation thresholds), coefficient arrays (Horner-form polynomial), and boundary branches.
- Copy the coefficient constants verbatim as constexpr float. Don't round, don't reformat.
- Translate the SFPI vector eval (sfpi::vFloat, v_if/v_endif) to scalar per-element math: if (cond) {…} else {…} inside the __EMULE_TILE_ELEMS loop.
- Use only <cmath> builtins (std::exp, std::log, std::sqrt, std::fabs, std::nearbyint, std::ldexp, std::copysign). No platform intrinsics.
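The process above can be sketched in scalar form as below. The coefficients, the saturation threshold, and the names `example_poly` and `example_poly_tile` are placeholders invented for illustration; they do not come from any upstream `ckernel_sfpu_<name>.h`, and a real port copies upstream's constants verbatim.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Placeholder coefficients for illustration only.
constexpr float kC0 = 1.0f;
constexpr float kC1 = 1.0f;
constexpr float kC2 = 0.5f;
constexpr float kSaturateAbove = 4.0f;    // hypothetical saturation threshold
constexpr float kSaturatedValue = 13.0f;  // hypothetical saturated result

// Scalar form of a v_if / v_endif region split plus a Horner evaluation.
inline float example_poly(float x) {
    const float t = std::fabs(x);
    if (t >= kSaturateAbove) {
        // Saturation branch (a v_if block in the vector code).
        return kSaturatedValue;
    } else {
        // Horner form of c2*t^2 + c1*t + c0.
        return (kC2 * t + kC1) * t + kC0;
    }
}

// Apply the scalar function across a tile, as the element loop of a shim would.
inline void example_poly_tile(float* tile, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        tile[i] = example_poly(tile[i]);
    }
}
```

Keeping the branch structure one-to-one with the vector code makes it easier to diff the port against upstream when PCC still disagrees.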
A polynomial port is not guaranteed to pass PCC. If a new port regresses
cases that were passing under a simpler <cmath> formulation, the new form
has a bug (often in the reconstruction step) or differs from upstream
somewhere subtle. Revert before chasing. Don't replace a working simple
form unless tests are demonstrably failing on it.
Stateful ops & first flag — confirm the shim is even used
A shim file's existence doesn't imply ttnn uses it. The host op may preprocess inputs (permute, reshape) so a different kernel chain runs.
Diagnostic: before debugging a stateful-op shim's math, dump the JIT
wrapper (TT_EMULE_KEEP_JIT_SRC=1), grep the kept patched_kernel.cpp
for the shim function name. If it's not there, the failure is upstream
of your shim — chase the actual primitives the kernel does call.
For ops that DO use stateful per-tile primitives (PRNGs, per-tile
accumulators), the thread_local accumulator pattern is correct. But
verify before assuming.
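For reference, a thread_local PRNG of the xorshift32 kind mentioned for dropout.h can be sketched like this. It is a generic illustration rather than the contents of dropout.h: the function names, the default seed, and the zero-seed handling are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Per-thread PRNG state. The default seed is a hypothetical choice.
inline std::uint32_t& example_rng_state() {
    thread_local std::uint32_t state = 1u;  // xorshift32 state must be non-zero
    return state;
}

// Seed the calling thread's generator; zero is remapped because
// an all-zero xorshift state never leaves zero.
inline void example_rng_seed(std::uint32_t seed) {
    example_rng_state() = seed != 0u ? seed : 1u;
}

// One xorshift32 step using the 13/17/5 shift triple.
inline std::uint32_t example_rng_next() {
    std::uint32_t x = example_rng_state();
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    example_rng_state() = x;
    return x;
}
```

Because the state is thread_local, each thread that calls into the shim advances its own sequence, so results depend on which thread runs the kernel and on how often it has been called since seeding.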
Anti-patterns (consolidated)
- Don't read host code. tt_metal/llrt/, tt_metal/impl/dispatch/, tt_metal/soc_descriptors/: none of these are relevant to a shim. If you find yourself there, the failure isn't a shim gap.
- Don't use repeated JIT cache clears + restarts as a substitute for understanding the failure. Each rm -rf /tmp/tt_emule_jit_cache_$(id -u) should be deliberate (e.g. after a shim edit), not a cargo-cult retry.
- Don't duplicate shims that activations.h or compute_kernel_api.h already define. ODR conflicts result. Check first with scripts/find_symbol.py --supports <op>_tile (layer1 + a path = already shimmed) or plain grep of .claude/references/structure.yaml.
- Don't invent upstream signatures. If upstream has no header, return STUCK or ask. Some activation-family ops are composed in-kernel and have no standalone API.
- Don't iterate PCC failures more than a few times. ~3 attempts max per shim, then document and move on.
Batch mode
For sweeps that bring up ≥4 shims at once, dispatch one sub-agent per
shim via /parallel-mock-implementation. The orchestrator (you)
handles sfpu_split_includes.h wiring, refreshing the index
(python3 scripts/gen_structure.py --write — don't hand-edit symbols), build,
and per-op test runs centrally after the workers return.
References
- /implement-mock: the broader strategy-picking flow this skill specializes (Strategy A for compute shims).
- /parallel-mock-implementation: Workflow-tool dispatch pattern for batch shim authoring.
- /index-based-ops: bring-up playbook for value+index ops (TopK / Sort / Argmax), covering the values-exact + gather-cosine test contract, the per-column sort axis, and the CB/unpack infra gaps they surface.
- .claude/references/structure.yaml: file-level index of src/ + include/ with top-level symbols. Grep first when triaging.
- .claude/references/structure.yaml: the authoritative index of what's currently shimmed (every include/jit_hw/api/compute/ file + its <op>_tile symbols).
- See /implement-mock References for the broader project conventions (CLAUDE.md, BUILD_GUIDE.md).