name: uplift
description: Diagnose and fix emule regressions caused by bumping the tt-metal pin (and its bundled tt-umd submodule). Use when a pin bump turns the C++ or TTNN regression red (device-open crashes, JIT-compile errors, hangs, or new data mismatches) and you need to prove the cause and land a faithful emule-side fix.
Uplift: surviving a tt-metal / tt-umd pin bump
When the tt-metal pin moves forward (e.g. a companion change merges to main, or
you re-pin to a newer commit), you import two moving parts at once: the
tt-metal source and the tt-umd submodule it points at. Either can regress the
emule build because emule substitutes its own runtime (SWEmuleChip,
emulated_program_runner.cpp) and JIT kernel shims (include/jit_hw/) for the
real hardware path. This skill is the methodology for turning a red post-bump
regression into a proven root cause and a minimal, faithful fix.
The governing principle, learned the hard way: prove it by building and
running, not by reading git log. Pickaxe/git show/blame routinely mislead
(moved code, squash merges, shallow submodule object DBs that render a commit as
a whole-tree add). A bisection that compiles and runs an oracle is the only
authority.
When to invoke
- A pin bump (or tt-metal-pin.txt change) makes a previously-green regression fail.
- Symptoms after a bump: device-open aborts (SIGABRT/SIGSEGV during start_device), jit_compile_kernel: compiler failed (use of undeclared identifier ...), a deterministic hang, or new ATOL/PCC mismatches across many unrelated ops.
- You suspect the bundled umd bump rather than tt-metal source, and need to isolate which.
For a single missing compute shim (one <op>_tile), use /compute-llk-bringup.
For data-corruption on a kernel that runs without crashing, use /memory-debug.
This skill is for broad regressions that a pin bump introduced.
Step 0 — Read the failure correctly, don't trust the label
- Inspect artifacts, not the check's green/red. A C++ regression job can show "pass" while its result artifact shows aborts — known-failure allowlists and classify steps can absorb failures, and exit codes don't always propagate. Open the actual run log/artifact and read the per-test verdicts.
- Classify the failure mode before chasing; each has a different root-cause path:
  - device-open crash (fails in the device fixture / start_device, before any op) → a shared runtime/reset/topology path, almost always umd or the emule chip.
  - JIT-compile error (undeclared identifier, header not found) → a jit_hw shim drift; the kernel can't compile against emule's headers.
  - hang → often a JIT-infra issue (e.g. forking a compiler from a many-threaded process), not the op.
  - data mismatch → op-correctness; usually /memory-debug territory.
- Per-arch scope. Supported arches gate the decision; an out-of-scope arch is typically allowlist-gated and expected-broken — don't let its churn block a bump, but don't silently regenerate its allowlist over a real new regression either.
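The classification above can be sketched as a small log triage helper. This is a minimal illustration, not the project's tooling: the regular expressions are assumptions derived from the symptoms named in this document, and the `timed_out` flag is a hypothetical input supplied by whatever harness ran the test.

```python
import re
from enum import Enum


class FailureMode(Enum):
    DEVICE_OPEN_CRASH = "device-open crash"
    JIT_COMPILE_ERROR = "JIT-compile error"
    HANG = "hang"
    DATA_MISMATCH = "data mismatch"
    UNKNOWN = "unknown"


# Assumed signatures, taken from the symptoms listed in this document.
_JIT = re.compile(r"jit_compile_kernel: compiler failed|undeclared identifier")
_CRASH = re.compile(r"SIGABRT|SIGSEGV")
_MISMATCH = re.compile(r"\b(ATOL|PCC)\b")


def classify_failure(log_text: str, timed_out: bool = False) -> FailureMode:
    """Map one test's log to a failure mode; UNKNOWN means read it by hand."""
    if timed_out:
        return FailureMode.HANG
    if _JIT.search(log_text):
        return FailureMode.JIT_COMPILE_ERROR
    # A crash only counts as device-open if it happened in start_device.
    if _CRASH.search(log_text) and "start_device" in log_text:
        return FailureMode.DEVICE_OPEN_CRASH
    if _MISMATCH.search(log_text):
        return FailureMode.DATA_MISMATCH
    return FailureMode.UNKNOWN
```

Run it over the per-test logs in the artifact rather than over the job summary, since the summary is exactly the label this step says not to trust.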
Step 1 — Build a fast, deterministic oracle
Pick the smallest, fastest test that reproduces the failure with an
unambiguous signal, and wrap it so it emits exactly GOOD / BAD / SKIP:
- GOOD = clean pass.
- BAD = the specific signature (e.g. a particular abort message + address, or the exact compiler error). Match the signature, not just "non-zero exit", so an unrelated failure can't masquerade as the bug.
- SKIP = build break or any other outcome → maps to exit code 125, which git bisect run treats as "skip this commit".
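A minimal sketch of that three-way verdict, assuming the oracle's combined output and exit code are already captured. `BAD_SIGNATURE` is a placeholder to be replaced with the exact abort message or compiler error being chased; the function name is hypothetical.

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`

# Placeholder signature: substitute the exact message being bisected.
BAD_SIGNATURE = "jit_compile_kernel: compiler failed"


def verdict(build_ok: bool, returncode: int, output: str,
            signature: str = BAD_SIGNATURE) -> int:
    """Return GOOD, BAD or SKIP for one oracle run."""
    if not build_ok:
        return SKIP  # build break: this commit cannot be judged
    if signature in output:
        return BAD   # the specific bug, even if the exit code did not propagate
    if returncode == 0:
        return GOOD  # clean pass
    return SKIP      # some other failure: do not let it masquerade as the bug
```

Checking the signature before the exit code is a deliberate choice here: it keeps a run that printed the abort but still exited 0 from being counted as GOOD.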
Step 2 — Bisect: isolate metal-source vs the umd bump
The pin range moves both metal and the umd gitlink, so bisect in two stages.
- Metal bisect. Run git bisect start <new-pin> <old-pin> in tt-metal, then git bisect run a script that, per commit: updates the umd submodule to that commit's recorded gitlink, builds only the oracle's target (not the whole tree), runs the oracle, and returns 0/1/125.
  - If the first-bad commit changes only the umd gitlink, the regression rides with the umd bump; proceed to the umd bisect.
  - If it's a metal-source commit, the cause is in tt-metal; triage that commit.
- UMD bisect. Pin metal at the first-bad (umd-bump) commit, then run git bisect start <bad-umd> <good-umd> inside the submodule and bisect with the same oracle, varying only umd. Build breaks on old-umd-vs-newer-metal become SKIP.
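The hand-off between the two stages can be decided mechanically from the paths the first-bad commit touched (for example the output of `git diff --name-only <culprit^> <culprit>`). The sketch below assumes the umd submodule lives at `tt_metal/third_party/umd`; confirm the real path in `.gitmodules`. The function name and return labels are hypothetical.

```python
# Assumed submodule path; check .gitmodules in the tt-metal checkout.
UMD_PATH = "tt_metal/third_party/umd"


def next_stage(changed_paths: list[str], umd_path: str = UMD_PATH) -> str:
    """Decide what to do with the first-bad metal commit."""
    if not changed_paths:
        raise ValueError("first-bad commit changed no paths")
    if all(path == umd_path for path in changed_paths):
        return "umd-bisect"  # only the gitlink moved
    # Any source change, including a commit that also bumps umd,
    # is treated as a metal commit to triage in this sketch.
    return "triage-metal-commit"
```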
Bisection rules that keep it honest:
- Hold the emule layer fixed (the jit_hw headers and any local emule source) for the whole bisect — the variable under test is metal+umd, exactly what the bump changed.
- Build the minimal target so each step is a fast incremental relink.
- Automate with git bisect run so you don't hand-judge steps.
- Restore all repos (git bisect reset, re-checkout the intended pin, resync the submodule) when done.
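One possible shape for the per-commit script that `git bisect run` drives is sketched below. The build target, oracle binary and signature are placeholders, not names from the repository, and the command runner is injectable so the decision logic can be exercised without a checkout.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], tuple[int, str]]

# Placeholder commands: substitute the real minimal target and oracle test.
SYNC = ["git", "submodule", "update", "--init", "--recursive"]
BUILD = ["cmake", "--build", "build", "--target", "oracle_target"]
ORACLE = ["./build/oracle_test"]
BAD_SIGNATURE = "jit_compile_kernel: compiler failed"


def shell_runner(cmd: Sequence[str]) -> tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def bisect_step(run: Runner = shell_runner) -> int:
    """Sync umd to the recorded gitlink, build, run the oracle; return 0/1/125."""
    for cmd in (SYNC, BUILD):
        code, _ = run(cmd)
        if code != 0:
            return 125  # cannot build this commit: skip it
    code, output = run(ORACLE)
    if BAD_SIGNATURE in output:
        return 1
    return 0 if code == 0 else 125
```

To use it, end the script with `sys.exit(bisect_step())` and pass the script to `git bisect run`. Because the emule layer is held fixed, nothing in the script should touch the jit_hw headers or local emule sources.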
Step 3 — Prove the boundary and the mechanism
- Controlled before/after: build+run the oracle at culprit^ (GOOD) and culprit (BAD) with everything else fixed. That's the proof.
- Read the mechanism from a two-tree diff, not git show. git show <culprit> can render a whole-tree add in a shallow/grafted submodule; use git diff <culprit^> <culprit> -- <file> between two checked-out trees.
- Read the upstream PR the culprit came from. It is the authoritative spec for any emule-side mirror; its diff tells you the exact old→new API or behavior mapping to replicate.
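When the submodule's object database is too shallow for `git diff` to be trusted either, the same two-tree comparison can be done on plain files. The sketch below is a stand-in for `git diff <culprit^> <culprit> -- <file>`, not a replacement for it; it assumes both commits are checked out side by side (for example with `git worktree add`), and the function name is hypothetical.

```python
import difflib
from pathlib import Path


def two_tree_diff(good_tree: str, bad_tree: str, relpath: str) -> str:
    """Unified diff of one file between the culprit^ and culprit checkouts."""
    old = Path(good_tree, relpath).read_text().splitlines(keepends=True)
    new = Path(bad_tree, relpath).read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(
        old, new,
        fromfile=f"culprit^/{relpath}",
        tofile=f"culprit/{relpath}",
    ))
```

An empty result means the file is identical in both trees, so the mechanism lies in a different file.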
Step 4 — Design the fix (two recurring classes)
Most uplift regressions are one of these.
A. Behavioral regression in a shared runtime path
Upstream changed a path that emule used to neutralize. Example shape: a device bring-up step that used to dispatch through a per-chip virtual the SW chip overrode as a no-op now issues an unconditional register write that the emule memory model can't satisfy.
Fix principles:
- Preempt at the right layer, using an existing mode/chip-type signal the code already branches on. Gate strictly on the emulated case so silicon / simulation / other modes are untouched.
- No magic constants when a canonical accessor exists; prefer "skip the operation for this mode" over "match a hardcoded address".
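The real fix for this class lives in C++, but the design choice can be modelled in a few lines. In this sketch `ChipType`, `bring_up_step` and `write_register` are all hypothetical stand-ins: the point is only that the guard keys on the existing mode signal and skips the operation, rather than matching an address.

```python
from enum import Enum, auto
from typing import Callable


class ChipType(Enum):  # hypothetical stand-in for the existing mode signal
    SILICON = auto()
    SIMULATION = auto()
    EMULATED = auto()


def bring_up_step(chip_type: ChipType, write_register: Callable[[], None]) -> bool:
    """Issue the bring-up register write unless running emulated.

    Returns True if the write was issued.
    """
    if chip_type is ChipType.EMULATED:
        return False  # skip the operation for this mode only
    write_register()  # silicon and simulation paths are untouched
    return True
```

A guard of the form "skip if the address falls in some range" would also silence the crash, but it would swallow genuine faults in other modes, which is why the mode check is preferred.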
B. API-surface drift in the JIT kernel shims (jit_hw)
Upstream renamed/refactored a kernel-facing API (an enum collapse, a template
signature change, a moved header) and the emule shim still exposes the old
surface, so kernels fail to JIT-compile with undeclared identifier.
Fix principles:
- Mirror the upstream change exactly in the shim — same enum/struct/signature surface — using the upstream PR diff as the spec.
- Preserve emule's bodies (host memcpy / no-op / emulated sync). Only the option/signature surface changes; the emulated semantics stay.
- Migrate all dependents in the same surface (the header's call sites, sibling headers that reference the renamed symbols) so nothing dangles. Leave genuinely independent/older surfaces (e.g. a separate experimental/ API with its own self-contained definitions) alone.
- Keep .claude/references/structure.yaml in sync when symbols are added/removed.
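To check that no dependent still references a renamed symbol, a scan along these lines can help. It is a sketch under assumptions: the list of old names comes from the upstream PR diff, only `.h` and `.hpp` files are considered, and the function name is hypothetical.

```python
import re
from pathlib import Path


def find_stale_symbols(root: str, old_names: list[str]) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, symbol) for each leftover old name."""
    if not old_names:
        return []
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, old_names)) + r")\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in {".h", ".hpp"}:
            continue
        lines = path.read_text(errors="replace").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for match in pattern.finditer(line):
                hits.append((path.relative_to(root).as_posix(), lineno, match.group(1)))
    return hits
```

Point it at the shim directory being migrated (for example include/jit_hw/) and expect an empty list once the migration is complete. Hits under an independent surface that is intentionally left alone should be reviewed by hand rather than fixed blindly.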
Step 5 — Wire the fix through the repo chain
A fix can span tt-umd → tt-metal → tt-emule. Order and push mechanics matter:
- Push dependencies before dependents. If the metal commit bumps the umd gitlink to a fix commit, that umd commit must be on its remote first, or the submodule pointer dangles for CI.
- Never push without explicit go-ahead; for tt-emule's own PR, bump tt-metal-pin.txt to the (pushed) metal fix commit, using the full SHA.
- jit_hw edits need no C++ rebuild. Shims are compiled at JIT time, so a jit_hw change takes effect on the next kernel compile: clear the JIT cache and re-run; iteration is fast.
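Two of these mechanics are easy to check before asking for the go-ahead. The sketch below assumes tt-metal-pin.txt holds a single 40-character SHA-1 commit hash, and both function names are hypothetical.

```python
import re

PUSH_ORDER = ("tt-umd", "tt-metal", "tt-emule")  # dependencies first
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def push_plan(changed_repos: set[str]) -> list[str]:
    """Order the repos that carry part of the fix, dependencies first."""
    unknown = changed_repos - set(PUSH_ORDER)
    if unknown:
        raise ValueError(f"unknown repos: {sorted(unknown)}")
    return [repo for repo in PUSH_ORDER if repo in changed_repos]


def is_full_sha(pin_text: str) -> bool:
    """True if the pin is a full 40-hex-digit SHA, not an abbreviation."""
    return _FULL_SHA.fullmatch(pin_text.strip()) is not None
```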
Step 6 — Verify recovery and prove no regression
- Oracle before/after on the built fix: the BAD signature is gone.
- Broad sweep: run one test per previously-failing file. The headline invariant is that the crash-signature count goes to zero; PASS count rises with no new failures.
- Triage every residual into exactly one bucket: (a) caused by the fix — must be zero; (b) the original bug — must be zero; (c) a separate pre-existing gap that was merely unmasked once earlier crashes stopped. A fix that doesn't touch op math can't cause a data mismatch — reason from that.
- Don't conflate a sweep with the curated suite. A one-test-per-file sweep runs the first collected test in each file, which is usually not what the curated regression script selects (it uses -k / ::node subsets). Before treating a sweep failure as real, check whether the curated script even exercises that node:
  - Map the failing node to the script's exact run_pytest args and resolve membership with --collect-only.
  - Validate your collection-parsing first. Verbose pytest config (-vvs) makes --collect-only print a <Function ...> tree, not file::node lines; grep for the <Function ...> reprs, and confirm your grep matches a test you know is selected before trusting any zero counts. (Class-based tests are Class::method; a bare file::method nodeid will error.)
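The membership check, including the "validate the parser first" guard, might look like the sketch below. It assumes the tree form where each `<Module ...>` line carries the file path and nesting is shown by indentation; newer pytest versions may nest directory nodes above the module, which this sketch ignores. Both function names are hypothetical.

```python
import re

_NODE = re.compile(r"^(\s*)<(\w+) (.+)>\s*$")


def collected_nodeids(collect_output: str) -> list[str]:
    """Rebuild file::Class::test IDs from a --collect-only <Function ...> tree."""
    stack: list[tuple[int, str, str]] = []  # (indent, kind, name)
    ids = []
    for line in collect_output.splitlines():
        match = _NODE.match(line)
        if not match:
            continue
        indent, kind, name = len(match.group(1)), match.group(2), match.group(3)
        while stack and stack[-1][0] >= indent:
            stack.pop()  # leave nodes that are not ancestors of this line
        if kind == "Function":
            parents = [n for _, k, n in stack if k in ("Module", "Class")]
            ids.append("::".join(parents + [name]))
        else:
            stack.append((indent, kind, name))
    return ids


def count_matches(collect_output: str, needle: str, known_selected: str) -> int:
    """Count collected IDs containing needle, after proving the parser works."""
    ids = collected_nodeids(collect_output)
    if not any(known_selected in nodeid for nodeid in ids):
        raise RuntimeError("known-selected test not found; zero counts are not trustworthy")
    return sum(needle in nodeid for nodeid in ids)
```

Feed it the output of the curated script's exact pytest arguments plus `--collect-only`. A count of zero for the failing node then means the curated suite never runs it, so the sweep failure belongs in the pre-existing bucket rather than against the uplift.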
Step 7 — De-scope and track what isn't yours
- A residual that is a separate pre-existing gap (e.g. a missing compute-LLK shim, a known SIGFPE, an op-correctness mismatch) should be filed and tracked, not used to block the uplift, and not "fixed" by widening the uplift fix's scope.
- If the uplift's root cause is upstream-intended (a legitimate refactor), the emule side is what must adapt: the fix belongs in the emule layer (umd SW chip / jit_hw), not a revert.
- Note any cross-arch implication (a shim/runtime fix for the supported arch often also addresses the same root cause flagged for an out-of-scope arch) but don't chase the out-of-scope arch in the same change.
Anti-patterns (consolidated)
- Concluding from git log -S / git show / blame. They mislead on moved code, squashes, and shallow submodules. Bisect (build+run) is authoritative.
- Trusting a green CI label. Read the artifact; allowlists hide aborts.
- A range-based guard that swallows real errors. Gate on the specific case/mode, not "anything that would fault".
- Intercepting low-level I/O to dodge a high-level behavior change. Preempt the operation at the layer that owns it.
- Hardcoding magic values when a canonical accessor or a mode flag exists.
- Pushing a dependent before its dependency (metal gitlink before the umd commit it references).
- Reading a one-test-per-file sweep as the curated suite. Check --collect-only membership, with a validated parser.
References
- /compute-llk-bringup: for a single missing compute <op>_tile shim (the common Class-B fix when the drift is one op).
- /memory-debug: for data-corruption residuals (Class-C, op math).
- docs/api-injection-points.md: where emule intercepts the pipeline (helps choose the "right layer" in Step 4A).
- references/emule-mapping.md: HW concept → emule strategy vocabulary.
- BUILD_GUIDE.md: build/test setup, the targets to build for an oracle, and the regression scripts.