name: parity-rng-triage
description: Use this skill when debugging C-vs-JS gameplay parity drift in session replay, especially first RNG/event/screen divergence localization and core-JS behavioral fixes.
Parity RNG Triage
When To Use
Use this for session parity failures where gameplay diverges between C and JS:
- RNG mismatch
- Event log mismatch
- Screen mismatch with matching RNG/events
Inputs Expected
- Session path (test/comparison/sessions/...)
- Current branch/commit
- Latest failure output from session_test_runner.js
Workflow
- Pull first — other agents push frequently; start from current main:
git pull --rebase origin main
- Check issue labels before diving into a session. Your agent label is determined by your working directory name (e.g. mazesofmenace/ux → agent:ux). Divergences in dochug/monmove/makemon are often labeled agent:game (game engine agent). Don't work on issues labeled for other agents; check that the label matches your directory before starting.
- Survey all failing sessions with the PES report (most informative view):
  - scripts/run-and-report.sh: runs all gameplay sessions, then shows a color-coded table of PRNG/Event/Screen first-divergence step per session. --failures filters to failing rows only. --why adds AI diagnosis labels. If this script is unavailable in your checkout, run npm test, then node scripts/pes-report.mjs.
  - node scripts/pes-report.mjs: instant replay of last results without re-running.
  - npm test: runs all test categories (unit/chargen/map/gameplay/special) and prints first-divergence JSON blobs for failing sessions. Note: npm test runs 34 gameplay sessions; the pre-push hook runs all 150.
- Pick a session and reproduce with verbose output:
    node test/comparison/session_test_runner.js --verbose <session-path>
  - node --test test/comparison/sessions.test.js: runs the full session comparison suite and prints a pass/fail table with first-divergence JSON for each failing session. Faster feedback than npm test for gameplay-only iteration.
  - node scripts/run-test-gates.mjs <seed>: alternative that shows per-step failure details for a specific seed across all test categories.
- If RNG diverges, localize first mismatch window:
    node test/comparison/rng_step_diff.js <session-path> --step <N> --window 8
  - node scripts/comparison-window.mjs <session-path> --channel rng --view all --raw-step gives the practical three-layer view:
    - normalized first-divergence window,
    - gameplay-filtered raw for the same step,
    - full raw for the same step.
  - Useful variants:
    - --view normalized for the concise headline
    - --view filtered-raw --raw-step for boundary/ordering bugs
    - --view raw --raw-step when you suspect the filter is hiding signal
    - --channel event --view normalized for event-first localization
  - Current limitation: raw step drilldown is implemented for RNG artifacts. Event artifacts still need normalized-to-raw step mapping before raw event views are trustworthy.
  - Note: rng_step_diff.js replays one step in isolation; use it as a microscope only. Treat session_test_runner.js --verbose as authoritative for true first divergence.
- For render-side RNG visibility (hallucination glyph/name drift), enable display-stream logs with caller tags:
    RNG_LOG_DISP=1 RNG_LOG_DISP_CALLERS=1 RNG_LOG_TAGS=1 node test/comparison/session_test_runner.js --sessions=<session-path> --verbose
  - Keep this diagnostic off by default; C sessions usually do not include display-stream RNG entries.
- To capture C display-stream entries for apples-to-apples analysis, rerecord with:
    NETHACK_RNGLOG_DISP=1 python3 test/comparison/c-harness/rerecord.py <session.json>
  - Use this only for diagnostic sessions (often in /tmp) unless you intend to update fixtures.
- Capture rich state snapshots around divergence with debug mapdump:
    node test/comparison/dbgmapdump.js <session-path> --steps <N> --window 1
  - Inspect with:
      diff -u step00NN_raw*.mapdump step00MM_raw*.mapdump
      rg -n '^(U|A|M|N|K|J)' <mapdump-file>
  - Use this when screen/RNG evidence is insufficient and you need direct monster/object/trap/hero state at exact replay steps.
- For monster-generation / adj_lev() / newmonhp() seams, enable:
    WEBHACK_MAKEMON_TRACE=1 node test/comparison/session_test_runner.js --verbose <session-path>
  - This logs the exact newmonhp() inputs:
    - monster index/name
    - base monster level
    - passed depth
    - current u.ulevel
    - computed adjusted level
    - in_mklev flag
    - live/generation dungeon context
  - Use it when branch-depth or special-level context is suspected to be wrong.
- Confirm expected behavior in C source:
  - Use nethack-c/patched/src/ as the primary reference. It is nethack-c/upstream/ plus all instrumentation patches (RNG logging, event tracing, harness hooks) that make session recording possible.
  - The ultimate goal is matching vanilla upstream NetHack behavior, but patched/ is the measurable target: it's what generated the sessions.
- Patch JS core behavior to match C semantics.
- Re-run the same session, then a targeted set:
    node test/comparison/session_test_runner.js --verbose <session-path>
    node test/comparison/session_test_runner.js --type gameplay --sessions=<seedA,...>
- Record durable learning in docs/LORE.md.
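
The localization idea used throughout the workflow (find the first mismatch, then look at a small window around it) can be sketched as follows. This is a minimal illustration, not the logic of rng_step_diff.js or comparison-window.mjs; the function name firstDivergence and the entry strings are hypothetical.

```javascript
// Sketch only: entry strings below are made-up examples, not real log output.
// Returns the first index where two logs disagree plus a small window of
// surrounding entries from each side, or null when the logs are identical.
function firstDivergence(cLog, jsLog, window = 2) {
  const shared = Math.min(cLog.length, jsLog.length);
  let index = 0;
  while (index < shared && cLog[index] === jsLog[index]) index++;
  // Equal length and a fully matching prefix means there is no divergence.
  if (index === shared && cLog.length === jsLog.length) return null;
  const from = Math.max(0, index - window);
  const to = index + window + 1;
  return { index, c: cLog.slice(from, to), js: jsLog.slice(from, to) };
}

const cLog = ['rn2(10)=3', 'rn2(4)=1', 'rnd(6)=5', 'rn2(2)=0'];
const jsLog = ['rn2(10)=3', 'rn2(4)=1', 'rn2(20)=7', 'rn2(2)=0'];
console.log(firstDivergence(cLog, jsLog, 1));
```

Everything after the reported index is a downstream cascade, so only the entry at that index and its immediate neighbors are worth reading first.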
Guardrails (Non-Negotiable)
- Do not add comparator exceptions/masking to hide mismatches.
- Preserve the C execution model:
  - single-threaded gameplay flow,
  - one active input owner at a time,
  - no gameplay reentrancy,
  - no queue/continuation tricks that reorder command vs monster work.
- Do not add replay compensation logic in js/replay_core.js:
  - no synthetic queueing
  - no deferred/auto key injection
  - no auto-dismiss for prompts
  - no timing compensation that changes semantic input stream
- Do not solve timing bugs by inventing non-C scheduling mechanisms:
  - no deferred continuation tokens
  - no parallel prompt/input owners
  - no "resume later" machinery that changes who owns the next key
- Do not "fix" parity by modifying session expectations to match JS output.
- Fix behavior in core JS game logic to match C.
- Do not overfit to one seed: before committing, validate the fix on at least 1-2 additional nearby gameplay sessions.
Quick Triage Heuristics
- If RNG diverges first: find the first branch/function-call mismatch and fix that root cause.
- If RNG/events match but screen diverges: inspect message timing/capture boundaries, animation boundaries, and display-state updates.
- If one step is short and the next step has a matching surplus in the same event families, treat it as cross-step boundary drift first:
  - Compare per-step counts with node scripts/comparison-window.mjs <session> --step-summary --step-from <N> --step-to <M>.
  - Then inspect the first bad step with node scripts/comparison-window.mjs <session> --view filtered-raw --raw-step. This is the default drilldown because normalized views can hide intra-step ordering bugs while full raw is often too noisy.
  - Confirm conservation by event family (for example, test_move, movemon_turn, dog_*, runstep) across adjacent steps.
  - If conserved, avoid changing gameplay logic first; adjust capture timing and rerecord with a targeted pause at the boundary key.
- If RNG/events match 100% but mapdump section M fails with mhp=0 entries on the session side: C's fmon list retains dead/failed monsters until dmonsfree(), which runs after harness_auto_mapdump(). This is a C harness artifact, not a game logic bug; filter mhp=0 from both lists before comparing.
- Prefer earliest shared drift signal over downstream cascades.
- For lower-overhead RNG logs during triage:
  - RNG_LOG_PARENT=0 shortens caller tags.
  - RNG_LOG_TAGS=0 disables caller tags entirely.
  - RNG_LOG_DISP=1 logs JS display RNG calls as ~drn2(...).
  - RNG_LOG_DISP_CALLERS=1 appends caller tags to ~drn2(...) entries.
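
The mhp=0 heuristic above (drop dead entries from both sides, then compare) might look like the sketch below. The entry shape { id, mhp } and both function names are hypothetical stand-ins for parsed mapdump section M rows; the real mapdump format is described in docs/DBGMAPDUMP_TOOL.md.

```javascript
// Sketch only: { id, mhp } is an assumed shape for a parsed section M row.
// Drops entries with mhp === 0, which the C side can still list until
// dmonsfree() runs.
function withoutDeadMonsters(monsters) {
  return monsters.filter((m) => m.mhp !== 0);
}

// Compares the two sides after removing mhp=0 entries from both lists.
function sameLiveMonsters(cSide, jsSide) {
  const a = withoutDeadMonsters(cSide);
  const b = withoutDeadMonsters(jsSide);
  return a.length === b.length &&
    a.every((m, i) => m.id === b[i].id && m.mhp === b[i].mhp);
}

const cSide = [{ id: 1, mhp: 5 }, { id: 2, mhp: 0 }, { id: 3, mhp: 7 }];
const jsSide = [{ id: 1, mhp: 5 }, { id: 3, mhp: 7 }];
console.log(sameLiveMonsters(cSide, jsSide));
```

This belongs in ad hoc analysis only, and only for this dead-monster artifact; it is not a license to add comparator masking for real mismatches.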
Unit Tests
Run after any core JS change and after every git pull:
- node scripts/test-unit-core.mjs: fast unit-only run (~8s), clean pass/fail count. Prefer this over npm test during iteration.
- Upstream pulls sometimes rename fields (e.g. flee → mflee, sleeping → msleeping); unit tests that hardcode old field names will fail and need updating.
Debug Mapdump Notes
- dbgmapdump captures replay-time compact mapdumps without mutating fixtures.
- For syntax and interpretation details, see docs/DBGMAPDUMP_TOOL.md.
Comparison Window Best Practices
- Start with normalized output to find the first actionable divergence index.
- Immediately switch to gameplay-filtered raw for the same step when:
  - the normalized mismatch is inside prompt/input completion,
  - totals are conserved across adjacent steps,
  - you suspect nested --More--, yn, direction, travel, or occupation boundaries,
  - monster-turn work appears redistributed within one command.
- Escalate to full raw only after filtered raw, and only for the same step or a very small step range. Full raw is for confirming a suspected missing/extra low-level entry, not for first localization.
- Do not treat normalized parity as proof of correctness at tricky boundaries. A session can have normalized agreement while still hiding step-local work redistribution bugs that later become RNG or screen drift.
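
The "totals are conserved across adjacent steps" check can be made concrete with a small sketch. It assumes each step is available as an array of event names; that shape, the function names, and the dog_* grouping rule are illustrative assumptions, not the behavior of comparison-window.mjs.

```javascript
// Sketch only: steps are assumed to be arrays of event-name strings.
function countByFamily(events, familyOf) {
  const counts = new Map();
  for (const name of events) {
    const family = familyOf(name);
    counts.set(family, (counts.get(family) || 0) + 1);
  }
  return counts;
}

// True when every event family has the same combined count over the given
// adjacent steps on both sides, even if the per-step split differs.
function conservedAcrossSteps(cSteps, jsSteps, familyOf) {
  const cTotal = countByFamily(cSteps.flat(), familyOf);
  const jsTotal = countByFamily(jsSteps.flat(), familyOf);
  if (cTotal.size !== jsTotal.size) return false;
  for (const [family, n] of cTotal) {
    if (jsTotal.get(family) !== n) return false;
  }
  return true;
}

// Hypothetical family rule: collapse all dog_* events into one family.
const familyOf = (name) => (name.startsWith('dog_') ? 'dog_*' : name);

const cSteps = [['test_move', 'movemon_turn', 'dog_move'], ['test_move']];
const jsSteps = [['test_move'], ['movemon_turn', 'dog_move', 'test_move']];
console.log(conservedAcrossSteps(cSteps, jsSteps, familyOf));
```

When the totals are conserved but the per-step split differs, the work has moved across a step boundary rather than changed, which points at capture timing or ordering before gameplay logic.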
Rerecord Timing Advisory
- When capture timing needs to wait for C to finish a complex key (for example, "_" travel confirmation at "."), encode the pause in-session:
  - per-step: steps[i].capture.key_delay_s
  - or global regen override: regen.key_delays_s (1-based gameplay step map)
- Then rerecord via:
    python3 test/comparison/c-harness/rerecord.py <session.json>
- Keep this for true capture-boundary timing only; do not use it to mask core gameplay logic mismatches.
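
As a rough illustration of the per-step form, the sketch below sets steps[i].capture.key_delay_s on an in-memory session object. Only that field path comes from this document; the helper name withKeyDelay, the key field on each step, the 0.5 second value, and the use of a 0-based array index are assumptions to verify against a real session file.

```javascript
// Sketch only: the session shape beyond steps[i].capture.key_delay_s is assumed.
// Returns a copy of the session with a capture delay set on one step,
// leaving the original object untouched.
function withKeyDelay(session, stepIndex, seconds) {
  const steps = session.steps.map((step, i) =>
    i === stepIndex
      ? { ...step, capture: { ...(step.capture || {}), key_delay_s: seconds } }
      : step
  );
  return { ...session, steps };
}

// Hypothetical two-step session: travel command, then confirmation.
const session = { steps: [{ key: '_' }, { key: '.' }] };
const patched = withKeyDelay(session, 1, 0.5);
console.log(JSON.stringify(patched.steps[1]));
```

After editing the session file this way, rerecording with rerecord.py is still required for the delay to take effect in the captured C output.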
When a Fix Unmasks New Regressions
When a C-faithful foundational fix (e.g., bitfield semantics, call-chain ordering) causes previously-passing sessions to fail, treat those as newly exposed bugs, not reasons to revert the fix.
- Hold the known-correct fix. Do not revert it to quiet tests.
- Instrument surgically: temporary opt-in traces behind env flags.
- Fix the exposed bug, not the symptom. Re-run immediately.
- Remove temporary instrumentation after diagnosis.
- Broaden validation (nearby seed batch) before committing.
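
One way to keep temporary instrumentation opt-in is to gate it on an environment flag, in the spirit of WEBHACK_MAKEMON_TRACE=1. The sketch below is an assumption about how such a gate could be written, not the project's actual tracing code; makeTracer and WEBHACK_EXAMPLE_TRACE are hypothetical names.

```javascript
// Sketch only: WEBHACK_EXAMPLE_TRACE is a made-up flag name for illustration.
// Returns a trace function that emits only when env[flagName] is '1'.
function makeTracer(flagName, env = process.env, sink = console.error) {
  const enabled = env[flagName] === '1';
  return (label, details) => {
    if (!enabled) return false; // silent unless explicitly enabled
    sink(`[${flagName}] ${label} ${JSON.stringify(details)}`);
    return true;
  };
}

const trace = makeTracer('WEBHACK_EXAMPLE_TRACE');
trace('newmonhp', { depth: 3, ulevel: 1 });
```

Because the tracer is silent by default, a forgotten call site does not change normal output, but the instrumentation should still be removed once the diagnosis is done.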
Gates for landing a foundational fix:
- Invariant gate — C-faithful rule remains in place
- No-mask gate — no comparator/replay masking added
- Regression gate — target session non-regressing vs pre-fix
- Breadth gate — nearby seeds show no severe new regressions
- Cleanliness gate — temporary instrumentation removed
Persistence on Hard Bugs
When working through a burndown where all issues must pass:
- Stay on the problem. Rebuilding context is more expensive than persistence. Don't context-switch to easier bugs hoping they'll be faster.
- Collect evidence, don't theorize. Add targeted traces, check actual values, compare C vs JS at specific points.
- Think in invariants. What should always be true? When is the first time a desired invariant is falsified?
- When it seems hardest, the answer is close. Most possibilities have been eliminated. Look more carefully at evidence already collected.
- Use the trace-before-theorize skill when stuck.
Done Criteria
- First divergence is eliminated or moved later with evidence.
- Target failing session is green or measurably improved.
- No harness/comparator/replay compensation hacks were introduced.
- docs/LORE.md updated with what changed and why.
Caching Note
analyze_golden.js caches results keyed on commit hash. Results won't refresh
until you commit — if you need to see whether an uncommitted change affects
session results, use node --test test/comparison/sessions.test.js directly
(it always runs live against current code).
Commit/Push Cadence
- Once a regression fix is verified (target session and relevant targeted checks), commit promptly.
- Push validated increments promptly to keep other agents synchronized.
- Do not leave validated fixes stranded locally for long-running batching.
- If push fails, resolve and retry until successful.