replay-stall-diagnose

star 1

Use this skill when a session replay hangs or times out and you need systematic CPU/live-lock diagnosis with repeatable artifacts.

davidbau

By davidbau schedule Updated 3/6/2026

play_arrow Run Skill in Manus View GitHub

name: replay-stall-diagnose description: Use this skill when a session replay hangs or times out and you need systematic CPU/live-lock diagnosis with repeatable artifacts.

Replay Stall Diagnose

When To Use

Use this when session_test_runner times out or appears stuck, especially when:

no RNG/event comparisons are produced (rng=0/0, events=0/0)
runs stall before clear first divergence evidence
ad-hoc logging is too noisy to identify root cause

Goal

Localize the hottest runtime path causing timeout/live-lock and convert a generic timeout into actionable evidence.

Primary Tool

scripts/replay_stall_diagnose.mjs

It wraps session_test_runner with Node CPU profiling and writes:

run.log (full run output)
summary.txt (top self-sample functions/files)
summary.json (machine-readable hotspot summary)

Standard Workflow

Reproduce timeout quickly on one target session:

node scripts/replay_stall_diagnose.mjs \
  --session seed325_knight_wizard_gameplay \
  --timeout-ms 12000 \
  --top 20

Read summary.txt first:

Confirm the timeout line matches the target failure.
Identify dominant self-sample frames (usually top 1-3 functions).
Identify dominant files (e.g. vision.js, monmove.js, hack.js).

Form a root-cause hypothesis from hot frames:

Vision-heavy hotspot (right_side, q4_path): suspect map/FOV loops, repeated scans, or runaway traversal.
Monster-move hotspot (movemon, dog_move, dochug): suspect movement loop invariants or repeated no-progress turns.
Input/prompt hotspot (nhgetch, pendingPrompt paths): suspect unresolved waits, prompt loops, or boundary bugs.

Add narrow tracing only where hotspot points:

Avoid global logging first.
Use one or two focused env traces per run (for example command, run, or monmove traces).

Re-run the same diagnose command after each fix candidate.

Compare whether dominant hotspot percentage drops or shifts.
Keep the same session and timeout while iterating for consistent comparison.

Practical Tips

Start with shorter timeouts (--timeout-ms 6000 to 12000) for faster iteration.
Run 2-3 repeated profiles before concluding a hotspot is stable.
Prioritize self samples over inclusive stack intuition; they indicate where CPU is actively spent.
Ignore small GC/native noise unless it dominates.
If no .cpuprofile is generated, treat as setup/runtime failure and fix that first.

Differential Profiling Pattern

Use A/B profiling to validate fixes:

Capture baseline profile + summary.
Apply one focused code change.
Capture post-change profile with same session/timeout.
Compare:
- timeout still present or cleared,
- top function/file percentages,
- whether hotspot moved to a new stage.

Guardrails

Do not add comparator exceptions to hide timeout/divergence.
Do not add replay-core synthetic input/auto-dismiss/timing compensation.
Keep fixes in core gameplay/runtime logic and preserve C-faithful behavior.

Output to Share in Issue Updates

Include:

exact command used
timeout line from run.log
top 3 functions and top 3 files from summary.txt
root-cause hypothesis
next narrow instrumentation/fix step

Done Criteria

Timeout converted into a concrete, repeatable hotspot diagnosis.
A targeted fix is implemented and validated, or a bounded follow-up issue is filed with profiling evidence.
No harness/replay masking hacks introduced.

Install via CLI

npx skills add https://github.com/davidbau/menace-classic --skill replay-stall-diagnose

Repository Details

star Stars 1

call_split Forks 1

navigation Branch main

article Path SKILL.md

More from Creator

davidbau

davidbau Explore all skills →