---
name: parish-engine
description: Drive the Rundale engine to see gameplay actually working (the script harness, feature proofs, autonomous play-tests, eval rubrics/baselines, the live LLM auto-player, browser UI sessions, and GUI screenshots). Use after changing world, movement, NPC, input, inference, UI, or mod code, and whenever you need to prove a gameplay change is live rather than that tests pass.
disable-model-invocation: false
argument-hint: '[harness|prove|play|rubric|demo|browser|screenshot] [scenario or feature]'
---
One skill for every way of running the engine to observe behaviour. Pick the section that matches the
job; they share a build step (cargo build first) and the same --script JSON harness underneath.
| You want to… | Section |
|---|---|
| Run fixture scripts and check JSON output | Script harness |
| Prove a specific feature is live at runtime | Prove a feature |
| Autonomously explore / stress the game | Exploratory play-test |
| Get a deterministic, no-LLM regression sensor | Eval rubrics & baselines |
| See a feature in the live Tauri app (dialogue, streaming, IPC) | Auto-player / demo |
| Click through the web UI in a real browser | Browser UI session |
| Regenerate GUI screenshots | GUI screenshots |
/prove, /play, and /rubric form a ladder: rubric (deterministic, machine-checked) → prove (you read
the JSON) → play (autonomous exploration). Demo and browser sessions cover paths the headless harness can't
reach (Tauri IPC, streaming, the rendered UI). None of them replace /check.
Script harness
Run fixture scripts through --script mode, which emits structured JSON per command.
- Build first: `cargo build`.
- Run the script:
  - Specific fixture: `cargo run -- --script <script-file>`
  - Default walkthrough: `cargo run -- --script testing/fixtures/test_walkthrough.txt`
- Inspect the JSON line by line. Check that:
  - Movement results have valid `to` locations and reasonable `minutes`
  - Look results contain non-empty descriptions
  - System commands return expected, non-empty responses
  - No unexpected errors, panics, or `"result": "unknown_input"`
- Run additional fixtures if the default passed:
  - `cargo run -- --script testing/fixtures/test_movement_errors.txt`
  - `cargo run -- --script testing/fixtures/test_commands.txt`
  - `cargo run -- --script testing/fixtures/test_speed.txt`
- Report which scripts passed and flag any anomalies.
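The line-by-line inspection can be partly mechanised. The sketch below is illustrative only: it assumes each output line is a JSON object and that movement results carry `to` and `minutes` keys and look results carry a `description` key. The `result` values `moved` and `looked` and the location name `Crossroads` are made up for the sample; check the real output shape before relying on it.

```python
import json


def check_script_output(lines):
    """Return (line_number, problem) tuples for suspicious JSON lines."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        if record.get("result") == "unknown_input":
            problems.append((number, "unknown_input"))
        # Assumed movement shape: a "to" string plus a "minutes" number.
        if "to" in record or "minutes" in record:
            if not record.get("to"):
                problems.append((number, "movement without a destination"))
            minutes = record.get("minutes")
            if not isinstance(minutes, (int, float)) or minutes <= 0:
                problems.append((number, "movement with non-positive minutes"))
        # Assumed look shape: a "description" string.
        if "description" in record:
            description = record["description"]
            if not isinstance(description, str) or not description.strip():
                problems.append((number, "empty description"))
    return problems


# Hypothetical sample lines, not real engine output.
sample = [
    '{"result": "moved", "to": "Crossroads", "minutes": 12}',
    '{"result": "looked", "description": ""}',
    '{"result": "unknown_input"}',
]
for number, problem in check_script_output(sample):
    print(f"line {number}: {problem}")
```

A checker like this only flags structural problems; whether a description reads well still needs a human (or agent) reading the output.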
Script command reference:
- Movement: `go to <location>`, `walk to <location>`
- Look: `look`, `look around`
- System: `/status`, `/time`, `/map`, `/npcs`, `/wait [N]`, `/tick`
- Persistence: `/save`, `/fork <name>`, `/load <name>`, `/branches`, `/log`
- Speed: `/speed fast`, `/speed normal`, `/speed slow`
- Control: `/pause`, `/resume`, `/new`
- Debug: `/debug`, `/debug npcs`, `/debug clock`, `/debug here`, `/debug schedule`
Location names come from /map or mods/rundale/world.json.
Prove a feature
Prove a gameplay feature works at runtime — not just that tests pass.
- Write a targeted script at `testing/fixtures/play_prove.txt` that exercises the feature from a player's perspective. Use `/wait` to advance time, move between locations, and use `/time`, `/status`, `/debug clock`, `/debug npcs`, `look`, `/npcs` to make the feature's impact visible.
- Run it: `cargo run -- --script testing/fixtures/play_prove.txt`
- Read the JSON critically. For each line ask:
  - Do values change when expected (weather transitions, NPC relocations)?
  - Do descriptions read naturally, grammatical and immersive to a player?
  - Does NPC behaviour respond correctly to the new feature?
  - Are any fields empty, nonsensical, or stuck at their initial value?
- Fix what you find. Common issues:
  - New tick/update logic added to `parish-server` and `headless` but not the test harness (`crates/parish-cli/src/testing.rs`). The harness has its own game loop in `advance_time()` and `Command::Tick`.
  - Large `/wait` jumps that call your logic once at the final timestamp instead of each intermediate step.
  - Template interpolation producing ungrammatical text when a new enum variant has a multi-word Display.
  - Features that silently no-op because a required field isn't wired up in a constructor.
- Re-run until the output proves the feature is live. Don't stop at "tests pass"; stop at "I can see it working in the game output."
- Report what you tested, what the output showed, and any fixes made.
Think like the player. Would someone who doesn't know the code understand what's happening? Would a game creator accept this output quality? If not, the feature isn't done.
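One of the questions above, whether a field is stuck at its initial value, is easy to miss by eye in a long run. As a rough aid, the sketch below scans parsed JSON lines for watched fields that show up more than once but never change. The field names `weather` and `time` are hypothetical; substitute whatever keys the feature under test actually emits.

```python
import json


def stuck_fields(records, watch):
    """Return watched fields that appear at least twice but never change value."""
    seen = {}
    for record in records:
        for field in watch:
            if field in record:
                seen.setdefault(field, []).append(record[field])
    return sorted(
        field
        for field, values in seen.items()
        if len(values) > 1 and all(value == values[0] for value in values)
    )


# Hypothetical output lines: the clock moves but the weather never does.
lines = [
    '{"weather": "clear", "time": "08:00"}',
    '{"weather": "clear", "time": "10:00"}',
    '{"weather": "clear", "time": "12:00"}',
]
records = [json.loads(line) for line in lines]
print(stuck_fields(records, ["weather", "time"]))
```

A field reported here is a prompt to look closer, not proof of a bug: some values legitimately stay constant over a short script.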
Exploratory play-test
Autonomous play-test to evaluate the gameplay experience.
- Build first: `cargo build`.
- Determine what to test:
  - If given a `.txt` path, use it as the script directly.
  - If given a scenario ("explore all locations", "talk to every NPC", "test time passage"), generate a script at `testing/fixtures/play_session.txt`.
  - If nothing is given, generate a comprehensive exploration script that checks `/status`, `/time`, `/map`, `/npcs`; visits every reachable location; uses `/wait` and `/tick` to advance time and observe schedule changes; tests `/save`, `/fork`, `/branches`, `/log`; and toggles `/speed fast` / `/speed normal`.
- Run it: `cargo run -- --script <script-file>`.
- Analyze the JSON per the same checks as the Script-harness section (movement, system, map, NPCs, time, wait, errors).
- Report a play-test summary: locations visited and whether descriptions generated, NPCs encountered and where, time/season progression, anomalies/bugs/missing features, and an overall assessment.
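When no script is supplied, the comprehensive exploration script can be generated rather than typed by hand. The sketch below assumes a fixture is plain text with one command per line, and that location names have already been collected from `/map` or `mods/rundale/world.json`. The function name, the `playtest` fork name, and the sample locations are all hypothetical.

```python
def build_exploration_script(locations, wait_amount=60):
    """Build an exploration fixture as text, one command per line (assumed format)."""
    lines = ["/status", "/time", "/map", "/npcs"]
    for name in locations:
        # Visit each location, then look around and list who is there.
        lines += [f"go to {name}", "look", "/npcs"]
    # Advance time and inspect schedules; the unit of /wait is whatever the engine uses.
    lines += [f"/wait {wait_amount}", "/tick", "/debug schedule"]
    # Exercise persistence commands ("playtest" is a made-up branch name).
    lines += ["/save", "/fork playtest", "/branches", "/log"]
    lines += ["/speed fast", "/speed normal"]
    return "\n".join(lines) + "\n"


# Hypothetical location names; real ones come from /map or world.json.
print(build_exploration_script(["Crossroads", "Church"]), end="")
```

To use it, write the returned text to `testing/fixtures/play_session.txt` and pass that path to `cargo run -- --script`.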
Eval rubrics & baselines
The deterministic half of the harness: snapshot baselines + structural rubrics, no human reading and no LLM judge required.
What this checks:
- Snapshot baselines. Each fixture in `BASELINED_FIXTURES` (`crates/parish-cli/tests/eval_baselines.rs`) is run through `run_script_captured`, serialized to JSON, and diffed against `testing/evals/baselines/<fixture>.json`. Any drift fails with a "live | baseline" diff window.
- Structural rubrics asserted on every baselined fixture:
  - `rubric_anachronisms_are_empty`: no `NpcResponse` surfaces anachronistic terms.
  - `rubric_movement_minutes_are_positive`: no `Moved` with `minutes == 0` (catches a frozen clock).
  - `rubric_look_descriptions_are_non_empty`: no `Looked` with empty description (catches a silent renderer failure).
Steps:
- Run the suite: `cargo test -p parish --test eval_baselines`.
- If a baseline test fails, read the "live | baseline" diff:
  - Unintentional drift (a regression): fix the code, rerun.
  - Intentional drift (you changed gameplay deliberately): confirm the new output is correct, run `just baselines` to regenerate, and review `git diff testing/evals/baselines/`.
- If a rubric test fails, the panic names the fixture, step, and canonical fix. Fix the code, rerun.
Adding a fixture to the baseline set:
- Confirm its output is deterministic: run it twice and `diff` the JSON. Differences in `new_log_lines` are fine (not part of `ScriptResult`); differences elsewhere are not.
- Add the stem to `BASELINED_FIXTURES` and a matching `#[test] fn baseline_<fixture>()`.
- Run `just baselines` to capture the initial JSON.
- Commit the test entry alongside its baseline.
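The determinism check in the first step means comparing two runs while ignoring `new_log_lines`. A plain `diff` cannot skip a key, so a small comparison helper can do it. This is a sketch under the assumption that each run's output was captured as one JSON value per line; because where `new_log_lines` sits in the output is not known here, the helper strips the key at any depth. The sample records are invented.

```python
import json


def strip_key(value, key="new_log_lines"):
    """Recursively drop `key` from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_key(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [strip_key(item, key) for item in value]
    return value


def same_output(first_run, second_run):
    """Compare two captured runs (JSON lines as text), ignoring new_log_lines."""
    def parse(text):
        return [strip_key(json.loads(line)) for line in text.splitlines() if line.strip()]
    return parse(first_run) == parse(second_run)


# Invented sample: the runs differ only in new_log_lines, so they count as equal.
run_a = '{"result": "moved", "minutes": 12, "new_log_lines": ["a"]}\n'
run_b = '{"result": "moved", "minutes": 12, "new_log_lines": ["b", "c"]}\n'
print(same_output(run_a, run_b))
```

If the helper reports a difference, fall back to a normal `diff` of the two files to see which field drifted.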
Use this after editing parish-world, parish-npc, parish-cli/src/testing.rs, or mods/rundale/;
before opening/updating a gameplay PR; or as a faster, lower-noise alternative to Prove a feature when
the change is structural.
Auto-player / demo
The LLM auto-player drives the live Tauri app — use it for paths the headless harness can't exercise: Tauri IPC, NPC dialogue quality, frontend streaming behaviour.
just demo 2 5 # 5 turns, 2s pause — fast smoke test
just demo 4 20 # 20 turns, 4s pause — content generation / sustained observation
just demo 3 # unlimited turns
Capture logs to read the transcript:
DEMO_LOG=$(mktemp) && just demo 2 5 > "$DEMO_LOG" 2>&1
grep -E "chat \[|demo turn|WARN" "$DEMO_LOG"
chat [player] / chat [npc] lines show the conversation; demo turn: LLM chose action shows the
auto-player's pick each turn.
Verify:
- Player actions are single-line natural language, with no reasoning preamble or JSON artifacts.
- NPC dialogue is Irish-authentic and responds to what the player actually said.
- `demo turn` fires each turn. If absent, the LLM call is hanging or failing.
- No `waitForFalse timed out` warnings, meaning streaming completed cleanly.
- The clock advances between turns, so the game is not paused.
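The countable parts of this checklist can be tallied from the captured log before reading the transcript. The sketch below searches only for the marker strings named in this section (`chat [player]`, `chat [npc]`, `demo turn`, `waitForFalse timed out`, `WARN`); the rest of the log format is unknown here, and the sample log is invented for illustration. Dialogue quality and clock progress still need reading.

```python
def summarize_demo_log(text):
    """Count the marker strings this section names in a captured demo log."""
    lines = text.splitlines()
    return {
        "player_lines": sum("chat [player]" in line for line in lines),
        "npc_lines": sum("chat [npc]" in line for line in lines),
        "demo_turns": sum("demo turn" in line for line in lines),
        "stream_timeouts": sum("waitForFalse timed out" in line for line in lines),
        "warnings": sum("WARN" in line for line in lines),
    }


def demo_log_problems(summary):
    """Turn the counts into the failure signals described in the checklist."""
    problems = []
    if summary["demo_turns"] == 0:
        problems.append("no 'demo turn' lines: the LLM call may be hanging or failing")
    if summary["stream_timeouts"] > 0:
        problems.append("waitForFalse timed out: streaming did not complete cleanly")
    if summary["npc_lines"] == 0:
        problems.append("no NPC chat lines")
    return problems


# Invented sample log; real lines will carry timestamps and other detail.
sample_log = (
    "chat [player] hello\n"
    "chat [npc] hi\n"
    "demo turn: LLM chose action\n"
    "WARN waitForFalse timed out\n"
)
summary = summarize_demo_log(sample_log)
for problem in demo_log_problems(summary):
    print(problem)
```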
Bugs demo surfaces that the headless prove misses: streaming freezes (input stays disabled, 30s timeout
fires), thinking blocks leaking into player actions, NPC saying nothing (429 rate limit or JSON field-name
typo), game clock paused throughout. Demo is observational — it surfaces live behaviour, it does not
assert. It does not replace just check, the prove flow, or Playwright.
just demo opens the MCP bridge on --mcp-port 3030, so while it runs (or after launching the Tauri app
yourself with cargo run -p parish-tauri -- --mcp-port 3030) you can drive and inspect the live game with
mcp__parish__*: parish_world_snapshot / parish_npcs_here / parish_submit_input to probe, and
mcp__parish__parish_file_bug to file any bug as a GitHub issue that auto-bundles a screenshot (a native
window capture — the minimap renders, #1160), recent logs, and game state. Dedup against open issues first
(gh issue list --search); set PARISH_BUG_REPORT_DRY_RUN=1 to write the report to disk instead of filing
while testing. For the full bug-hunting loop see /demo-audit.
Browser UI session
Interactive Chrome test against the web server via browser MCP tools. Follow the plan in
docs/plans/archive/chrome-test-plan.md.
Setup:
- Build frontend: `cd apps/ui && npm run build`.
- Check server: `curl -s -o /dev/null -w "%{http_code}" http://localhost:3001/`.
- Start if needed: `cargo run -- --web 3001` (background; wait for 200 on health check).
- Connect browser tooling via the available browser MCP tools. If no extension is connected, ask the user to enable it and retry.
- Navigate a tab to `http://127.0.0.1:3001`.
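"Wait for 200 on health check" amounts to a polling loop. A minimal sketch using only the Python standard library follows; the function names, timeout, and interval are illustrative choices, and the root URL is used as the health check because that is what the `curl` step above probes.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for `url`, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (urllib.error.URLError, OSError):
        return None


def wait_for_ok(url, timeout=60.0, interval=1.0, fetch=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200 or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if fetch(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

After starting `cargo run -- --web 3001` in the background, `wait_for_ok("http://localhost:3001/")` returns `True` once the server answers 200 and `False` if it never does within the timeout. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.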
Test execution — take screenshots at key points, track pass/fail:
- Required:
  - Page Load (status bar, map, NPCs sidebar, chat panel, input render)
  - Navigation (travel to ≥2 locations, verify map/status/NPCs update)
  - Edge Cases (invalid location, already-here, empty submit)
  - System Commands (`/help`, `/status`, `/pause`, `/resume`)
  - Console Check (read console for errors at start and end)
- If LLM provider configured in `.env`: NPC Conversation (streaming response); Irish Words (Focail panel populates); Idle Message (atmospheric message at empty location).
- If explicitly requested: Debug Panel tabs; Speed Commands; Theme Updates over time.
Reporting: print a pass/fail table; list bugs with repro steps; report console errors; check server logs.
If bugs are found, file them with mcp__parish__parish_file_bug (auto-bundles screenshot + logs +
state) — dedup against open issues first (gh issue list --search). Write results to
docs/reviews/chrome-testing-session.md (append a dated section if it exists).
GUI screenshots
Regenerate Rundale GUI screenshots after UI changes via headless Playwright — no X11/GDK required.
- Install deps (if needed): `cd apps/ui && npm install`.
- Capture: `cd apps/ui && npx playwright test e2e/screenshots.spec.ts`. This captures the GUI at 4 times of day (morning, midday, dusk, night) on headless Chromium with mocked Tauri IPC.
- Verify output: check `docs/screenshots/` for updated PNGs; list files and sizes.
- Report which were generated; remind the user to commit them alongside the UI change.
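The "list files and sizes" step can be done with `ls -l`, or with a short helper like the one below if the result feeds into a report. This is a sketch only: it assumes the PNGs sit directly in `docs/screenshots/` and that the helper runs from the repository root; the function name is hypothetical.

```python
from pathlib import Path


def list_screenshots(directory="docs/screenshots"):
    """Return sorted (file name, size in bytes) pairs for PNGs in `directory`."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted((path.name, path.stat().st_size) for path in folder.glob("*.png"))


for name, size in list_screenshots():
    print(f"{name}\t{size} bytes")
```

A zero-byte or missing file for one of the four times of day suggests the capture failed partway and should be rerun.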
Notes:
- To update visual-regression baselines: `npx playwright test e2e/screenshots.spec.ts --update-snapshots`.
- Full E2E suite: `npx playwright test` or `just ui-e2e`.