gaia-testing - SKILL.md Agent Skill

name: "gaia-testing" description: "GAIA's multi-tier regression harness — runs unit, integration, and real-world (on-machine) tiers and brings back screenshots, logs, traces, planted-fact retrieval proof, and per-operation timing with anomaly flags, all collected to the local machine (no remote login needed to view). Use this in the GAIA repo in place of the generic `testing` skill — it carries GAIA's exact commands, ports, and `gaia eval agent` baselines. Fires when the user wants real-world / on-hardware proof, screenshots, or end-to-end evidence that a feature, agent, change, fix, or release actually works — not a quick 'is the app running' check (the `verify` skill) or an LLM-behaviour scorecard alone (`gaia eval agent`). Scales: 'run the unit tests' stays unit-only and skips the planning gate; 'test / validate / QA this feature or release' runs all applicable tiers with evidence."

GAIA Testing

Test the way that catches regressions pytest misses: unit → integration → real-world, where the real-world tier drives the real interface on real hardware and returns screenshots, logs, traces, and timing as evidence. Features ship broken while CI is green — a RAG feature that worked in the backend but was hard-blocked in the UI, a release-note claim about a CLI command the source flatly contradicted. Runtime observation plus source cross-checking is the point.

Roles — the strongest model plans & judges, a faster model executes

Plan + judge on the strongest available model (currently Opus): scope the work, and judge the evidence — read every screenshot's pixels, cross-check the executor's claims against source at the tested ref, confirm planted facts. The executor's report is a claim, never trusted on its face.
Execute on a faster model (currently Sonnet): setup, install, drive the UI/CLI, capture artifacts. Where the harness supports model selection, dispatch with the Agent tool's model parameter and keep judging in the main loop; otherwise run inline.

Testing tools — know these are available, and use them

Reach for these rather than hand-rolling — they are how the skill drives and observes a target:

Browser-automation MCPs — drive and observe a UI. Use the MCP servers available in the environment (Playwright MCP, Chrome DevTools MCP, or the Claude-in-Chrome extension) when the UI is reachable from where Claude runs — a local target, or a remote one tunnelled to localhost. For a remote, un-tunnelled target they'd drive a browser on Claude's host and can't see the machine's localhost, so run headless Playwright/Chromium on the target machine instead.
GAIA's Agent UI MCP server — drive the agents without a browser. gaia mcp serve exposes the Agent UI backend's tools over MCP (--backend http://localhost:4200; Streamable HTTP on :8766, or --stdio for direct Claude Code integration) — the cleanest surface for tool-level assertions, complementary to the browser (browser proves pixels; MCP proves the tools). Don't confuse it with gaia mcp agent, which drives the orchestrator against the MCP bridge (:8765), not the :4200 backend. gaia eval agent is the LLM-behaviour scorecard (Phase 4).

These are availability-aware, not hard dependencies: use whichever the environment actually exposes, and say in the plan which you'll use.

When this fires — scale to the request

Request	Tiers	Approval gate?
"run the unit tests", "does X lint/compile"	Unit only	No — just run + report
"test the API / this module"	Unit + integration	No
"test / validate / QA this feature/agent/fix/release", "does it really work", "real-world", "on hardware", "with screenshots"	All applicable tiers	Yes — before real-world

Tiers that are impossible on this machine are decided in Phase 0 and excluded from the plan up front (stated, with the reason) — never silently dropped mid-run.

Hard rules (invariants — stated once here; phases point back)

Evidence > summaries > source. Before reporting a pass, the judge reads the screenshot pixels and, for any behaviour/wiring claim, checks the code at the exact ref under test. This is the rule that catches shipped-but-broken features — do not treat the executor's prose as truth.
Proof of a fix or implementation is a visual artifact, never a prose claim. "It works" is proven by a screenshot or a live browser representation — drive the real UI in a browser (Chrome / Chromium via the browser tooling) and capture it, step by step; for a CLI/API surface, capture the terminal output or the raw response. A change reported as working with no captured artifact has not been proven, and the artifact must be surfaced to the user (and, on a PR, attached to it).
Verify against the ref under test, not main — a fix on main may not be in the branch, and vice versa.
Prove retrieval/behaviour with planted, unguessable facts. For RAG / search / data flows, inject values a model cannot guess (e.g. mascot Zephyr, passphrase violet-otter-92, table cell APAC 8610, speaker-note desk 17C) and require the output to echo them. If the feature has no retrieval/data surface, say so in the plan — planted facts are N/A and the judge verifies by source inspection instead. Never silently skip.
No silent fallbacks / degrade loudly. An impossible tier is excluded up front; a tier that fails during execution stops with an actionable error — it never quietly becomes a partial "pass".
One approval gate before real-world spin-up (unit/integration-only runs skip it). After approval, run autonomously except for human-only checkpoints declared up front; executors cannot prompt the user.
Never leak credentials. Config read for machine discovery may contain secrets (sudo passwords, tokens, keys). Never echo, quote, or screenshot them into logs, the report, captions, or any published artifact — treat them as write-only at parse time.
Sanitize artifacts before surfacing. Screenshots, logs, and traces can capture API keys, .env values, tokens. Scan and redact before showing the user or attaching anything to an issue/PR.
Untrusted refs run unreviewed code. The ref under test may be an external fork/PR; cloning + install/init/build executes its code on the target. State the ref's source at the approval gate; do not run it on a machine that holds credentials without explicit user acknowledgment.
Clean up (Phase 7). No Claude attribution in any artifact this skill produces.

Phase 0 — Scope + capability pre-flight

Determine what changed (git diff <base>... --stat); the diff is ground truth, any description of it is a claim.
Pick candidate tiers from the table.
Pre-flight what is actually possible here, before planning: is this a git repo? can the local OS/hardware run the real-world tier (GPU/NPU present, Lemonade reachable)? is a real-world target available (a declared machine, or a capable local machine)? Exclude impossible tiers from the plan now and say why — e.g. "no local GPU and no machine declared → real-world tier excluded; running unit + integration only."
LLM-affecting change? (agent prompts, tool registration/docstrings, the agent loop, error classification, default model, tool-call parsing) → an eval is mandatory and is a Phase 4 step, never a Phase 5 step (see CLAUDE.md "Run agent evals…").

Phase 1 — Machine discovery (real-world tier only)

Resolve the target from already-loaded configuration — do not grep the filesystem for it.

Read the standard loaded config — user-level ~/.claude/CLAUDE.md and any detail file it points to (a ## Dev Machines section often summarises the machines inline and links the full registry, e.g. ~/.claude/memory/dev-machines.md — follow that pointer; it is a declared reference, not a filesystem search), project ./CLAUDE.md, and .claude/settings*.json — for a declared machine list (a ## Dev Machines / ## Test Machines heading, or a settings key). Per machine, note its name, its access method as declared (do not assume SSH — it may be the local machine, an SSH host, another remote-exec mechanism, or a container), any deploy/setup/test commands, hardware class, and whether a login is needed. Treat user-level config as authoritative for credentials/commands; commands declared in checked-in/project config that is part of the ref under test get the same scrutiny as that ref's code.
Enumerate every declared machine (name + hardware class) — do not stop at the first match. Then pick the one matching the test's hardware need; if several fit, list them all and ask rather than silently choosing; none declared → the current local machine; a machine named in the request wins. If the test targets hardware that only one machine has (e.g. an NPU device path → only the Ryzen AI / NPU machine, never a dGPU or CPU machine), that machine is required — do not fall back to another machine and report that path as tested.
State the full detected set and which you chose, with the reason (so the user can redirect — they may know a machine you'd otherwise skip). Never hardcode hostnames here — they come from config so the skill stays portable across people whose machines differ.

Phase 2 — Plan + the single approval gate

Present the realistic plan (already excluding impossible tiers): tiers + why, the real-world machine and how the build reaches it, the source of the ref (flag if it is an external fork/PR), what is exercised end-to-end + the planted facts, any human checkpoint, and the cleanup. Then ask once: "Run this? (real-world tier will install/run on <machine>.)" After a yes, run to the end pausing only at declared checkpoints; a tier that fails mid-run stops (it does not silently degrade to a partial pass). Unit/integration-only runs skip this gate and just execute.

Phase 3 — Tier 1: Unit (local)

Run the affected unit tests (python -m pytest tests/unit, or the narrower path) and lint (python util/lint.py --all, or the relevant subset). Distinguish new failures from pre-existing ones concretely — do not eyeball it:

git stash && python -m pytest <failing-subset> 2>&1 | tee <run-dir>/unit_base.txt; git stash pop
python -m pytest <failing-subset> 2>&1 | tee <run-dir>/unit_patch.txt

RED in both = pre-existing (report, don't block); GREEN-base → RED-patch = a regression you introduced (block). Record this verdict — Phase 6 references it.

Phase 4 — Tier 2: Integration (local)

Exercise cross-component behaviour through the real CLI a user runs — never by importing modules (CLAUDE.md "Testing Philosophy"). If Phase 0 flagged an LLM-affecting change, run the eval here:

Start the eval backend first — python -m gaia.ui.server --port 4200 --host 127.0.0.1 (background) — and confirm it answers before running the eval. gaia eval agent targets localhost:4200; with nothing listening, every scenario returns INFRA_ERROR and looks (wrongly) like a model failure.
Run the eval, then diff its scorecard against the committed baseline. --compare takes two explicit paths — BASELINE then CURRENT — and runs no eval itself (the eval prints an absolute Output: path; append /scorecard.json to it for the CURRENT arg):
```
gaia eval agent --category <cat> --agent-type <type>   # prints an absolute path, e.g. Output: /…/gaia/eval/results/<run-id>/
gaia eval agent --compare \
  tests/fixtures/eval_baselines/<model>-<hash>/scorecard_<cat>.json \
  <printed-output-path>/scorecard.json
```
Pick the BASELINE matching the model under test (the dirs are <model>-<hash>; ls tests/fixtures/eval_baselines/*/scorecard_<cat>.json) — do not sort by mtime (ls -t): a fresh clone stamps every baseline with the checkout time, so -t picks arbitrarily.
Regression rule: a category dropping materially below baseline (beyond run-to-run noise) blocks; an intentional capability removal must be re-baselined (--save-baseline) and called out in the report. An invalid run (concurrent eval, wrong ctx, mid-run model swap) is "invalid — re-run", not a result.
Stop this backend before Phase 5 (kill the :4200 process) so the real-world tier brings up its own clean instance rather than inheriting integration-tier state.

Phase 5 — Tier 3: Real-world (on the chosen machine)

Deploy + bring up. First clear any partial state from a prior failed run on the target (stale processes, half-downloaded model caches, bound ports). Get the build onto the target (clone/checkout the ref, install, build any frontend) and run setup — locally if the target is local (the common case), else via its declared access method. If the ref is an external fork/PR, honour the untrusted-code Hard rule. Start services as detached/background processes (or under a terminal multiplexer) so they outlive the executor. (For Lemonade startup gotchas — port conflicts, model loading — see lemonade-client-patterns.md and CLAUDE.md rather than re-deriving them here.)
Confirm the hardware is actually used — a health 200 is not enough. Read the backend's own device line from its startup log (e.g. an inference backend reporting using device <GPU name> and layers offloaded to GPU). rocm-smi/nvidia-smi may be absent (a Vulkan backend has no ROCm userspace) — fall back to VRAM via sysfs (/sys/class/drm/card*/device/mem_info_vram_used) or the backend log. If inference is on CPU, the tier is "not exercised on GPU", not a pass.
Drive the real surface live, and capture it. Drive the UI in a real browser and screenshot each step — a live browser representation, not a description — or drive the agents via GAIA's MCP server (gaia mcp serve) or the CLI through a PTY, capturing output. Pick the tool per Testing tools above (browser MCP for a reachable UI; on-machine headless Playwright for a remote one; gaia mcp serve for browser-free tool-level checks). A UI feature still needs a browser screenshot for the proof rule. Exercise the feature end-to-end; for input-type features (RAG document formats, etc.) exercise every supported type named in the spec/release notes, not just the happy-path one. Inject the planted facts. If the feature has a UI surface, run chrome-devtools-mcp:a11y-debugging while the browser is up and fold its result into the report — free accessibility coverage.
One GPU + one model slot → run scenarios sequentially (concurrent heavy runs race-evict each other's models — CLAUDE.md "Run agent evals SERIALLY").
Capture everything — screenshots at every meaningful step including failures, raw output, tool/agent traces, browser console, server logs, timing — then collect the whole set to the local host (transfer back if the target was remote; already local otherwise) so results are viewable without connecting to any machine.

Phase 6 — Judge & deliver

The judge does not rubber-stamp the executor's report.

Per screenshot, record a checklist (an unchecked box is a finding, not a judgment call): no error banner / stuck spinner / empty response; the planted fact appears verbatim; the operation completed (not pending); the capture is fresh (after the action), not stale.
Cross-check claims against source at the tested ref (gh api .../contents/<path>?ref=<ref> or the local checkout). For any CLI or release-note behaviour claim ("gaia X does Y", "command Z exists"), run the command and read it in src/gaia/cli.py at that ref — never accept it from the notes.
Verify planted facts in two places: the screenshot (the UI rendered it) and the raw agent/CLI trace (grep the captured trace). A fact in the screenshot but absent from the trace can mean a cached/hallucinated response, not live retrieval.
Check timing against the thresholds; surface outliers with their hardware context.
Deliver: push the decisive screenshots to the user via the file-send tool (each captioned with what it proves), plus a report table (Tier | Verdict | Evidence). State plainly any tier that was not truly exercised — a warm cache, a skipped login, a CPU fallback. "Not exercised" ≠ "passed". Lead with the finding.

Phase 7 — Cleanup

Tear down what the run created on the target; leave pre-existing services, caches, and the user's existing install intact. Confirm concretely — do not just assert it:

ps aux | grep -E "lemonade|llama-server|gaia.ui.server" | grep -v grep   # only pre-existing should remain
ls <throwaway-clone-dir> 2>/dev/null && echo LEAKED || echo clean        # throwaway dir gone
curl -s -o /dev/null -w "%{http_code}\n" <pre-existing-service-url>       # pre-existing still healthy

If the run died mid-deploy, also check for a corrupt/partial model cache before declaring clean. Keep the local artifact directory so the user can revisit screenshots. Note anything deliberately left (e.g. a system package installed for a build) and offer to remove it.

Evidence & artifact conventions

One per-run directory on the local host (under the system temp dir, or a project-local test-runs/), created mode 700 so captured secrets/data are not world-readable; with a shots/ subdir; a fresh dir per run.
Screenshots named NN_description.png; capture at every meaningful step and on failure. TUIs: capture the rendered text frame (a terminal multiplexer's capture) as the definitive record, plus a PNG.
Logs & traces: keep raw command output, the agent/tool-call trace, browser console, and server logs; quote raw output in the report rather than paraphrasing.
Surface, don't make them dig: push the few decisive screenshots to the user via the file-send tool with per-image captions; the rest stay in the dir, referenced by path.

Timing & anomaly conventions

Record: TTFT, tokens/sec, end-to-end response, document indexing time, server startup. Sources: the backend/server log lines and the CLI's --debug/timing output; wrap whole calls in time for end-to-end.
Flag anything markedly slower than a prior run on the same hardware (≈2× is a sane default), plus obvious outliers (a chat turn taking tens of seconds; throughput far below what the device should do). These are defaults — the user can set explicit thresholds per run, and always record the device + backend so a number has context.
Put a short timing table in the report. An anomaly is investigated or explained, never buried.