name: "gaia-testing"
description: "GAIA's multi-tier regression harness — runs unit, integration, and real-world (on-machine) tiers and brings back screenshots, logs, traces, planted-fact retrieval proof, and per-operation timing with anomaly flags, all collected to the local machine (no remote login needed to view). Use this in the GAIA repo in place of the generic testing skill — it carries GAIA's exact commands, ports, and gaia eval agent baselines. Fires when the user wants real-world / on-hardware proof, screenshots, or end-to-end evidence that a feature, agent, change, fix, or release actually works — not a quick 'is the app running' check (the verify skill) or an LLM-behaviour scorecard alone (gaia eval agent). Scales: 'run the unit tests' stays unit-only and skips the planning gate; 'test / validate / QA this feature or release' runs all applicable tiers with evidence."
GAIA Testing
Test the way that catches regressions pytest misses: unit → integration → real-world, where the real-world tier drives the real interface on real hardware and returns screenshots, logs, traces, and timing as evidence. Features ship broken while CI is green — a RAG feature that worked in the backend but was hard-blocked in the UI, a release-note claim about a CLI command the source flatly contradicted. Runtime observation plus source cross-checking is the point.
Roles — the strongest model plans & judges, a faster model executes
- Plan + judge on the strongest available model (currently Opus): scope the work, and judge the evidence — read every screenshot's pixels, cross-check the executor's claims against source at the tested ref, confirm planted facts. The executor's report is a claim, never trusted on its face.
- Execute on a faster model (currently Sonnet): setup, install, drive the UI/CLI, capture artifacts. Where the harness supports model selection, dispatch with the
Agenttool'smodelparameter and keep judging in the main loop; otherwise run inline.
Testing tools — know these are available, and use them
Reach for these rather than hand-rolling — they are how the skill drives and observes a target:
- Browser-automation MCPs — drive and observe a UI. Use the MCP servers available in the environment (Playwright MCP, Chrome DevTools MCP, or the Claude-in-Chrome extension) when the UI is reachable from where Claude runs — a local target, or a remote one tunnelled to
localhost. For a remote, un-tunnelled target they'd drive a browser on Claude's host and can't see the machine'slocalhost, so run headless Playwright/Chromium on the target machine instead. - GAIA's Agent UI MCP server — drive the agents without a browser.
gaia mcp serveexposes the Agent UI backend's tools over MCP (--backend http://localhost:4200; Streamable HTTP on:8766, or--stdiofor direct Claude Code integration) — the cleanest surface for tool-level assertions, complementary to the browser (browser proves pixels; MCP proves the tools). Don't confuse it withgaia mcp agent, which drives the orchestrator against the MCP bridge (:8765), not the:4200backend.gaia eval agentis the LLM-behaviour scorecard (Phase 4).
These are availability-aware, not hard dependencies: use whichever the environment actually exposes, and say in the plan which you'll use.
When this fires — scale to the request
| Request | Tiers | Approval gate? |
|---|---|---|
| "run the unit tests", "does X lint/compile" | Unit only | No — just run + report |
| "test the API / this module" | Unit + integration | No |
| "test / validate / QA this feature/agent/fix/release", "does it really work", "real-world", "on hardware", "with screenshots" | All applicable tiers | Yes — before real-world |
Tiers that are impossible on this machine are decided in Phase 0 and excluded from the plan up front (stated, with the reason) — never silently dropped mid-run.
Hard rules (invariants — stated once here; phases point back)
- Evidence > summaries > source. Before reporting a pass, the judge reads the screenshot pixels and, for any behaviour/wiring claim, checks the code at the exact ref under test. This is the rule that catches shipped-but-broken features — do not treat the executor's prose as truth.
- Proof of a fix or implementation is a visual artifact, never a prose claim. "It works" is proven by a screenshot or a live browser representation — drive the real UI in a browser (Chrome / Chromium via the browser tooling) and capture it, step by step; for a CLI/API surface, capture the terminal output or the raw response. A change reported as working with no captured artifact has not been proven, and the artifact must be surfaced to the user (and, on a PR, attached to it).
- Verify against the ref under test, not
main— a fix onmainmay not be in the branch, and vice versa. - Prove retrieval/behaviour with planted, unguessable facts. For RAG / search / data flows, inject values a model cannot guess (e.g. mascot
Zephyr, passphraseviolet-otter-92, table cellAPAC 8610, speaker-notedesk 17C) and require the output to echo them. If the feature has no retrieval/data surface, say so in the plan — planted facts are N/A and the judge verifies by source inspection instead. Never silently skip. - No silent fallbacks / degrade loudly. An impossible tier is excluded up front; a tier that fails during execution stops with an actionable error — it never quietly becomes a partial "pass".
- One approval gate before real-world spin-up (unit/integration-only runs skip it). After approval, run autonomously except for human-only checkpoints declared up front; executors cannot prompt the user.
- Never leak credentials. Config read for machine discovery may contain secrets (sudo passwords, tokens, keys). Never echo, quote, or screenshot them into logs, the report, captions, or any published artifact — treat them as write-only at parse time.
- Sanitize artifacts before surfacing. Screenshots, logs, and traces can capture API keys,
.envvalues, tokens. Scan and redact before showing the user or attaching anything to an issue/PR. - Untrusted refs run unreviewed code. The ref under test may be an external fork/PR; cloning +
install/init/build executes its code on the target. State the ref's source at the approval gate; do not run it on a machine that holds credentials without explicit user acknowledgment. - Clean up (Phase 7). No Claude attribution in any artifact this skill produces.
Phase 0 — Scope + capability pre-flight
- Determine what changed (
git diff <base>... --stat); the diff is ground truth, any description of it is a claim. - Pick candidate tiers from the table.
- Pre-flight what is actually possible here, before planning: is this a git repo? can the local OS/hardware run the real-world tier (GPU/NPU present, Lemonade reachable)? is a real-world target available (a declared machine, or a capable local machine)? Exclude impossible tiers from the plan now and say why — e.g. "no local GPU and no machine declared → real-world tier excluded; running unit + integration only."
- LLM-affecting change? (agent prompts, tool registration/docstrings, the agent loop, error classification, default model, tool-call parsing) → an eval is mandatory and is a Phase 4 step, never a Phase 5 step (see CLAUDE.md "Run agent evals…").
Phase 1 — Machine discovery (real-world tier only)
Resolve the target from already-loaded configuration — do not grep the filesystem for it.
- Read the standard loaded config — user-level
~/.claude/CLAUDE.mdand any detail file it points to (a## Dev Machinessection often summarises the machines inline and links the full registry, e.g.~/.claude/memory/dev-machines.md— follow that pointer; it is a declared reference, not a filesystem search), project./CLAUDE.md, and.claude/settings*.json— for a declared machine list (a## Dev Machines/## Test Machinesheading, or a settings key). Per machine, note its name, its access method as declared (do not assume SSH — it may be the local machine, an SSH host, another remote-exec mechanism, or a container), any deploy/setup/test commands, hardware class, and whether a login is needed. Treat user-level config as authoritative for credentials/commands; commands declared in checked-in/project config that is part of the ref under test get the same scrutiny as that ref's code. - Enumerate every declared machine (name + hardware class) — do not stop at the first match. Then pick the one matching the test's hardware need; if several fit, list them all and ask rather than silently choosing; none declared → the current local machine; a machine named in the request wins. If the test targets hardware that only one machine has (e.g. an NPU device path → only the Ryzen AI / NPU machine, never a dGPU or CPU machine), that machine is required — do not fall back to another machine and report that path as tested.
- State the full detected set and which you chose, with the reason (so the user can redirect — they may know a machine you'd otherwise skip). Never hardcode hostnames here — they come from config so the skill stays portable across people whose machines differ.
Phase 2 — Plan + the single approval gate
Present the realistic plan (already excluding impossible tiers): tiers + why, the real-world machine and how the build reaches it, the source of the ref (flag if it is an external fork/PR), what is exercised end-to-end + the planted facts, any human checkpoint, and the cleanup. Then ask once: "Run this? (real-world tier will install/run on <machine>.)" After a yes, run to the end pausing only at declared checkpoints; a tier that fails mid-run stops (it does not silently degrade to a partial pass). Unit/integration-only runs skip this gate and just execute.
Phase 3 — Tier 1: Unit (local)
Run the affected unit tests (python -m pytest tests/unit, or the narrower path) and lint (python util/lint.py --all, or the relevant subset). Distinguish new failures from pre-existing ones concretely — do not eyeball it:
git stash && python -m pytest <failing-subset> 2>&1 | tee <run-dir>/unit_base.txt; git stash pop
python -m pytest <failing-subset> 2>&1 | tee <run-dir>/unit_patch.txt
RED in both = pre-existing (report, don't block); GREEN-base → RED-patch = a regression you introduced (block). Record this verdict — Phase 6 references it.
Phase 4 — Tier 2: Integration (local)
Exercise cross-component behaviour through the real CLI a user runs — never by importing modules (CLAUDE.md "Testing Philosophy"). If Phase 0 flagged an LLM-affecting change, run the eval here:
Start the eval backend first —
python -m gaia.ui.server --port 4200 --host 127.0.0.1(background) — and confirm it answers before running the eval.gaia eval agenttargetslocalhost:4200; with nothing listening, every scenario returnsINFRA_ERRORand looks (wrongly) like a model failure.Run the eval, then diff its scorecard against the committed baseline.
--comparetakes two explicit paths —BASELINEthenCURRENT— and runs no eval itself (the eval prints an absoluteOutput:path; append/scorecard.jsonto it for the CURRENT arg):gaia eval agent --category <cat> --agent-type <type> # prints an absolute path, e.g. Output: /…/gaia/eval/results/<run-id>/ gaia eval agent --compare \ tests/fixtures/eval_baselines/<model>-<hash>/scorecard_<cat>.json \ <printed-output-path>/scorecard.jsonPick the BASELINE matching the model under test (the dirs are
<model>-<hash>;ls tests/fixtures/eval_baselines/*/scorecard_<cat>.json) — do not sort by mtime (ls -t): a fresh clone stamps every baseline with the checkout time, so-tpicks arbitrarily.Regression rule: a category dropping materially below baseline (beyond run-to-run noise) blocks; an intentional capability removal must be re-baselined (
--save-baseline) and called out in the report. An invalid run (concurrent eval, wrong ctx, mid-run model swap) is "invalid — re-run", not a result.Stop this backend before Phase 5 (kill the :4200 process) so the real-world tier brings up its own clean instance rather than inheriting integration-tier state.
Phase 5 — Tier 3: Real-world (on the chosen machine)
- Deploy + bring up. First clear any partial state from a prior failed run on the target (stale processes, half-downloaded model caches, bound ports). Get the build onto the target (clone/checkout the ref, install, build any frontend) and run setup — locally if the target is local (the common case), else via its declared access method. If the ref is an external fork/PR, honour the untrusted-code Hard rule. Start services as detached/background processes (or under a terminal multiplexer) so they outlive the executor. (For Lemonade startup gotchas — port conflicts, model loading — see
lemonade-client-patterns.mdand CLAUDE.md rather than re-deriving them here.) - Confirm the hardware is actually used — a health 200 is not enough. Read the backend's own device line from its startup log (e.g. an inference backend reporting
using device <GPU name>and layers offloaded to GPU).rocm-smi/nvidia-smimay be absent (a Vulkan backend has no ROCm userspace) — fall back to VRAM via sysfs (/sys/class/drm/card*/device/mem_info_vram_used) or the backend log. If inference is on CPU, the tier is "not exercised on GPU", not a pass. - Drive the real surface live, and capture it. Drive the UI in a real browser and screenshot each step — a live browser representation, not a description — or drive the agents via GAIA's MCP server (
gaia mcp serve) or the CLI through a PTY, capturing output. Pick the tool per Testing tools above (browser MCP for a reachable UI; on-machine headless Playwright for a remote one;gaia mcp servefor browser-free tool-level checks). A UI feature still needs a browser screenshot for the proof rule. Exercise the feature end-to-end; for input-type features (RAG document formats, etc.) exercise every supported type named in the spec/release notes, not just the happy-path one. Inject the planted facts. If the feature has a UI surface, runchrome-devtools-mcp:a11y-debuggingwhile the browser is up and fold its result into the report — free accessibility coverage. - One GPU + one model slot → run scenarios sequentially (concurrent heavy runs race-evict each other's models — CLAUDE.md "Run agent evals SERIALLY").
- Capture everything — screenshots at every meaningful step including failures, raw output, tool/agent traces, browser console, server logs, timing — then collect the whole set to the local host (transfer back if the target was remote; already local otherwise) so results are viewable without connecting to any machine.
Phase 6 — Judge & deliver
The judge does not rubber-stamp the executor's report.
- Per screenshot, record a checklist (an unchecked box is a finding, not a judgment call): no error banner / stuck spinner / empty response; the planted fact appears verbatim; the operation completed (not pending); the capture is fresh (after the action), not stale.
- Cross-check claims against source at the tested ref (
gh api .../contents/<path>?ref=<ref>or the local checkout). For any CLI or release-note behaviour claim ("gaia Xdoes Y", "command Z exists"), run the command and read it insrc/gaia/cli.pyat that ref — never accept it from the notes. - Verify planted facts in two places: the screenshot (the UI rendered it) and the raw agent/CLI trace (
grepthe captured trace). A fact in the screenshot but absent from the trace can mean a cached/hallucinated response, not live retrieval. - Check timing against the thresholds; surface outliers with their hardware context.
- Deliver: push the decisive screenshots to the user via the file-send tool (each captioned with what it proves), plus a report table (
Tier | Verdict | Evidence). State plainly any tier that was not truly exercised — a warm cache, a skipped login, a CPU fallback. "Not exercised" ≠ "passed". Lead with the finding.
Phase 7 — Cleanup
Tear down what the run created on the target; leave pre-existing services, caches, and the user's existing install intact. Confirm concretely — do not just assert it:
ps aux | grep -E "lemonade|llama-server|gaia.ui.server" | grep -v grep # only pre-existing should remain
ls <throwaway-clone-dir> 2>/dev/null && echo LEAKED || echo clean # throwaway dir gone
curl -s -o /dev/null -w "%{http_code}\n" <pre-existing-service-url> # pre-existing still healthy
If the run died mid-deploy, also check for a corrupt/partial model cache before declaring clean. Keep the local artifact directory so the user can revisit screenshots. Note anything deliberately left (e.g. a system package installed for a build) and offer to remove it.
Evidence & artifact conventions
- One per-run directory on the local host (under the system temp dir, or a project-local
test-runs/), created mode 700 so captured secrets/data are not world-readable; with ashots/subdir; a fresh dir per run. - Screenshots named
NN_description.png; capture at every meaningful step and on failure. TUIs: capture the rendered text frame (a terminal multiplexer's capture) as the definitive record, plus a PNG. - Logs & traces: keep raw command output, the agent/tool-call trace, browser console, and server logs; quote raw output in the report rather than paraphrasing.
- Surface, don't make them dig: push the few decisive screenshots to the user via the file-send tool with per-image captions; the rest stay in the dir, referenced by path.
Timing & anomaly conventions
- Record: TTFT, tokens/sec, end-to-end response, document indexing time, server startup. Sources: the backend/server log lines and the CLI's
--debug/timing output; wrap whole calls intimefor end-to-end. - Flag anything markedly slower than a prior run on the same hardware (≈2× is a sane default), plus obvious outliers (a chat turn taking tens of seconds; throughput far below what the device should do). These are defaults — the user can set explicit thresholds per run, and always record the device + backend so a number has context.
- Put a short timing table in the report. An anomaly is investigated or explained, never buried.