name: testing description: "E2E and validation testing strategy. Load this skill when: writing or running tests, debugging test failures, touching TUI/runtime/Messenger code, running castor check, needing full command reference, or setting up DB-touching tests. Covers test groups, isolation, controller E2E, TUI E2E, real LLM smoke tests, failure diagnostics, and DB test setup."
Testing Strategy
Castor command reference
All PHPUnit invocations include --stop-on-error --stop-on-failure --fail-on-all-issues --display-all-issues.
castor check # Full QA gate (deterministic — no live LLM): deptrac, unit/integration (ParaTest), controller replay E2E, TUI replay E2E, phpstan, cs-check; per-step timeouts + logs at var/reports/check-*.log
castor test # unit/integration tests (ParaTest parallel by default); excludes tui-e2e-replay, llm-real, recording, and controller-replay groups
castor test --filter=X # filter tests by name
castor test --suite=X # target a specific phpunit.xml test suite (ParaTest parallel)
castor test:tui [--filter=X] # TUI E2E journey tests (replay-backed, no live LLM)
castor test:tui-update [--filter=X] # update TUI snapshot baselines (filter optional)
castor test:llm-real [--filter=X] # real llama.cpp smoke (filter optional)
castor test:controller [--filter=X] # controller E2E smoke test (live LLM, opt-in)
castor test:controller-replay # controller E2E smoke tests with replay fixtures (no live LLM, default controller validation)
castor llm:fixtures:record # Re-record LLM replay fixtures from live LLM
castor llm:fixtures:info # List available LLM replay fixtures
castor deptrac # architecture boundary validation
castor phpstan [path] # static analysis (optionally scoped to a path)
castor phpstan:baseline # regenerate phpstan baseline
castor cs-fix [path] # auto-fix coding style
castor cs-check # check coding style (dry-run)
castor phar:build # Build hatfield.phar (worktree-local by default)
castor phar:ensure # Ensure PHAR exists (build if missing or stale)
castor phar:clean # Remove worktree-local hatfield.phar
Test LLM
Live LLM smoke tests (opt-in) use llama_cpp_test/test (port 9052). This is a fast local model for deterministic provider compatibility testing. Never use production LLM providers in E2E tests. Default E2E tests (controller replay, TUI replay) use deterministic pre-recorded fixtures and do NOT require llama.cpp.
Run the test llama.cpp server deterministically for smoke tests: temperature 0, fixed seed, and the test alias on port 9052. The smoke model is expected to answer/tool-call within a few seconds; long 30-60s waits usually hide a bad prompt, stale worker, or stuck process rather than real model latency.
LLM generation readiness preflight
Before castor test:llm-real and castor test:controller run live-LLM tests,
Castor runs check_llm_generation_ready() — a ~4s curl-based preflight that
sends a tiny max_tokens=1 chat completion to llama_cpp_test/test. If the
server responds to /health and /v1/models but generation hangs (corrupted
model load, stuck slots), this preflight fails immediately with a clear
diagnostic instead of burning 30-90s Castor step timeouts.
This preflight is NOT run by castor check (which is fully deterministic
and replay-backed).
If you see:
llama.cpp generation readiness check FAILED
Endpoint: http://192.168.2.38:9052/v1/chat/completions
Model: test
HTTP status: 0 (curl exit: 28)
Restart or fix the llama.cpp server. Health-only checks are insufficient.
HTTP timeout fallback
SymfonyAiProviderFactory injects a default 30s HttpClient timeout for all LLM requests when no explicit timeout is configured, preventing infinite hangs. The test environment (config/services_test.yaml) overrides this to 5s. The HATFIELD_LLM_HTTP_TIMEOUT env var allows per-environment override.
LLM Replay (deterministic, no live LLM)
Most tests that would otherwise hit a live LLM endpoint use instead
pre-recorded fixture files under tests/AgentCore/Fixtures/traces/.
- Replay mode is the default for
castor test. No live LLM calls. - Live mode is opt-in:
castor test:llm-real,castor test:controller, andcastor llm:fixtures:record. - Re-record fixtures when provider behavior, prompts, or tool schemas
change:
castor llm:fixtures:record. - Fixture format and recording/replay architecture described in
docs/llm-replay.md. Replay test helpers live intests/AgentCore/Infrastructure/SymfonyAi/Replay/.
Test groups
#[Group('llm-real')]— all tests that hit a real LLM endpoint#[Group('tui-e2e-replay')]— TUI journey tests (replay-backed, default and only TUI group)#[Group('phar')]— PHAR smoke tests (PharSmokeTest)
PHAR-based testing
Controller subprocess tests that use the live LLM path run against the built PHAR.
Castor test tasks (test:llm-real, test:controller)
automatically call phar:ensure first and set HATFIELD_BINARY_PATH so
AgentTestExecutable resolves the PHAR path. If PHAR build fails, these
test tasks skip gracefully (PHAR ensure failure is non-fatal).
Controller replay tests (test:controller-replay) and all TUI E2E tests
(test:tui, test:tui-update) use source bin/console and do not
require PHAR.
Pure unit/integration tests (castor test) remain source-based and do not
require PHAR. PHAR smoke tests (#[Group('phar')]) validate the built
artifact boots and responds to basic commands.
Run PHAR smoke tests manually:
castor phar:build
HATFIELD_BINARY_PATH=var/tmp/phar/hatfield.phar vendor/bin/phpunit --group phar
Isolation
All E2E tests must use var/tmp/test-{uuid} isolation. They must NOT read or write to the real .hatfield/sessions/ directory. On failure, tests dump session artifacts to stderr.
Per-suite DB isolation
castor test runs unit/integration tests with ParaTest by default (parallel
workers share the SQLite test DB safely via DAMA/DoctrineTestBundle
transaction isolation in WAL mode). Each ParaTest worker gets its own
compiled Symfony cache directory (via TEST_TOKEN in
tests/paratest-bootstrap.php). Filtered runs and non-ParaTest fallback
use a single shared DB sequentially.
castor check uses ParaTest for the unit/integration lane (excludes
E2E, live-LLM, recording, and PHAR groups).
- DB path:
HATFIELD_TEST_DATABASE_PATH(defaults toapp_test.sqlite). - ParaTest cache dir:
HATFIELD_CACHE_DIR=.hatfield/cache-paraT{token}(per-worker). doctrine:migrations:migrateruns once before the suite.- Standalone
vendor/bin/phpunitruns without Castor must exportHATFIELD_TEST_DATABASE_PATH=app_test.sqlite. - Filtered runs (
castor test --filter=...) use sequential PHPUnit (shared single DB).
What each command tests
| Command | What it tests | Requires |
|---|---|---|
castor check |
Full QA gate (deterministic): deptrac, unit/integration (ParaTest), controller replay E2E, TUI replay E2E, phpstan, cs-check. No live LLM, no PHAR. | tmux |
castor test |
Unit/integration tests (ParaTest parallel by default) | Nothing (pure PHP) |
castor test:llm-real |
Real LLM smoke: ControllerSmokeTest, LlamaCppSmokeTest (excludes recording group). Run as focused opt-in validation when changes touch provider/LLM-visible code — NOT required for every normal task. |
llama.cpp on port 9052 |
castor test:controller-replay |
Controller replay E2E: spawns --controller, JSONL protocol, replay fixtures (no live LLM) |
Nothing (pure PHP) |
castor test:controller |
Controller E2E: spawns --controller, JSONL protocol (live LLM, opt-in) |
llama.cpp on port 9052 |
castor test:tui |
TUI E2E journey tests (replay-backed, no live LLM) | tmux |
castor run:agent-test |
Interactive tmux session for manual inspection | tmux, llama.cpp on port 9052 |
castor run:agent |
Launch agent in tmux | tmux, LLM provider |
castor llm:fixtures:record |
Re-record replay fixtures from live LLM | llama.cpp on port 9052 |
castor llm:fixtures:info |
List available replay fixtures and metadata | Nothing (pure PHP) |
Controller E2E testing
Controller replay E2E (default, deterministic)
ControllerReplaySmokeTest (tests/CodingAgent/Runtime/Controller/E2E/):
Run with castor test:controller-replay. Does NOT require live LLM.
Extends ControllerReplayE2eTestCase, which:
- Spawns
bin/console agent --controllerwithAPP_ENV=test+HATFIELD_LLM_REPLAY_FIXTURE_PATH config/services_test.yamlwiresHttpClientInterfacethroughControllerReplayHttpClientFactory(tests/). When the env var is set, the factory returns a MockHttpClient with fixture-driven SSE. No production code insrc/checks the replay env var.- Uses pre-recorded fixture files (committed to repo) for deterministic responses
- Tracks process group PIDs and terminates the entire tree on teardown
- Does NOT require
LLAMA_CPP_SMOKE_TEST,HATFIELD_BINARY_PATH, or any live AI provider - Always uses the source
bin/console(not PHAR) so test-DI autoload works
Fixture format: same as docs/llm-replay.md.
Fixtures live in tests/CodingAgent/Runtime/Controller/E2E/fixtures/.
Process ownership:
- Controller + Messenger consumers tracked via /proc PID scanning
- Teardown: SIGTERM → 3s grace → SIGKILL for all tracked PIDs
- Diagnostics on failure: tracked PIDs, fixture count, process state
Controller live E2E (opt-in)
ControllerSmokeTest (tests/CodingAgent/Runtime/Controller/E2E/):
- Creates isolated
var/tmp/test-{uuid}with.hatfield/settings.yaml - Spawns
bin/console agent --controllervia proc_open - Waits for
runtime.readyevent on stdout - Sends
start_runJSONL command on stdin with a deterministic prompt - Reads JSONL events from stdout, collecting until the event that proves the behavior under test.
- Asserts event sequence/proof:
runtime.readyreceivedcommand.ackreceived for start_runrun.startedreceived- for conversational smoke, assistant text/message events and terminal run state
- for tool smoke, the intended
tool_execution.started+ matchingtool_execution.completedbytool_call_id
- Verifies session artifacts (
state.json,events.jsonl) when relevant - On failure, dumps all collected events, session artifacts, and messenger DB
This exercises the full async runtime pipeline:
- Controller event loop (Revolt
EventLoop::onReadable/repeat/onSignal) - Messenger consumer processes (run_control, llm, tool)
- LLM consumer stdout streaming of transient deltas
- Event drain and publish transport polling
Controller E2E wait strategy
Use the narrowest event proof instead of waiting for the whole run when the feature does not require it:
collectEventsUntil($eventType, 5.0)for a specific runtime event.collectEventsUntilToolCompleted($toolName, 5.0)for tool tests; it trackstool_call_idfromtool_execution.startedto the matchingtool_execution.completed.- Do not hard-require
run.completedfor tests whose real assertion is tool execution. The post-tool assistant turn can be slower or more variable than the tool path itself. - Prompts in
llm-realtests must name the exact tool and exact relative path, e.g.Call the tool named read exactly once with path ./file.txt. Avoid vague natural-language prompts that let the small model pick a different tool or shorten paths.
Failure diagnostics
On E2E test failure, the test dumps:
- All collected JSONL events (with types and count)
- Session artifacts:
state.json,events.jsonl - Messenger DB (
messenger.sqlite) with pending message counts per queue - Controller stderr output
Required runtime/TUI validation
For changes touching TUI runtime behavior, AgentSessionClient, model routing, Messenger wiring, TranscriptProjector, RuntimeEventPoller, transcript rendering, or LLM-visible execution flow, unit/container/mocked tests are not enough.
You MUST run castor check. It includes controller replay E2E and TUI replay E2E (both deterministic), so runtime/TUI/error-propagation changes exercise the controller process and interactive user-visible TUI path before handoff. Live LLM validation is opt-in via castor test:controller.
For especially risky visual or interaction changes, also run castor run:agent-test to drive the agent in tmux and capture snapshots.
Validation must exercise the real user flow: start agent, type prompt, submit, wait for visible assistant response or visible error block, and capture TUI snapshot plus session artifacts on failure. Do not claim runtime/TUI work is done based only on DTO tests, mocked pollers, container compilation, or isolated service tests.
If tmux is unavailable, TUI tasks MUST remain IN-PROGRESS with exact environmental blocker output — never mark CODE-REVIEW or DONE without it. The default castor check is deterministic and does NOT require llama.cpp.
Focused live LLM provider validation
castor check is deterministic and must NOT include castor test:llm-real by default. Run castor test:llm-real as opt-in focused validation when changes touch:
- Symfony AI provider/factory/platform integration
- LLM provider config, model catalog/resolution/routing/selection
- Tool schemas, tool-call conversion, or tool argument prompts
- LLM-visible system/developer prompts or prompt templates
- Live provider compatibility, streaming conversion, stop_reason/usage/tool-call deltas
- Controller live-provider path behavior where replay cannot prove provider compatibility
castor test:controller remains opt-in for live controller E2E when appropriate. Do NOT require live LLM validation for every normal task — only for provider/LLM-visible changes.
Before re-running failed controller/TUI E2E checks, kill stale worker processes from the failed worktree (messenger:consume, agent --controller, PHPUnit/Castor children). Orphaned consumers can keep queues busy and make a fixed test appear hung.
TUI E2E (replay-backed journey, default)
castor test:tui runs the deterministic replay-backed TUI journey test
(TuiJourneyE2eTest, group tui-e2e-replay). It exercises startup
layout, reasoning cycling, /hotkeys, shell !ls, file completion,
model interaction via replay fixtures, and double-bang rejection — all
in a single long-lived tmux session.
- Uses
APP_ENV=test+ sourcebin/console(not PHAR) soconfig/services_test.yamlwiresControllerReplayHttpClientFactoryfor deterministic model responses. - No live LLM, no
LLAMA_CPP_SMOKE_TEST, no PHAR. - Golden snapshot test (
TuiStartupSnapshotTest) also uses replay.
TUI E2E snapshot artifacts
After castor test:tui, passing test snapshots are kept at var/tmp/tui-e2e-*/ for inspection. Each isolated test directory contains:
.hatfield/tmp/tui/smoke/*.ansi— ANSI terminal snapshots captured bysaveAnsiSnapshot().hatfield/sessions/<id>/events.jsonl— canonical event log for resumed sessions
After failures, diagnostics go to var/tmp/tui-failures/ (ANSI snapshots + plain text dumps).
Run castor cleanup to remove all temp/test artifacts. See tests/AGENTS.md for full test standards: shared helpers, directory isolation, fast E2E waits, what not to test, one-class-per-file rules.
TUI E2E waits should target exact visible proof with short caps (typically 2-5s for startup/status/UI assertions on the local test model). Avoid broad 30-60s waits and fixed usleep() calls unless the delay itself is the behavior under test.
DB-touching tests
If a test touches the database, it is an integration test, not a unit test. Use KernelTestCase + static::getContainer() for EntityManager/repository/services. Do not use standalone ORMSetup/DriverManager/SchemaTool/EntityManager factories in tests. Test DB is configured via config/packages/test/doctrine.yaml; DAMA/DoctrineTestBundle wraps each test in a transaction for rollback isolation. Schema is created once before the suite runs, not per test. Load test data via container EntityManager or fixtures, not manual in-memory SQLite factories.