name: test-driving-rally
description: Smoke-test rally for real to verify features work end-to-end. Use when the user wants to verify that rally is working correctly after changes or before a release.
license: MIT
compatibility: Requires rally built from source, plus at least one agent CLI (agy, claude, codex, gemini, opencode).
metadata:
  author: rally
  version: "1.4"
Run real end-to-end smoke tests of rally to verify features work correctly. This skill drives rally from the CLI in isolated /tmp/ repos, observes actual behaviour, and reports findings.
Goal: High-signal smoke tests, not exhaustive QA. Prioritise breadth across features over depth on any one feature. When you find a defect that's small enough to fix in the same session, fix it; otherwise note it and move on.
Default workflow (unless the user explicitly overrides)
When the user invokes this skill, follow this loop:
- Update the skill first. Apply any new guidance the user just gave (model slugs, env state, behaviours). Also refresh any status-shaped text that's gone stale (e.g. "gemini is unauthenticated" when it now works). Commit the skill update separately at the end if it's substantive.
- Read prior session state. Check `./tmp/session-handoff.md` and the real-backend tests for what's already covered. Pick behaviours that are not yet tested or that need re-verification.
- Test drive. Run the pre-built real-backend tests as a baseline, then do targeted manual smoke tests on the gaps.
- Apply fixes for any defects you find that are tractable in-session.
- Verify the fixes by re-running the relevant smoke test or adding a real-backend test.
- Bump the minor version in `internal/buildinfo/VERSION` and commit the code patch (one commit per logical change is fine), unless the user explicitly asks for patch or major.
- Wrap up once confident or approaching ~300k context:
  - Update the skill with any new findings (workflow gotchas, status corrections, new model behaviours).
  - Write a new `./tmp/session-handoff.md` (overwrite the previous; it's a single rolling doc).
  - Commit the docs update.
NEVER replace user-provided model slugs with older ones on a hunch. If the user names a model, save it verbatim into this skill and use it. Older slugs (e.g. gemini-2.5, gpt-4o, opencode-zen variants) are not valid in this environment. If the new slug appears not to work, report the failure mode — don't silently fall back.
Recording progress — important
Previous sessions have failed to update this skill correctly. To prevent recurrence:
- Section 1 has a "Current model slugs" subsection. Treat it as the source of truth. When the user provides new slugs, edit that list immediately, before doing any testing. Do not embed model slugs only in scattered examples below — update the canonical list.
- Section 3 ("Known environment-specific quirks") must reflect current reality. If an agent that was previously broken now works (e.g. gemini after workspace-trust fix), delete the old failure note instead of stacking caveats on top of it. Stale "Gemini fails with exit 41 if no API key" lines mislead future sessions.
- The session handoff at `./tmp/session-handoff.md` is a single rolling doc, not an append log. Overwrite it each session. Do not create dated copies in `/tmp/` outside the repo.
- Commit the skill update separately from code patches so reviewers can see the skill diff in isolation.
0. Reuse pre-built real-backend tests first
Before doing manual smoke tests, run the existing real-backend integration tests. They cover the core scenarios automatically:
RALLY_TEST_REAL_AGENTS=1 go test ./internal/relay/... -run TestRealBackend -v -timeout 600s
These tests skip automatically when RALLY_TEST_REAL_AGENTS is unset. They cover:
- Basic claude relay with file creation
- Laps queue integration
- Log scoping per-repo (two repos → two subdirectories in data_dir)
- Codex executor (checks for CLI arg conflicts)
- OpenCode executor (checks headless mode — no TUI ANSI in summary)
- Antigravity executor (checks `agy --print`, settings-backed model selection, and conversation-id capture)
- Resilience retry budget exhaustion and agent pausing
- Custom harness via `opencode run` (no TUI, valid try record)
- Multi-harness round-robin (`cc ge op` → one of each, in order)
If they all pass, proceed to the manual smoke tests below for broader coverage. If any fail, investigate before continuing — the pre-built tests are cheaper to run and faster to interpret than manual ones.
Add new tests to internal/relay/runner_real_backend_test.go whenever you find a new category of failure during manual testing.
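A long real-backend run is easier to triage from a summary than from scrollback. The sketch below assumes the suite was run with `-v` and its output saved to a file; the helper name `summarise_go_test` and the log path are illustrative and not part of rally. It counts the `--- PASS`, `--- FAIL`, and `--- SKIP` result lines that `go test -v` prints for top-level tests and subtests.

```shell
# Hypothetical helper (not part of rally): summarise a saved `go test -v` log.
# Returns non-zero when any test failed.
summarise_go_test() (
  log="$1"
  # grep -c still prints 0 when nothing matches but exits 1, so mask the status.
  pass=$(grep -c '^ *--- PASS: ' "$log" || true)
  fail=$(grep -c '^ *--- FAIL: ' "$log" || true)
  skip=$(grep -c '^ *--- SKIP: ' "$log" || true)
  printf 'pass=%s fail=%s skip=%s\n' "$pass" "$fail" "$skip"
  [ "$fail" -eq 0 ]
)

# Example (the log path is an assumption):
#   RALLY_TEST_REAL_AGENTS=1 go test ./internal/relay/... -run TestRealBackend -v -timeout 600s 2>&1 | tee /tmp/real-backend.log
#   summarise_go_test /tmp/real-backend.log
```

A non-zero skip count with zero passes usually means RALLY_TEST_REAL_AGENTS was unset, since the tests skip themselves in that case.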
1. Setup
Bump VERSION before testing patches. Any session that commits patches should increment the minor number in internal/buildinfo/VERSION by default (e.g. 0.7.0 → 0.8.0) unless the user explicitly asks for patch or major. Commit it so CI builds an updated binary for distribution. The file is embedded into the binary, so rally version on a dev build reports vX.Y.Z-dev. Do this once per session, before building:
# increment minor version, e.g.:
echo "0.8.0" > internal/buildinfo/VERSION
git add internal/buildinfo/VERSION && git commit -m "bump version to 0.8.0"
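To avoid typing the next version by hand, the bump can be computed from the current file contents. This is a minimal sketch that assumes VERSION holds a bare MAJOR.MINOR.PATCH string with no prefix; `bump_minor` is a hypothetical helper, not something rally ships. It resets the patch number to 0, matching the 0.7.0 → 0.8.0 example above.

```shell
# Hypothetical helper: print the next minor version for a MAJOR.MINOR.PATCH string.
bump_minor() (
  IFS=.
  set -- $1            # split on dots into $1 $2 $3
  if [ "$#" -ne 3 ]; then
    echo "expected MAJOR.MINOR.PATCH" >&2
    exit 1
  fi
  printf '%s.%s.0\n' "$1" "$(( $2 + 1 ))"
)

# Example:
#   next=$(bump_minor "$(cat internal/buildinfo/VERSION)")
#   echo "$next" > internal/buildinfo/VERSION
#   git add internal/buildinfo/VERSION && git commit -m "bump version to $next"
```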
Build rally from source (do not rely on PATH rally — it may be a stale version):
go build -o /tmp/rally ./cmd/rally/
export PATH="/tmp:$PATH"
rally version
Check which agent CLIs are available:
which agy claude codex gemini opencode 2>/dev/null
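The output of `which` differs between systems, so a script that needs the list of usable harness CLIs may prefer the POSIX `command -v` builtin. The function name below is illustrative; the CLI names are the five listed above.

```shell
# Print the agent CLIs that resolve on PATH, one per line.
available_agents() {
  for cli in agy claude codex gemini opencode; do
    if command -v "$cli" >/dev/null 2>&1; then
      echo "$cli"
    fi
  done
}

# Example:
#   available_agents
```

An empty result means no harness can be smoke-tested in this environment; a CLI appearing here says nothing about whether it is authenticated.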
Current model slugs (canonical list)
Always use these slugs in tests. They are the only slugs known to be available in this environment as of the latest session. If a slug fails, report it — do not fall back to older names.
| Harness | Slug | Notes |
|---|---|---|
| `ag`/`agy` (antigravity) | `Gemini 3.5 Flash (High)` | Verified 2026-05-21 via `agy --print`; agy 1.0.0 has no CLI model flag, so Rally sets `~/.gemini/antigravity-cli/settings.json` for the run and restores it. |
| `cc` (claude) | `claude-haiku-4-5` | Cheapest/fastest; default for smoke tests. |
| `cx` (codex) | `gpt-5.4-mini` | Verified working (see `TestRealBackend_CodexRelay`). |
| `ge` (gemini) | `gemini-3.1-pro-preview` | Previously verified, but unavailable on 2026-06-19: gemini-cli 0.40.1 returns `IneligibleTierError` / `UNSUPPORTED_CLIENT` for Gemini Code Assist individuals. Prefer Antigravity or other harnesses until account/client eligibility changes. |
| `ge` (gemini) | `gemini-3-flash-preview` | Previously verified, but unavailable on 2026-06-19 for the same Gemini Code Assist `IneligibleTierError`; Rally should classify this as `auth_or_proxy`. |
| `op` (opencode) | `opencode-go/kimi-k2.6` | Monthly usage-limited on 2026-06-19. Live Rally probe classified `usage_limit`, parsed a ~96h reset, and benched quota scope `opencode:opencode-go` after the connected-idle path fired at ~5m. |
| `op` (opencode) | `opencode/big-pickle` | Ongoing free smoke-test model. NOT `opencode-zen/...`: the zen prefix is wrong. |
| `op` (opencode) | `zai-coding-plan/glm-5.1` | Verified 2026-05-11: ~10s. The zai-coding-plan provider with glm-5.1 suffix. |
Current note (2026-06-19): opencode-go models are at monthly usage limit in this environment. Prefer non-opencode-go providers (opencode/big-pickle, zai-coding-plan/glm-5.1) for OpenCode success-path smoke tests until the monthly limit resets.
Alias note: Antigravity is ag or agy; gemini is ge, NOT gm. Rally rejects gm with unknown agent alias.
Check /workspace/rally.toml for the project-default slugs in use, and AGENTS.md for terminology. The slugs above override anything you see in older docs or memory.
2. Feature areas to cover
For each area, create an isolated git repo in /tmp/rally-test-<area>/, run rally init, write a .rally/config.toml, and run the test. Use short --iterations 1 or --iterations 2 relays so tests complete quickly. Use claude-haiku-4-5 (the canonical cc slug in section 1) for claude tests (cheapest/fastest).
2a. Basic relay (claude)
Smoke test: single iteration, simple file creation task.
mkdir -p /tmp/rally-test-basic && cd /tmp/rally-test-basic
git init -q && git config user.email "t@t.com" && git config user.name "T"
touch init.txt && git add . && git commit -q -m "init"
rally init
# Write config with claude_model set
rally relay --new --iterations 1 --agent cc "Create an empty file called smoke-test.txt"
Check: exit 0, file exists, try record in .rally/state/tries.jsonl shows "completed": true.
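The try-record part of that check can be scripted. This sketch assumes tries.jsonl holds one JSON object per line with the newest record last, and it matches the `"completed": true` field textually, with or without whitespace after the colon, so it does not depend on jq. `last_try_completed` is a hypothetical helper name.

```shell
# Succeeds when the most recent try record is marked completed.
# Assumption: one JSON object per line, newest last.
last_try_completed() {
  tail -n 1 "${1:-.rally/state/tries.jsonl}" |
    grep -Eq '"completed"[[:space:]]*:[[:space:]]*true'
}

# Example, run from the test repo after the relay exits:
#   [ -f smoke-test.txt ] && last_try_completed && echo "basic relay OK"
```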
2b. CLI monitor
Observed during any claude run. Look for:
- `⏱ Xs │ 📁 N files │ last activity: Xs` status line updating
- `~Nk tok` token estimate appearing after first activity
- `⚠ slowing` indicator if liveness probe fires
- Keyboard hint line: `[Ctrl+S skip] [Ctrl+P pause] [Ctrl+X graceful stop] [Ctrl+C quit now]`
2c. Config validation
# Schema version warning
cat > .rally/config.toml << 'EOF'
schema_version = 99
[defaults]
mix = "cc"
claude_model = "claude-haiku-4-5"
EOF
rally routes check # should warn about schema version
# Invalid harness name
cat > .rally/config.toml << 'EOF'
schema_version = 2
[harness.123bad]
command = ["echo"]
EOF
rally routes check # should error: invalid harness name
# Missing default route with routes configured
cat > .rally/config.toml << 'EOF'
schema_version = 2
[defaults]
claude_model = "claude-haiku-4-5"
[routes]
planner = ["cc:2"]
EOF
rally routes check # should warn: no default route
2d. Routes (role-based routing)
# Config with default route
[routes]
default = ["cc:2"]
planner = ["cc:2"]
executor = ["cc:1"]
Run rally routes check → confirm summary shows all routes. Then run a relay without --agent (uses default route). Confirm it runs correctly.
2e. Laps integration
Prerequisites: laps CLI installed and initialized.
mkdir -p /tmp/rally-test-laps && cd /tmp/rally-test-laps
git init -q && git config user.email "t@t.com" && git config user.name "T"
touch init.txt && git add . && git commit -q -m "init"
laps init && laps on
laps add head --title "Task 1" --description "Create file task1.txt with content 'done'"
laps add tail --title "Task 2" --description "Create file task2.txt with content 'done'"
rally init
# Write config
rally relay --new --iterations 2 --agent cc
Check: Rally auto-detects .laps/, prints "Installed rally hooks in ...". Both files created, laps list shows empty queue.
2f. Custom harness
Important: for opencode specifically, prefer the built-in op alias with opencode_model in defaults rather than a custom harness. The built-in executor uses opencode run <prompt> --format json (headless mode) and handles JSON output parsing. A custom harness using command = ["opencode"] starts TUI mode — it will not exit cleanly and the freeze detector will see spurious output. Rally warns about this at startup.
If you do need a custom harness for opencode, use headless mode explicitly:
[harness.mycode]
command = ["opencode", "run", "$PROMPT", "--format", "json"]
model_flag = "--model"
output_strategy = "tail"
output_lines = 50
tail_stream = "stdout"
[harness.mycode.models]
kimi = "opencode-go/kimi-k2.6"
mini = "opencode/big-pickle" # NOT opencode-zen
zai = "zai-coding-plan/glm-5.1"
The custom-harness path with opencode run has been verified in TestRealBackend_CustomHarnessRelay — it produces valid try records with no ANSI in summaries, and the relay record's agent_mix shows mycode. The TUI warning (section 3) is what protects against the bad config; don't disable it.
For any other CLI that accepts input on stdin and exits:
[harness.myagent]
command = ["myagent", "--no-interactive"]
model_flag = "--model"
output_strategy = "tail"
output_lines = 40
tail_stream = "combined"
Run rally relay --new --agent mycode:kimi "Create file custom-test.txt". Check relay record shows agent_mix containing mycode.
2g. Resilient execution
Pause semantics (changed 2026-05-28 in harden-relay-run-lifecycle): an agent
is only paused ("paused" event in agent_status.jsonl) when its failures are
classified FailureInfra AND there is more than one infra failure
(failureClass == FailureInfra && infraFailures > 1). A plain agent task-failure
(FailureAgent — e.g. the agent runs but makes no changes, or returns a non-infra
error) no longer pauses the agent; the scheduler still rotates off it within the
relay via OnAgentFailed, but cross-relay resilience-pause is reserved for repeated
infra problems (rate limits, connection refused, usage limits, harness/launch errors —
see internal/reliability/patterns.go). Classification reads the last 50 lines of the
try log file, so to force an infra classification in a fake-executor test, write an
infra-pattern line (e.g. rate limit) to opts.LogPath.
To verify pause-on-infra (e.g. codex CLI broken, or a rate-limited free-tier provider), check:
- `.rally/state/agent_status.jsonl` contains a `"paused"` event for that agent
- Subsequent relay attempts for that agent show "all agents paused, waiting Xm" in the relay log
- `~/.local/share/rally/relays/relay-N.log` for confirmation
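The first of those checks can be sketched as a plain text match. The schema of agent_status.jsonl is not documented here, so the helper below only assumes that a pause record is a single line containing both the string `"paused"` and the agent's identifier as a quoted JSON string. The helper name and the identifier you pass (for example `cx`) are assumptions to adjust against a real record.

```shell
# Succeeds when some line mentions both "paused" and the given agent.
# Assumption: each event is one JSON line that names the agent as a quoted string.
agent_paused() (
  agent="$1"
  file="${2:-.rally/state/agent_status.jsonl}"
  grep -F '"paused"' "$file" | grep -Fq "\"$agent\""
)

# Example:
#   agent_paused cx && echo "codex was paused"
```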
TestRealBackend_ResilienceRetryBudget (deterministic, fake executor) guards the
infra-pause path; TestRealBackend_OpenCodeRelay only requires a pause event when the
opencode failure was infra-classified (a plain "no changes" failure is a valid
non-paused outcome). Do NOT re-assert the old "any failure pauses" behavior.
2h. Resume and --new/--resume flags
Create an incomplete relay (kill mid-run or use --iterations 2 with an agent that partially fails). Then:
# Interactive prompt test (pipe "resume" or "new")
echo "resume" | rally relay --agent cc "..." # should show resume prompt
echo "new" | rally relay --agent cc "..." # should discard + restart
# Non-interactive flags
rally relay --resume --agent cc "..." # should resume silently
rally relay --new --agent cc "..." # should close old, start new
Check: Relay records in .rally/state/relays.jsonl — old relay gets ended_at set when --new is used.
2i. Weighted mix
rally relay --new --iterations 2 --agent "cc:2" "Create mix-test.txt"
Check: agent_mix in relay record shows the weight spec. Both iterations run with claude.
2j. Multi-harness relay (cc + other)
rally relay --new --iterations 3 --agent "cc ge op" "Create a unique file per iteration."
Watch the header line cycle through claude, gemini, opencode. The
agent_type field in state/tries.jsonl should also alternate. Regression
note (fixed in 0.7.4): prior versions stuck on the first harness because
the override path didn't inject a default quota for bare aliases.
TestRealBackend_MultiHarnessRoundRobin guards this. If you see all
iterations using the same agent, the override quota default has likely
regressed.
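The rotation check can be automated by pulling the agent_type sequence out of the try records. This is a sketch that assumes one JSON object per line in .rally/state/tries.jsonl and matches the field textually; both helper names are hypothetical, and the agent_type values in a real file may differ from the harness names shown in the header.

```shell
# Print the agent_type of each try record, in file order.
agent_sequence() {
  sed -n 's/.*"agent_type"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    "${1:-.rally/state/tries.jsonl}"
}

# Succeeds when more than one distinct agent_type appears.
harness_rotated() {
  distinct=$(agent_sequence "$1" | sort -u | wc -l)
  [ "$((distinct))" -gt 1 ]
}

# Example:
#   agent_sequence | tr '\n' ' '
#   harness_rotated || echo "all iterations used one agent: possible quota-default regression"
```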
2k. Tail command
rally tail # stream latest try log (JSONL)
rally tail --try 1 # stream specific try
Note: rally tail uses the shared data_dir. The --try N number is the global try ID across all repos using the same data_dir — be aware of this when testing across multiple repos.
2l. Progress command
rally progress --summary "test done"
rally progress --complete --summary "all done" --followup "check results"
Check: .rally/summary.jsonl updated with new entries.
2m. Instructions command
rally instructions show # shows "(no project instructions set)" or content
2n. Rate-limit / stuck-agent scenario
To test how rally handles a rate-limited or hung agent, use a provider known to be rate-limited (or just a slow model). Configure with a short freeze threshold:
[reliability]
freeze_threshold_secs = 60
retry_budget = 1
Run and observe:
- `⚠ slowing` fires at 36s (60% of threshold)
- On Linux: two freeze paths exist:
  - Classic (`classicFrozen`): log silent ≥ threshold AND connections == 0. Fires once the agent's TCP connections drop.
  - Connected-frozen (`connectedFrozen`): log silent ≥ threshold AND connections > 0 AND no syscall I/O (rchar+wchar) for 5 minutes. Catches rate-limited agents keeping a connection alive but sending no data.
- On non-Linux: only log silence is checked; `❄ frozen` fires at threshold.
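The Linux rules above can be written down as a small decision function, which helps when predicting what a given run should show. This is a restatement of the conditions in this section, not rally's implementation; the function names and argument order are illustrative, and the 300-second figure is the 5-minute window stated above.

```shell
# Seconds at which the slowing warning is expected (60% of the freeze threshold).
slowing_at() {
  echo $(( $1 * 60 / 100 ))
}

# freeze_state THRESHOLD_SECS LOG_SILENT_SECS CONNECTIONS IO_IDLE_SECS
# Prints classicFrozen, connectedFrozen, or none.
freeze_state() {
  threshold=$1 silent=$2 conns=$3 io_idle=$4
  if [ "$silent" -lt "$threshold" ]; then
    echo none                 # log is still active enough
  elif [ "$conns" -eq 0 ]; then
    echo classicFrozen        # silent and no open connections
  elif [ "$io_idle" -ge 300 ]; then
    echo connectedFrozen      # silent, connected, no syscall I/O for 5 minutes
  else
    echo none                 # silent but still doing I/O on an open connection
  fi
}

# Example:
#   slowing_at 60
#   freeze_state 60 320 2 300
```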
The per-try netstat log at ~/.local/share/rally/tries/<repo>/try-N.netstat.jsonl records {ts, log_silent_s, connections, io_bytes, syscall_bytes} each tick. Typical baselines:
- Simple task (file creation): connections 2-6, syscall delta 2-5 MB total
- npm install: connections 1-2 with massive syscall delta (400 MB–2 GB), sporadic "No I/O" warnings during download wait phases
- Rate-limited idle: connections > 0, syscall bytes plateau (< 1 MB/min), log silent
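To compare a run against these baselines, the total syscall delta can be read off the netstat log. The sketch below assumes syscall_bytes is a cumulative counter, so the delta is the last tick minus the first, and it extracts the field textually with awk. `syscall_delta` is a hypothetical helper name.

```shell
# Print (last syscall_bytes - first syscall_bytes) for a try's netstat log.
# Assumption: syscall_bytes is cumulative and each tick is one JSON line.
syscall_delta() {
  awk '
    match($0, /"syscall_bytes"[ ]*:[ ]*[0-9]+/) {
      field = substr($0, RSTART, RLENGTH)   # the matched key and value
      sub(/^[^:]*:[ ]*/, "", field)         # keep only the number
      if (!seen) { first = field + 0; seen = 1 }
      last = field + 0
    }
    END { if (seen) printf "%.0f\n", last - first }
  ' "$1"
}

# Example (repo key and try number are placeholders):
#   syscall_delta ~/.local/share/rally/tries/<repo>/try-N.netstat.jsonl
```

A delta of a few MB fits the simple-task baseline; a value that stops growing while connections stay above zero fits the rate-limited idle pattern.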
Check ~/.local/share/rally/relays/<repo>/relay-N.log for "freeze detected" vs no freeze.
3. Known environment-specific quirks
These are agent-CLI behaviours that affect how tests appear. None are rally bugs.
- Gemini: Not currently usable in this environment. On 2026-06-19, gemini-cli 0.40.1 returned `IneligibleTierError` / `UNSUPPORTED_CLIENT`: the client is no longer supported for Gemini Code Assist individuals and suggests migrating to Antigravity. Rally should classify this as `auth_or_proxy`, not `agent_error`, and should end with `all agents unavailable` rather than frozen state when no fallback exists.
- Codex: `--full-auto` / `--dangerously-bypass-approvals-and-sandbox` conflict resolved (commit history); only the bypass flag is passed now. `TestRealBackend_CodexRelay` guards this.
- OpenCode: Model availability varies by provider. Use the built-in `op` alias, NOT a custom harness with `command = ["opencode"]` (which starts TUI mode). Rally warns on this at startup. For the current opencode-go monthly limit, opencode maintained a silent connected process until Rally's connected-idle path fired at ~5m; Rally then surfaced server-log-tail evidence as `usage_limit`, displayed `usage limit, resets in 96h0m`, and benched quota scope `opencode:opencode-go`.
Session resume per harness (verified 2026-06-09)
Re-validated headless resume-with-prompt end-to-end on 2026-06-09 with a 2-step
codeword probe against the live CLIs (codex 0.136.0, opencode 1.16.2, agy 1.0.6): codex,
opencode, and antigravity each reused the SAME captured session id on step 2 and recalled
the step-1 codeword. All three pass a new prompt on resume (codex/agy/op accept a fresh
prompt alongside the resume flag), which is the headless analogue of the interactive
"continue this session with X" the user couldn't find a manual command for. Note the
interactive codex resume <id> (TUI) has no way to inject a prompt non-interactively —
the headless path codex exec resume <id> <prompt> is the one that works.
Resume reuses a harness's prior session on pause/resume and on any retry that has a
tracked session ID. The runner is harness-agnostic: it carries result.SessionID into
the next attempt's RunOptions.ResumeSessionID (runner.go ~:999/:1439) and persists it to run-state. A harness contributes to resume only if it BOTH (a) captures its session ID into TryResult.SessionID, and (b) passes its resume flag when `ResumeSessionID != ""`. Current truth:
| Harness | ResumeSupported() | Captures session? | Resume flag | Verified |
|---|---|---|---|---|
| claude | true | ✅ | `--resume <id>` | works |
| antigravity | true | ✅ | `--conversation=<id>` | works |
| codex | true | ✅ (thread.started) | `exec [flags] resume <id>` | end-to-end CLI proven: same thread reused, prior context retained |
| opencode | true | ✅ (sessionID field, fixed 0.8.7) | `--session <id>` | CLI resume proven; capture was missing pre-0.8.7 |
| gemini | false | n/a | n/a (CLI `--resume` is index/latest only, not a session UUID) | correctly honest-false |
Gotcha: gemini does have -r, --resume but it takes an index number or latest, not
a captured session UUID, so ResumeSupported()=false is the correct honest answer — do
not "fix" it to true. If you find a harness reporting ResumeSupported()=true whose
parseXxxOutput never sets TryResult.SessionID, resume is silently dead (that was the
opencode bug). Drive a real 2-step resume check: have the agent memorize a codeword in
try 1, then resume and ask for it.
Stdin gotcha when probing codex by hand: codex exec treats a piped, still-open stdin
as appended <stdin> input and hangs on "Reading additional input from stdin...". Rally
is unaffected (Go connects the child's stdin to /dev/null → immediate EOF), but a manual
probe launched under a shell/agent that holds stdin open will hang. Redirect < /dev/null
to faithfully mimic rally.
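A small wrapper makes the redirect hard to forget when probing by hand. This is a sketch; `probe_headless` is a hypothetical name, and the session id and prompt in the example are placeholders.

```shell
# Run a command with stdin attached to /dev/null, so it sees immediate EOF
# the same way a child launched by rally does.
probe_headless() {
  "$@" < /dev/null
}

# Example (placeholders for the id and prompt):
#   probe_headless codex exec resume "$SESSION_ID" "What was the codeword?"
```

The wrapper overrides whatever stdin the calling shell or agent is holding open, so the probe behaves the same when launched from a pipeline.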
Linux freeze behavior: Two paths — classicFrozen (log silent + no connections) fires once connections drop (either after task completion or after rate-limit timeout). connectedFrozen (log silent + connections open + no syscall I/O for 5 min) catches agents holding a connection open indefinitely (e.g., current opencode-go monthly-limit behavior). Expect opencode-go limit probes to take about 5 minutes before Rally can classify and bench from server-log evidence.
When an agent CLI is broken/unauthed, verify that rally's retry and resilience handling works correctly (pause recorded, relay continues or ends gracefully) rather than treating it as a rally failure.
Test-artifact leak: some real-backend tests (e.g. OpenCodeRelay) can leave task
output files (opencode-e2e.txt, step-3.txt, …) in internal/relay/ because the
agent occasionally writes into the package dir rather than its temp workspace. After a
RALLY_TEST_REAL_AGENTS=1 run, check git status and git checkout/rm these strays
before committing — they are not real changes and opencode-e2e.txt is already an
accidentally-tracked artifact.
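Untracked strays can be listed by filtering the porcelain status output. The helper below is a sketch with a hypothetical name; it reads `git status --porcelain` on stdin and keeps only untracked entries (lines starting with `??`) under internal/relay/.

```shell
# Print untracked paths under internal/relay/ from porcelain status on stdin.
relay_strays() {
  sed -n 's|^?? \(internal/relay/.*\)$|\1|p'
}

# Example, run from the repo root:
#   git status --porcelain | relay_strays
```

Because opencode-e2e.txt is tracked, a rewrite of it shows up as a modification rather than an untracked file, so it is not caught by this filter and still needs a manual git checkout.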
4. Reporting
After testing, compile a concise report:
## Smoke Test Results — rally vX.Y.Z
### Passed
- Basic claude relay: ✓ (file created, try record written, commit hash shown)
- Monitor status line: ✓ (⚠ slowing indicator, no token estimate)
- Config validation: ✓ (invalid harness name, missing default route, route-name-as-entry)
- Routes: ✓ (routes check, default route relay)
- Laps integration: ✓ (auto-detected, hooks installed, both tasks completed)
- Custom harness: ✓ (mycode:kimi resolved, relay ran, correct agent_mix)
- Resume interactive prompt: ✓ (detected unfinished relay, keep/overwrite mix)
- --new flag: ✓ (old relay closed, new relay started)
- --resume flag: ✓ (non-interactive, found paused agent, waited)
- Weighted mix cc:2: ✓ (2 claude iterations completed)
- Log scoping: ✓ (tries in data_dir/tries/REPOKEY/ per-workspace)
- [N/M] header: ✓ (shows iteration-within-relay / target, e.g. [1/3])
- Rally progress command: ✓ (summary.jsonl updated)
### Failed / Degraded
- OpenCode rate-limited models: hang silently; `classicFrozen` fires ~130s after connections drop, agent paused (working as intended)
### Observations
- `rally tail --try N` uses global try IDs from the shared data_dir; across multiple repos in the same session, try 1 from repo A and try 1 from repo B go to different subdirectories (fixed), but the `--try N` flag maps to local store IDs, not data-dir IDs
5. Keeping this skill current
Update this skill during the session — not as an afterthought at the very end. Concretely:
- Step 1 of the default workflow (section above) is to update the skill with any new user guidance before testing. Do this so the slug list, env-state notes, and workflow guidance are right when you reach for them later.
- Section 1's slug table is the canonical source. Edit it any time the user names a model, and propagate the change down into example snippets if needed.
- Delete stale failure notes in section 3 rather than layering caveats. If gemini works now, the "fails with exit 41" line is misleading — remove it.
- End-of-session pass: before writing the handoff, re-skim sections 1, 3, and 5 once more. Anything that was true at the start of the session but isn't anymore? Fix it.
What goes here vs. elsewhere:
- This skill: how to test-drive, current env state, environment-specific quirks, slug list.
- AGENTS.md / README.md / source: what rally does, terminology, release flow.
- Memory: durable cross-session preferences and references (e.g. "user prefers fixing root cause over patching symptoms"). Not slug lists or test recipes — those belong here.
- ./tmp/session-handoff.md: a single rolling doc with what this session did and what's outstanding. Overwrite each session.
Do not duplicate rally's own documentation here — the authoritative source is the source code, AGENTS.md, and README.md. This skill captures how to test-drive, not what rally can do.
6. Adversarial verification of an OpenSpec change (reference)
Use this when asked to verify that a specific OpenSpec change (e.g. a branch like
agent-lifecycle) is actually functional — especially when its tasks.md is already
all checked off. "All tasks checked" and "all tests green" are not evidence the
behavior works. Your job is to disprove that, then patch what breaks. Treat the change
as guilty until proven innocent.
6.1 Build the checklist from the spec, not the tasks
- Read `openspec/changes/<change>/{proposal,design,spec,tasks}.md`. The `spec.md` Requirements + Scenarios are your test oracle: each `#### Scenario:` is a behavior you must observe directly. The `tasks.md` boxes tell you what the author intended to do, not what works.
- For every scenario, write down the observable behavior that proves it, and the cheapest way to observe it for real (drive the CLI, inspect state files, watch the monitor). A scenario you can only "confirm" by reading code is not yet verified.
- Map each spec requirement to the code that implements it (grep the files named in the proposal's Impact section), then to the test that guards it. Note any requirement with no behavioral test — that's where bugs hide.
6.2 Behavioural testing over unit-trust
- Drive the real thing. Prefer running the actual agent CLI / a real `rally relay` over trusting a unit test. The pre-built real-backend suite (section 0) is the baseline; add targeted manual drives for the change's specific behaviors.
- Two-step behavioral probes beat assertions. To verify resume, have the agent memorize a codeword in try 1, then resume and ask for it back; if it answers, resume genuinely works (this is how codex/opencode resume were proven). To verify graceful cancel, send the signal and confirm the child actually dies within the window. Design a probe whose success is only possible if the behavior is real.
- Verify the WHOLE data-flow chain, both ends. Half-wired features are the #1 "looks-done" defect: a flag is passed but the value it depends on is never captured (the opencode resume bug: `--session` was wired, but `parseOpenCodeOutput` never captured `sessionID`, so the session was always empty and resume silently no-op'd). For any "pass X downstream" feature, also confirm X is produced upstream.
- Distrust contract tests that inject their own precondition. A test that sets `ResumeSessionID` and asserts the flag appears proves nothing about whether a session is ever captured in the wild. When a capability flag (`ResumeSupported()`, `RotateSupported()`, …) claims `true`, prove the capability is honest end-to-end, and prefer a contract test that fails if a new harness claims the capability without wiring the full chain (see `TestResumeSupportImpliesSessionCapture`).
- Confirm CLI invocations against reality. Don't assume an arg string is correct; check it against the CLI's `--help` AND a real run. Subcommand/flag placement (`codex exec [flags] resume <id> …`), flag-takes-index-vs-id (gemini `--resume`), and client/server cwd resolution (opencode `--dir`) all bit this change.
- Shake out races. Run suspect packages with `-count=3` and `-race`; flaky "file already closed" / pipe errors are usually a real ordering bug (the codex `cmd.Wait()`-before-drain race), not noise.
6.3 Common "looks-done" defect classes (check each)
- Half-wired data flow: downstream consumer wired, upstream producer missing.
- Dishonest capability flag: `XxxSupported()==true` but the path no-ops.
- Stale tests/docs from a prior change: a behavior changed (e.g. pause now needs repeated infra failures) but old tests/skill text still assert the old contract. `git blame` the relevant condition to see whether a failure is a new regression or a pre-existing stale assertion, and fix the test/doc to the current behavior (don't re-assert the old one).
- CLI arg assumptions: wrong subcommand/flag placement or semantics.
- Environment/isolation leaks: agent writes into the test/process cwd instead of its workspace; check `git status` after a real-backend run for strays.
6.4 Catch → patch → prove loop
When a probe fails, fix it in-session (the user expects fixes, not just a report):
- Root-cause it by tracing the data flow in code, not by patching the symptom.
- Make the minimal surgical fix matching surrounding style.
- Add a regression test that fails before and passes after — and prove it catches the bug: temporarily revert the fix, watch the test go red, restore it, watch it go green. (A regression test you never saw fail might be asserting nothing.)
- Re-run the unit suite, the relevant real-backend test, and your behavioral probe.
- Bump VERSION (patch for bugfixes on an in-flight change; see section 1) and commit code fixes separately from skill/doc updates.
6.5 Cover every harness; divide and conquer
The change usually touches all executors. Drive a simple task on each harness
(cc, cx, ge, op, ag/agy) plus the change-specific behavior. Use the current
slugs (section 1). For large permutation sets, fan out to short-lived subagents (one
harness or one scenario each) and keep each subagent task small and concrete — but do the
spec reading, checklist, and root-causing yourself so the judgment stays in one place.