name: "GA Chatbot AFK Harness"
description: "End-to-end autonomous (AFK) harness that develops AND QAs the GA chatbot. Orchestrates the existing Level-3 /auto-optimize dev loop + the semantic /ga-chatbot-qa-panel judge workflow, instruments every step to a dashboard the human can watch, and enforces branch-only / never-merge safety. Use to run unattended chatbot-quality improvement with full human visibility. Refuses to run if the backend is down, a killswitch/halt marker is present, or scope_boundary is violated."
allowed-tools: Read, Write, Edit, Glob, Grep, Bash, Agent, AskUserQuestion, Skill
last_verified: 2026-05-29
/ga-chatbot-afk-harness
The orchestration layer for fully-AFK GA chatbot development + QA. It does NOT reimplement the improvement loop — it wires together pieces that already exist and adds the three things they lacked: a semantic QA signal, live instrumentation for a human dashboard, and one safe "arm and walk away" entry.
                ┌──────────────────────── this harness ────────────────────────┐
    preflight → │ QA snapshot        dev cycles               QA snapshot      │ → exit
                │ (det oracle +      (/auto-optimize,         (re-measure)     │
                │  semantic panel)    Iron Laws + rollback)                    │
                └───────────────── status → dashboard at every step ───────────┘
What it composes (do NOT duplicate these)
| Piece | Role | Owned by |
|---|---|---|
| `Scripts/run-prompt-corpus.ps1` | deterministic oracle (`pass_pct`) | existing |
| `/ga-chatbot-qa-panel` (workflow) | semantic judge panel → `state/quality/chatbot-qa-semantic/` | this session |
| `/auto-optimize` | Level-3 dev loop: pick worst → fix → roundtrip → commit/revert | existing |
| `/chatbot-qa-roundtrip-validate` | per-commit rollback gate | existing |
| `state/quality/chatbot-qa/baseline.json` | metric + `scope_boundary` + caps | existing |
| `Scripts/afk-harness-status.ps1` | instrumentation writer → dashboard | this session |
| `Tools/afk-dashboard/index.html` | human UI | this session |
Inputs (skill args, key=value)
| Input | Default | Meaning |
|---|---|---|
| `target_metric` | `0.97` | exit when deterministic `pass_pct` ≥ this |
| `max_iterations` | `10` | hard cap on dev cycles (also bounded by baseline caps) |
| `batch_size` | `1` | `/auto-optimize` cycles per status-emit batch (smaller = livelier dashboard) |
| `branch` | `afk/chatbot-qa-<date>` | the harness commits here; it NEVER pushes/merges |
| `semantic` | `true` | run `/ga-chatbot-qa-panel` for the richer signal each snapshot |
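
As an illustration of how the `key=value` inputs and their defaults fit together, here is a minimal Python sketch. `HarnessArgs` and `parse_args` are hypothetical names, not part of the harness; the range check on `target_metric` is an added assumption (it reflects that the target is a 0..1 fraction), and the date suffix is passed in by the caller because the `<date>` format is not specified here.

```python
from dataclasses import dataclass


@dataclass
class HarnessArgs:
    # Defaults mirror the Inputs table.
    target_metric: float = 0.97
    max_iterations: int = 10
    batch_size: int = 1
    branch: str = ""  # empty means "use afk/chatbot-qa-<date>"
    semantic: bool = True


# One converter per known key; anything else is rejected.
_CONVERTERS = {
    "target_metric": float,
    "max_iterations": int,
    "batch_size": int,
    "branch": str,
    "semantic": lambda v: v.lower() == "true",
}


def parse_args(raw: str, date_suffix: str) -> HarnessArgs:
    """Parse 'key=value key=value' into HarnessArgs (hypothetical helper)."""
    args = HarnessArgs()
    for token in raw.split():
        key, sep, value = token.partition("=")
        if not sep or key not in _CONVERTERS:
            raise ValueError(f"unrecognized argument: {token!r}")
        setattr(args, key, _CONVERTERS[key](value))
    # Assumption: reject targets given on the 0..100 scale by mistake.
    if not 0 < args.target_metric <= 1:
        raise ValueError("target_metric must be a 0..1 fraction")
    if not args.branch:
        args.branch = f"afk/chatbot-qa-{date_suffix}"
    return args
```

Rejecting unknown keys and out-of-range targets up front means a typo such as `target_metric=97` stops the run before any cycle starts, rather than producing a target that can never be met.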
Iron Laws (inherited + added)
Inherits ALL of /auto-optimize's Iron Laws (no commit without roundtrip pass;
no editing protected paths; caps max_commits_per_session=50,
max_wall_clock_minutes=480; always release .lock; honor
state/.loop-halted, state/quality/chatbot-qa/.STOP, and
~/.demerzel/HALT-ALL). The harness adds:
A. NEVER push, open/merge a PR, deploy, or apply a review-bypass label.
The harness produces COMMITS ON A BRANCH only. A human arms it and a
human (or a separate reviewed flow) lands it. This includes DELEGATED
flows: /auto-optimize's Step 4 normally runs `gh pr create` after a batch
— when the harness invokes it you MUST suppress that (see Step 2.0), so an
unattended run never opens a PR behind your back.
B. NEVER fabricate a metric. If the backend is unreachable, emit a
`blocked` status with the real reason and STOP. A degraded run is
recorded honestly, never as pass_pct=100%.
C. EMIT status via Scripts/afk-harness-status.ps1 at every phase boundary
and after every accepted/reverted cycle — code owns the status, so a
crashed loop still leaves an honest last state on the dashboard.
Procedure
Step 0 — Preflight (refuse-to-run gates)
- Pick `run_id = <UTCyyyyMMddTHHmmZ>-afk`. Emit `state=preflight phase=preflight`.
- Halt/killswitch checks (reuse `/auto-optimize` Step 0 exactly): if `state/.loop-halted`, `state/quality/chatbot-qa/.STOP`, or a live `~/.demerzel/HALT-ALL` marker is present → emit `state=killed`, explain, STOP.
- Backend preflight: `POST http://localhost:5252/api/chatbot/chat` with `{"message":"What is a major triad"}`. If the response does not contain a real, non-empty `naturalLanguageAnswer` → emit `state=blocked`, `-Blocker "backend_unavailable: GaChatbot.Api :5252"`, STOP. (Bring it up per `docs/runbooks/chatbot-deploy.md`, then re-arm.)
- Create/checkout `branch`. Emit `state=running phase=qa-snapshot branch=<branch>`.
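
The refuse-to-run gates above can be sketched as two small pure functions. This is only an illustration: `halt_markers_present` and `preflight_state` are hypothetical names, the sketch treats any existing `HALT-ALL` file as "live" (the real liveness rule is not described here), and it takes the already-decoded backend response as a dict rather than performing the HTTP call.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def halt_markers_present(repo_root: Path, home: Path) -> List[str]:
    """Return the halt/killswitch marker paths that currently exist."""
    markers = [
        repo_root / "state" / ".loop-halted",
        repo_root / "state" / "quality" / "chatbot-qa" / ".STOP",
        home / ".demerzel" / "HALT-ALL",  # assumption: existence == live
    ]
    return [str(m) for m in markers if m.exists()]


def preflight_state(markers: List[str],
                    response: Optional[dict]) -> Tuple[str, str]:
    """Map gate results to (state, reason); markers win over backend checks."""
    if markers:
        return "killed", "halt marker present: " + ", ".join(markers)
    answer = (response or {}).get("naturalLanguageAnswer")
    # A missing, non-string, or whitespace-only answer counts as unavailable.
    if not isinstance(answer, str) or not answer.strip():
        return "blocked", "backend_unavailable: GaChatbot.Api :5252"
    return "running", ""
```

Checking the markers before the backend keeps a killswitch authoritative even when the backend happens to be healthy.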
SCALE: read this first. Both oracles report `pass_pct` on a 0..100 scale (e.g. `94`, `100`). `target_metric`, `afk-harness-status.ps1 -DetPct/-SemPct`, and the dashboard are all 0..1 fractions. So you MUST divide every `pass_pct` by 100 before comparing it to `target_metric` or passing it to the status writer. Skipping this makes `94 >= 0.97` true and the harness exits "done" on a 94% run (and the dashboard shows 9400%). Always normalize.
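
The scale mismatch is easy to reproduce. The following is a minimal sketch, with `to_fraction` as a hypothetical helper name, showing the unnormalized comparison passing and the normalized one failing for the same 94% run:

```python
def to_fraction(pass_pct: float) -> float:
    """Convert an oracle pass_pct on the 0..100 scale to a 0..1 fraction."""
    if not 0 <= pass_pct <= 100:
        raise ValueError(f"pass_pct out of range: {pass_pct}")
    return pass_pct / 100


target_metric = 0.97  # fraction, as in the Inputs table
raw = 94              # oracle output on the 0..100 scale

print(raw >= target_metric)               # True: the bug, a 94% run looks "done"
print(to_fraction(raw) >= target_metric)  # False: 0.94 < 0.97
```

Doing the conversion in one place, right where `pass_pct` is read, keeps every later comparison and status emit on the same 0..1 scale.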
Step 1 — Baseline QA snapshot
- Run the deterministic oracle: `pwsh Scripts/run-prompt-corpus.ps1 -Snapshot`. Read `pass_pct` (0..100) from the snapshot and compute `det = pass_pct / 100` → pass as `-DetPct det`.
- If `semantic=true`: run the `/ga-chatbot-qa-panel` workflow; read `pass_pct` (also 0..100) from its `state/quality/chatbot-qa-semantic/<date>.json`, compute `sem = pass_pct / 100` → `-SemPct sem`.
- Emit `phase=qa-snapshot` with both fractions + `target_metric`. If `det >= target_metric` already → skip to Step 3 (done).
Step 2.0 — suppress delegated PR creation (Law A). /auto-optimize's Step 4
opens a PR with gh pr create once commits land. The harness must NOT let that
happen. When you invoke it, append an explicit instruction: "Do NOT run Step 4 /
gh pr create / any push or PR — commit on the current branch only; the AFK
harness owns landing." If a given /auto-optimize version cannot be told to skip
its PR step, do NOT delegate — drive the cycle inline instead (pick worst prompt →
fix → /chatbot-qa-roundtrip-validate → git commit on branch / revert).
Step 2 — Improvement cycles (the AFK loop)
A stable id for THIS harness run so the AFK outer-loop rows (sub-step 4) cluster
together and stay distinct from the inner /auto-optimize per-cycle rows:
$afk_loop_id = "chatbot-qa-afk-$((Get-Date).ToUniversalTime().ToString('yyyyMMddTHHmmssZ'))".
Repeat until det >= target_metric (fractions, per the SCALE note),
max_iterations reached, plateau (baseline plateau_threshold), caps hit, or a
killswitch appears:
1. Run `/auto-optimize domain=chatbot-qa oracle_script_path=Scripts/run-prompt-corpus.ps1 baseline_path=state/quality/chatbot-qa/baseline.json max_iterations=<batch_size>` with the no-PR instruction from Step 2.0. This does the real work: picks the worst prompt, proposes a fix, runs `/chatbot-qa-roundtrip-validate`, and commits on the branch or reverts. (The delegated `/auto-optimize` already appends inner per-cycle rows to `state/quality/loops/chatbot-qa.iterations.jsonl` via its Step 3.8; those carry the deterministic `pass_pct` trajectory, so do not duplicate them here.)
2. After each batch, re-read `pass_pct` and normalize `det = pass_pct / 100`; emit one status update per outcome (always pass fractions to `-DetPct`):
   - accepted → `-Kind commit -Iteration <n> -Commits <c> -DetPct <x> -Event "accepted fix for <prompt-id> (<+Δpp>)"`
   - reverted → `-Kind revert -Iteration <n> -Event "reverted <prompt-id>: <reject reason>"`
3. Re-check killswitch/halt each iteration (Law C of auto-optimize). If set → emit `state=killed`, STOP.
4. Append the AFK outer-loop row to the loop ledger. The harness's value-add over the inner loop is the semantic signal, so record THAT (when `semantic=true`) so the ix-duck loop lens can see semantic vs deterministic divergence across batches. One row per batch:

       # $sem_before/$sem_after are the semantic fractions from /ga-chatbot-qa-panel
       # this batch vs last; $worst is the worst-scoring prompt the panel flagged.
       # $verdict ∈ improved|regressed|plateau ; oracle couldn't run → couldnt_run.
       pwsh Scripts/loop-record.ps1 -LoopId $afk_loop_id -Domain chatbot-qa `
         -Iteration $n -MetricName semantic_pass_pct `
         -MetricBefore $sem_before -MetricAfter $sem_after `
         -Verdict $verdict -OracleStatus $oracle_status `
         -WorstItem $worst -CommitSha $sha

   Skip this row when `semantic=false` (the deterministic trajectory is already covered by the inner `/auto-optimize` rows; don't duplicate it).
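
The loop's exit conditions and the per-batch verdict can be sketched as follows. Both function names are hypothetical. Plateau and cap detection are passed in as booleans because the semantics of `plateau_threshold` and the caps live in the baseline file and are not restated here. The `epsilon` tolerance, and treating a missing measurement as `couldnt_run`, are assumptions made for illustration, and so is the order in which the conditions are checked.

```python
from typing import Optional


def stop_reason(det: float, target_metric: float, iteration: int,
                max_iterations: int, killswitch: bool,
                caps_hit: bool, plateaued: bool) -> Optional[str]:
    """Return why the loop should stop, or None to run another batch.

    det and target_metric are both 0..1 fractions.
    """
    if killswitch:
        return "killed"
    if det >= target_metric:
        return "target_reached"
    if iteration >= max_iterations:
        return "max_iterations"
    if caps_hit:
        return "caps_hit"
    if plateaued:
        return "plateau"
    return None


def classify_verdict(before: Optional[float], after: Optional[float],
                     epsilon: float = 0.0) -> str:
    """Classify a batch from the semantic fractions before and after it."""
    if before is None or after is None:
        return "couldnt_run"  # assumption: no measurement means no verdict
    delta = after - before
    if delta > epsilon:
        return "improved"
    if delta < -epsilon:
        return "regressed"
    return "plateau"
```

Checking the killswitch first means a stop request takes priority even on a batch that also happens to reach the target.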
Step 3 — Final snapshot + exit
- Re-run the deterministic oracle and (if enabled) `/ga-chatbot-qa-panel`.
- Emit `state=done phase=exit` with final `DetPct`/`SemPct`, `commits`, and a `-Event` summarizing outcome vs target.
- Drop a `state/handoffs/<ts>-claude-code.md` note: branch name, commits made, final metrics, and the explicit next human action (review + land the branch). Do not land it yourself.
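
One possible shape for the handoff note body is sketched below. `handoff_note` and the exact line layout are hypothetical; only the four required pieces of content (branch, commits, final metrics, next human action) come from the step above. The function returns text only and leaves the file name to the caller, since the `<ts>` format is not specified here.

```python
from typing import Optional


def handoff_note(branch: str, commits: int, det: float,
                 sem: Optional[float], target_metric: float) -> str:
    """Render the handoff note body; all metrics are 0..1 fractions."""
    outcome = "target reached" if det >= target_metric else "target NOT reached"
    lines = [
        f"Branch: {branch}",
        f"Commits made: {commits}",
        f"Deterministic pass rate: {det:.1%} "
        f"(target {target_metric:.1%}, {outcome})",
        "Semantic pass rate: "
        + (f"{sem:.1%}" if sem is not None else "not run"),
        "Next human action: review the branch and land it. "
        "The harness did not push or merge.",
    ]
    return "\n".join(lines) + "\n"
```

Stating the outcome against the target in the note itself lets the reviewer see at a glance whether the run stopped early.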
Arming modes (how the human runs it AFK)
- Bounded, in-session: `/goal deterministic pass_pct >= 0.97 for chatbot corpus` then `/ga-chatbot-afk-harness target_metric=0.97`. Claude keeps working across turns until the goal evaluator confirms.
- Interval: `/loop /ga-chatbot-afk-harness` (only if repo preflight reports `LOOP_READY=true`).
- Headless / true AFK: `claude -p "/ga-chatbot-afk-harness target_metric=0.97 max_iterations=20"`. Runs unattended; tool prompts follow the configured allowlist.
Watching it (UI)
Serve the repo root and open the dashboard:
python -m http.server 8099 # from repo root
# → http://localhost:8099/Tools/afk-dashboard/
It auto-refreshes every 5s from `state/quality/chatbot-qa/afk-runs/latest-status.json`
plus the run's `.jsonl` event log. Full details in `docs/runbooks/ga-chatbot-afk-harness.md`.
Stopping
Create state/quality/chatbot-qa/.STOP (this domain) or state/.loop-halted
(all loops). The harness checks both every iteration and exits gracefully,
leaving completed commits intact.
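
For completeness, creating either marker from a script might look like the sketch below. `request_stop` is a hypothetical helper; the marker paths are the two named above, and the sketch assumes an empty file is enough to trigger the stop.

```python
from pathlib import Path


def request_stop(repo_root: Path, all_loops: bool = False) -> Path:
    """Create the stop marker for this domain, or for all loops."""
    if all_loops:
        marker = repo_root / "state" / ".loop-halted"
    else:
        marker = repo_root / "state" / "quality" / "chatbot-qa" / ".STOP"
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()  # assumption: presence of the file is the signal
    return marker
```

Remember to delete the marker before re-arming, since the Step 0 preflight refuses to run while it exists.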