name: ir:onboarding-factory/assess
description: >
Judge one (agent, scenario) cell across the three pillars — agent capability,
daemon sensor capture, driver capability — on cited evidence, then author the
cell's recipe and machine-checkable spec. Writes the cell metadata via
of cell write and the spec (expected.jsonl) via of cell spec. No live
recording. Invoked as /ir:onboarding-factory assess <agent> <scenario>.
assess
You run as a focused subagent with no parent context. Do the research YOURSELF (web + file access) — don't bounce work back to the dispatcher. This verb spends NO API tokens on agent CLIs and runs NO recording. When done, return only the "Return contract" block.
What this produces
For one cell it writes two artifacts, both through the factory (never by hand):
- The cell. of cell write writes replaydata/agents/<agent>/scenarios/<id>_<scenario>/metadata.json: the three-pillar verdict + confidence + a note, the full reasoning + caveats + sources, and the per-agent recipe (the driver step sequence record will run).
- The spec. of cell spec writes expected.jsonl in the same folder: the machine-checkable phases AND the observation assertions (model / cost / tokens / agent) that record's verify step checks against the recording.
The route the dispatcher reads off of status is DERIVED from the three
pillars — see the routing table in ../return-contract.md.
The three pillars (judge each, cite each)
Read the pillar definitions in ../return-contract.md.
Three rules govern every verdict:
- Honest verdicts, anchored to evidence. agent=yes only when the docs/code state the behavior explicitly; no when something fundamental blocks it; unknown over a guess. Name the RIGHT owner on the daemon pillar: bug (product, file an issue), incapable (architecture), and the driver pillar's gap:<primitive> (tooling) each route differently. Don't park ambiguity in bug the way a catch-all partial once was a dumping ground.
- Caveats over downgrades. If the canonical spec is met but a narrow detail is gappy, keep daemon=full and put the gap in caveats. A caveat is NOT a bug: reserve bug for a spec-required observable the daemon mis-handles.
- Cite primary sources. Agent docs, official changelog, agent source, irrlicht adapter source, and, for bug/incapable, the recording's events.jsonl. Tutorials and blogs don't count. Even an unknown verdict cites what you searched.
Evidence rule. Any verdict other than daemon=full / driver=ready MUST be
anchored in sources[] to a cited event or code/doc reference — never a
plausible-sounding mechanism. The bar for bug / incapable / gap:* is a
concrete citation; finer granularity multiplies the chance of mislabeling.
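
The evidence rule can be read as a mechanical pre-check before writing. The following is a minimal sketch, not the factory's own validator: the function name is hypothetical, and it assumes the assessment is a dict using the field names from the metadata.json shape (daemon_capability, driver_capability, sources). It only confirms that a citation exists; whether the citation is a real event or code/doc reference still needs a human read.

```python
def evidence_rule_violations(assessment: dict) -> list[str]:
    """Return the pillars whose non-default verdict has no cited source.

    Hypothetical helper: any verdict other than daemon=full / driver=ready
    must be anchored in sources[], so an empty sources list is a violation.
    """
    sources = assessment.get("sources") or []
    # A source counts only if it actually names a reference.
    has_citation = any(s.get("ref") for s in sources)

    flagged = []
    if assessment.get("daemon_capability") != "full":
        flagged.append("daemon_capability")
    if assessment.get("driver_capability") != "ready":
        flagged.append("driver_capability")

    return [] if has_citation else flagged
```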
Steps
1. Read the scenario spec
of status --scenario <scenario> --json # the cell's current pillars + route
of scenario show --name <scenario> # the scenario's description + process + acceptance_criteria
Capture the user-observable signal the scenario asserts (from its
acceptance_criteria + process) — the state arc, counts, links, metrics. Each
is a candidate assertion you judge the daemon pillar against.
2. Confirm the agent's surface BEFORE writing agent=no
The bar for agent=no is higher than "<agent> --help doesn't mention it" —
many features live inside the REPL as slash commands or hooks, not top-level
flags. Before locking in no:
- strings <agent-binary> | grep -iE "<feature>|/<slash>" for the feature's keywords: slash syntax, telemetry event names, preamble constants, error strings. This catches REPL-only features --help never lists. (The canonical miss: claude --help lacks --goal, but strings $(which claude) | grep -i goal surfaced the /goal autonomous-loop command, flipping the verdict to yes.)
- Search the agent's docs / changelog / source repo for the same keywords. Vendor docs lag the binary; the binary's strings are authoritative for "what shipped."
- If the scan still finds nothing, agent=no is honest, and sources[] MUST cite the empty binary scan so future audits don't re-litigate it.
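
Where the strings tool is unavailable, the same scan can be approximated in a few lines. This is a rough sketch under stated assumptions: scan_binary_strings is a hypothetical name, and the extraction only mimics strings by treating runs of four or more printable ASCII bytes as candidates, so it may differ from the real tool on edge cases.

```python
import re


def scan_binary_strings(data: bytes, pattern: str, min_len: int = 4) -> list[str]:
    """Approximate `strings <binary> | grep -iE <pattern>` over raw bytes."""
    # Runs of printable ASCII at least min_len bytes long.
    run_re = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    matcher = re.compile(pattern, re.IGNORECASE)
    hits = []
    for run in run_re.findall(data):
        text = run.decode("ascii")
        if matcher.search(text):
            hits.append(text)
    return hits
```

To scan a real binary, pass the file's bytes (for example from pathlib.Path(...).read_bytes()). An empty result is the "empty binary scan" that sources[] should cite, along with the pattern that was searched.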
3. Read the adapter transport (grounds the daemon pillar)
core/adapters/inbound/agents/<agent>/
agent.go # Source variant (FilesUnderRoot / FilesUnderCWD / ProcessOwnedStore), ProcessMatcher, PID discovery
<parser>.go # which event kinds the daemon can emit from this agent
For each user-observable signal from Step 1, ask "what event in events.jsonl
would prove this?" then "does this adapter's parser produce that event today?"
- yes for all, handled correctly → daemon=full.
- a trace exists but the daemon mis-handles a spec-required observable → daemon=bug (cite the event; the cell records known_failing, and record hands an issue: payload back for the dispatcher to file).
- no trace at all (cloud session with no local file; behavior the 3-state model can't represent) → daemon=incapable.
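
The three outcomes above can be sketched as a small decision function. This is illustrative only: SignalFinding and daemon_verdict are hypothetical names, the findings are assumed to cover spec-required observables only (narrow gaps belong in caveats), and the precedence of "no trace" over "mis-handled" when both occur is an assumption the text does not settle.

```python
from dataclasses import dataclass


@dataclass
class SignalFinding:
    name: str
    trace_exists: bool       # some event in events.jsonl would prove the signal
    handled_correctly: bool  # the daemon handles that event per the spec


def daemon_verdict(findings: list[SignalFinding]) -> str:
    """Map per-signal findings to full / bug / incapable."""
    # Assumption: a signal with no trace outranks a mis-handled one.
    if any(not f.trace_exists for f in findings):
        return "incapable"
    if any(not f.handled_correctly for f in findings):
        return "bug"
    return "full"
```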
Observation vs emission. "The agent performs the behavior" (agent) and
"the signal reaches the Source the daemon tails" (daemon) are DIFFERENT
questions — don't let a plausible parser read collapse them. A daemon=full
derived purely from reading the parser is PROVISIONAL for any property about
what the agent writes to its transcript (streaming, partial flushes,
ordering, atomicity): the parser may handle a trace the agent never emits. When
the verdict hinges on emission you can't confirm from docs, keep confidence
low, say so in caveats, and let record promote it from provisional to
settled — a live recording is the only thing that can.
4. Author the recipe + judge the driver pillar
Write the per-agent recipe (the driver step sequence) that elicits the
behavior, specializing the scenario's agent-agnostic process. Template from
claudecode's recipe for the same scenario when one exists. For a cell asserting
the full lifecycle arc, prefer an INTERACTIVE recipe when the agent's headless
mode exits at turn completion (the process must outlive the daemon's observation
window, or the settle/teardown phases validate as missing).
Every interactive (script) recipe MUST carry the two fields the driver reads
positionally: timeout_seconds (the per-cell turn budget in seconds — size
it to the scenario; 120 is the floor) and settings (the agent settings
blob, or {} when none). of cell write defaults both when omitted and of validate REJECTS a script recipe missing timeout_seconds — its absence once
reached a driver as the literal null and crashed it. Headless (prompt) and
applicable:false recipes don't need them.
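
A pre-flight check for these two fields might look like the sketch below. It is not the of validate implementation: lint_script_recipe is a hypothetical name, and it assumes a script recipe is recognizable by its script key, as in the metadata.json example.

```python
TIMEOUT_FLOOR_SECONDS = 120  # the floor named in the text


def lint_script_recipe(recipe: dict) -> list[str]:
    """Return problems for an interactive (script) recipe; [] for other kinds."""
    if "script" not in recipe:
        # Headless and applicable:false recipes don't need the two fields.
        return []

    problems = []
    timeout = recipe.get("timeout_seconds")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(timeout, int) or isinstance(timeout, bool):
        problems.append("timeout_seconds missing or not an integer")
    elif timeout < TIMEOUT_FLOOR_SECONDS:
        problems.append(
            f"timeout_seconds {timeout} is below the {TIMEOUT_FLOOR_SECONDS}s floor"
        )
    if not isinstance(recipe.get("settings"), dict):
        problems.append("settings missing (use {} when there are none)")
    return problems
```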
Then judge the driver pillar against the agent's interactive driver:
source tools/onboarding-factory/scripts/lib/recipe-lint.sh
driver_step_types_from_file replaydata/agents/<agent>/driver-interactive.sh
If the recipe needs a step type the driver lacks (keys, reset_session,
restart, sigkill, …) → driver=gap:<primitive>. This is tooling work, NOT
an observability limit — don't let a driver gap masquerade as incapable. The
cell stays a real cell with a real recipe; record ports the missing step from
the reference driver before it drives. (First rule out a false gap: an
inline-argument slash command like /model <id> is a slash step, not a keys
gap.)
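
Once the driver's supported step types are known, the comparison is a set difference. The sketch below assumes the step types have already been collected into a Python set (for instance from the output of driver_step_types_from_file); driver_verdict is a hypothetical name, and joining several missing primitives with commas is an assumption, since the text only shows the single-primitive form gap:<primitive>.

```python
def driver_verdict(script: list[dict], supported: set[str]) -> str:
    """Return "ready", or "gap:<primitive>" naming the step types the driver lacks."""
    missing: list[str] = []
    for step in script:
        step_type = step["type"]
        # Keep first-use order and avoid duplicates.
        if step_type not in supported and step_type not in missing:
            missing.append(step_type)
    if not missing:
        return "ready"
    return "gap:" + ",".join(missing)
```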
5. Author the spec (expected.jsonl)
Write the machine-checkable spec as JSONL. The first line is the meta object; subsequent lines are phases:
{"schema_version":1,"notes":"<what this asserts>","observations":{"model":"<id>","cost_nonzero":true,"tokens_nonzero":true,"agent":"<agent>"}}
{"phase":"birth","anchor":"start", ...}
{"phase":"settle","from":"working","to":"ready", ...}
(of cell spec forces scenario_id onto the meta line — you don't write it.)
- Phases assert the user-observable arc only: state transitions, distinct
session counts, parent-links, lifecycle. No internal flags, event kinds,
reasons, or rule numbers. Anchor the birth by the adapter's session model:
  - Single-birth adapters (a stable session_id from launch, e.g. claudecode): anchor the FIRST phase to "start" UNPINNED so a transient proc-<PID> presession row can't steal the birth and cascade failures.
  - Presession adapters (a transient proc-<PID> row reconciles into a real session with a DIFFERENT session_id; codex, gemini-cli): model the birth as TWO phases and anchor every post-birth phase to the REAL session, never the presession row (which never goes working):
    {"phase":"presession_birth","expected_state":"ready","relative_to":"start"}
    {"phase":"session_birth","expected_state":"ready","relative_to":"presession_birth","new_session":true}
    ... every later phase: "same_session_as":"session_birth" ...
    Collapsing these into a single birth (so session_birth lands on the presession proc-row, which never goes working) is the miss that verified nearly every multi-phase presession cell PARTIAL until record re-anchored it. Template from the codex sibling spec.
- Observations assert the websocket metric vector the verify engine checks: exact-match categorical fields (model, agent), non-zero + tolerance for cost/tokens. This is the widened verify the factory added: a recording is verified on token/usage/cost/model, not just lifecycle state. Don't hard-pin a doc-guessed model. Vendor docs lag the binary, and a wrong model exact-match fails every recording until record corrects it (assessed gemini-3-flash-preview; reality gemini-3.5-flash). Pin model only when you can confirm it from the live surface (the binary's strings, a shipped config default); otherwise leave it provisional and let record fill it from the recording. Never pin a cost figure; assert non-zero only.
- For a daemon=bug cell, set known_failing in the meta and keep the spec asserting the CORRECT behavior; never weaken it to match the bug.
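
Before handing the file to of cell spec, the presession anchoring rule can be self-checked. The sketch below is an assumption-laden illustration, not the verify engine: parse_spec and presession_anchor_errors are hypothetical names, and the check takes the phase names presession_birth / session_birth and the same_session_as field literally from the example above.

```python
import json


def parse_spec(text: str) -> tuple[dict, list[dict]]:
    """Split expected.jsonl text into (meta, phases)."""
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not rows or "phase" in rows[0]:
        raise ValueError("first line must be the meta object")
    return rows[0], rows[1:]


def presession_anchor_errors(phases: list[dict]) -> list[str]:
    """Check the two-phase birth and the anchoring of every later phase."""
    errors = []
    names = [p.get("phase") for p in phases]
    if names[:2] != ["presession_birth", "session_birth"]:
        errors.append("birth must be modeled as presession_birth then session_birth")
    for phase in phases[2:]:
        if phase.get("same_session_as") != "session_birth":
            errors.append(f"phase {phase.get('phase')!r} is not anchored to session_birth")
    return errors
```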
6. Write both artifacts through the factory
# metadata.json: the verdict lives in details.assessment; recipe in details.recipe
of cell write --agent <agent> --scenario <scenario> --file /tmp/<agent>-<scenario>.metadata.json
# expected.jsonl: the spec
of cell spec --agent <agent> --scenario <scenario> --file /tmp/<agent>-<scenario>.expected.jsonl
of validate
The metadata.json shape. details.assessment is the verdict of record — it
MUST carry the three pillar enums + confidence alongside the reasoning, because
the matrix reads its routing/disposition straight from there. The metadata
overview tier is DERIVED: of cell write mirrors the pillars + confidence from
details.assessment into it, so you don't hand-write (or risk drifting) the
overview copy — fill notes/version fields there and leave the pillars to the
mirror. (of cell write also forces scenario_id.)
{
"metadata": {
"notes": "<one-line excerpt of the verdict>",
"agent_cli_version": "<x.y.z>", "daemon_version": "<x.y.z+sha>"
},
"details": {
"assessment": {
"schema_version": 1, "scenario_id": "<scenario>", "agent": "<agent>",
"agent_supports": "yes", "daemon_capability": "full", "driver_capability": "ready",
"confidence": 0.8,
"body": "## Verdict ...markdown reasoning...",
"caveats": ["..."],
"sources": [{"kind":"url|file","ref":"...","note":"..."}]
},
"recipe": { "timeout_seconds": 120, "settings": {}, "script": [ {"type":"send","text":"..."}, {"type":"wait_turn"}, {"type":"sleep","seconds":10} ] }
}
}
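
A quick shape check on details.assessment can catch a missing pillar before of cell write does. This is a sketch, not the factory's schema: assessment_problems is a hypothetical name, the enum values are taken from the verdict line of the return contract, and treating confidence as a number between 0 and 1 is an assumption based on the 0.8 example.

```python
AGENT_VALUES = {"yes", "no", "unknown"}
DAEMON_VALUES = {"full", "bug", "incapable", "n/a"}


def assessment_problems(metadata_doc: dict) -> list[str]:
    """Return problems with the pillar enums and confidence in details.assessment."""
    a = metadata_doc.get("details", {}).get("assessment", {})
    problems = []
    if a.get("agent_supports") not in AGENT_VALUES:
        problems.append("agent_supports must be yes, no, or unknown")
    if a.get("daemon_capability") not in DAEMON_VALUES:
        problems.append("daemon_capability must be full, bug, incapable, or n/a")
    driver = a.get("driver_capability")
    is_gap = isinstance(driver, str) and driver.startswith("gap:") and len(driver) > 4
    if driver != "ready" and not is_gap:
        problems.append("driver_capability must be ready or gap:<primitive>")
    conf = a.get("confidence")
    # Assumption: confidence is a fraction in [0, 1]; bools are rejected.
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0 <= conf <= 1:
        problems.append("confidence must be a number in [0, 1]")
    return problems
```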
7. Surface recording prerequisites — but do NOT commit
If recording this cell needs a human action (auth switch, env var, mock,
unavailable provider) name it — it becomes prereqs in your return and the
dispatcher relays it to the human. If the cell is recordable now, prereqs: none.
Do NOT git commit. You run in a parallel assess wave, and N subagents
committing to the one worktree at once will race (scrambled attribution,
stranded resets). Write both artifacts via of and return; the dispatcher stages
and commits each cell serially after the wave (it knows your cell from the
dispatch). Leaving replaydata/ dirty for the parent is correct here.
If the route is frozen, you're done after the write — the metadata
documents why the cell is frozen and what would unblock it; no spec phases are
needed beyond the meta. driver-gap and record / record-known-failing
all keep the recipe + spec so record can proceed (the driver-gap cell records
the moment record ports the missing step).
Return contract
Return ONLY this (≤6 lines). Shared semantics + envelope rules live in
../return-contract.md:
verdict: agent=<v> daemon=<full|bug|incapable|n/a> driver=<ready|gap:*> (confidence <n>)
route: record | record-known-failing | driver-gap | frozen
summary: <one sentence — the load-bearing reason, citing the anchoring event/code for any non-full/non-ready verdict>
wrote: metadata.json + expected.jsonl (via of cell write / of cell spec) — UNCOMMITTED; the dispatcher commits
prereqs: <human action recording needs, or "none">
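
The block's shape is simple enough to self-check before returning. The sketch below is a hedged illustration: return_block_problems is a hypothetical name, requiring the five keys in exactly this order is an assumption, and the authoritative envelope rules remain those in ../return-contract.md.

```python
ROUTES = {"record", "record-known-failing", "driver-gap", "frozen"}
KEYS = ["verdict", "route", "summary", "wrote", "prereqs"]


def return_block_problems(block: str) -> list[str]:
    """Check line count, key order, and the route value of a return block."""
    lines = [ln for ln in block.strip().splitlines() if ln.strip()]
    problems = []
    if len(lines) > 6:
        problems.append("more than 6 lines")
    # Split on the first colon only; values may contain colons themselves.
    keys = [ln.split(":", 1)[0].strip() for ln in lines]
    if keys != KEYS:
        problems.append(f"expected keys {KEYS}, got {keys}")
    for ln in lines:
        key, _, value = ln.partition(":")
        if key.strip() == "route" and value.strip() not in ROUTES:
            problems.append(f"unknown route {value.strip()!r}")
    return problems
```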
Anti-patterns
- Don't write replaydata/ by hand. of cell write + of cell spec are the only writers; they validate and force the FK.
- Don't reach for bug/incapable for a narrow gap. Spec met → daemon stays full; use caveats.
- Don't conflate the two observability axes. A missing driver step is driver=gap:<prim> (tooling), never daemon=incapable (architecture); mislabeling routes the fix to the wrong owner.
- Don't fabricate sources. An empty/honest sources with low confidence beats a fake citation that poisons future re-assessments.
- Don't pin a doc-guessed model/cost. Confirm model from the live surface or leave it provisional for record to fill; assert cost non-zero, never a figure. A doc-derived model string fails every recording's verify.
- Don't set confidence ≥ 0.9 from general knowledge. That band is for "the docs literally say this" / "the source has the exact behavior." 0.7–0.85 is the honest band for a thorough multi-source read.
- Don't run a recording. That's record's job; this verb is doc + code research plus the spec.