---
name: codex-autoresearch
description: Run Codex Autoresearch end to end from one plugin skill. Use when Codex should start, resume, inspect, dashboard, deep-research, iterate, log, or finalize measured optimization loops using autoresearch.md, autoresearch.jsonl, quality_gap scratchpads, or the local CLI helpers.
---
Codex Autoresearch
This is the one skill surface and the only Codex-facing skill. Do not route users to old subskills, slash commands, or separate dashboard/finalizer skills.
Default state machine:
setup -> doctor -> next -> log -> state -> finalize-preview
The job: make one measured improvement loop trustworthy enough that a human can follow it and a future session can resume it.
For Codex
- Use the short command path unless the session is ambiguous or blocked: `setup`, `doctor`, `next`, `log`, `state`, then `finalize-preview`.
- For qualitative or deep-research improvement loops, start with `research-start --cwd <project> --slug <slug> --goal "<goal>"`. It creates the scratchpad, configures `quality_gap`, validates the command, and records the first baseline as `measure` unless `--no-baseline-log` is passed.
- Use advanced diagnostics only when needed. Check `node scripts/autoresearch.mjs --help --all` from the package root before naming a less common command.
- Use `new-segment` when the active segment is maxed, stale, phase-changing, or no longer comparable.
- Prefer CLI JSON and durable session state over chat memory: `autoresearch.md`, `autoresearch.jsonl`, `autoresearch.ideas.md`, active last-run/progress snapshots under `.git/autoresearch/` in Git repos, fallback `autoresearch.last-run.json` / `autoresearch.progress.json` outside Git, and `autoresearch.research/<slug>/`.
- Keep every packet decision recoverable through `METRIC name=value`, packet evidence, ASI, continuation data, promotion labels, and the ledger.
- Before another packet, read `recommend-next --compact` or `state --compact`; obey blockers. Compact-state field names: `docs/concepts.md#state-fields`.
- `benchmark-lint` must prove the primary `METRIC` contract before product packets are trusted.
- Configure `commitPaths` or pass `--commit-paths` for kept results in Git repos.
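The `METRIC name=value` convention above can be illustrated with a small parser. This is a minimal sketch, not the plugin's own parser: the function name `parseMetricLines`, the sample output, and the assumption that each metric sits alone on its own line are all hypothetical.

```javascript
// Hypothetical helper, not part of the plugin.
// Assumes each metric is printed on its own line as `METRIC name=value`.
function parseMetricLines(output) {
  const metrics = {};
  for (const line of output.split(/\r?\n/)) {
    const match = /^METRIC\s+([A-Za-z_][\w.-]*)=(\S+)$/.exec(line.trim());
    if (!match) continue;
    const value = Number(match[2]);
    // Keep only finite numbers; NaN and Infinity are dropped.
    if (Number.isFinite(value)) metrics[match[1]] = value;
  }
  return metrics;
}

// Made-up benchmark output for illustration.
const sampleOutput = [
  "building...",
  "METRIC latency_ms=12.5",
  "METRIC quality_gap=0",
  "METRIC broken=NaN",
].join("\n");

console.log(parseMetricLines(sampleOutput));
```

Dropping non-finite values here mirrors the contract that trusted statuses need a finite primary metric; how the real CLI treats such lines is not specified in this document.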
For the user
- Plain-language prompts work: "/goal @Codex Autoresearch improve this repo."
- Ask only for essentials that materially change setup: goal, benchmark, primary metric, direction, scope, or correctness checks.
- For shippable, product, or final requests, identify product claims before setup. Retrieval, search, ranking, lazy behavior, accessibility, safety, or performance work needs a quality constraint or checks path before promotion.
- Stay on the CLI happy path unless setup is ambiguous, the user asks for the dashboard, packet freshness needs a browser readout, or the canonical action is blocked.
- Report the story: what was tried, what the metric means, the keep/discard/measure/crash/checks decision, the next move, blockers, optional dashboard URL, and verification.
Documentation awareness
Use docs only as needed; do not load everything by default.
- Start/resume or normal operation: `docs/start.md`, `docs/operate.md`, and `references/loop-operations.md`.
- Dashboard, trust, drift, protected paths, unsafe commands, and redaction: `docs/trust.md`, `docs/architecture.md`, and `references/dashboard-trust.md`.
- Deep research, quality gaps, fanout, finalization, or subagent handoffs: `docs/finish.md`, `docs/workflows.md`, and `references/research-finalize.md`.
- Troubleshooting: `docs/troubleshooting.md`.
- Control-plane failures or cross-surface disagreement: `docs/control-plane.md`.
Start or resume
- Identify the owning repo or child package before Git, installs, tests, builds, or autoresearch commands.
- Check Git status and work around unrelated dirty files.
- If this repo is the target, use the repo-local plugin. From the wrapper root: `node plugins/codex-autoresearch/scripts/autoresearch.mjs ...`. From the package root: `node scripts/autoresearch.mjs ...`.
- Read `autoresearch.md`, `autoresearch.jsonl`, and `autoresearch.ideas.md` when present.
- Use `setup-plan` for read-only setup guidance when essentials are unclear. Use `setup` only when essentials are known and files should be created.
- Run `doctor --cwd <project> --check-benchmark --explain` before the first trusted packet or any drift-sensitive metric.
- Use the happy path first: `setup -> doctor -> next -> log -> state -> finalize-preview`.
- Before another packet, read `recommend-next --compact` or `state --compact`; obey blockers; open detailed diagnostics only when the canonical action is blocked, stale, or unclear.
- Use `state --report` for a terminal-first `report.text`. Governance fields are listed in `docs/concepts.md#state-fields`.
- Run `serve --cwd <project>`, verify liveness, and provide the live dashboard URL only when the user asks, the browser readout matters, or CLI state is not enough.
- For retrieval/search/ranking/performance work, require quality constraints before promotion.
- Treat optional `task_manifest` packet evidence as audit data; quarantine malformed manifests and path escapes without invalidating unrelated metric evidence.
- Treat benchmark-shaped fixes as diagnostic until proven otherwise. Row-specific detector or citation work is diagnostic repair until a holdout, repeat, breadth, or promotion gate proves the broader claim.
- If `session-forensics` imports benchmark-overfit or row-specific steering feedback, treat the decision capsule as a trust blocker.
- Treat runtime freshness as unavailable unless the installed runtime version and built-entrypoint fingerprint can be inspected and matched.
Happy-path CLI from plugins/codex-autoresearch:
node scripts/autoresearch.mjs setup --cwd <project> --name "<session>" --metric-name <metric> --direction lower --benchmark-command "<command>"
node scripts/autoresearch.mjs doctor --cwd <project> --check-benchmark --explain
node scripts/autoresearch.mjs next --cwd <project>
node scripts/autoresearch.mjs log --cwd <project> --from-last --status measure --description "Baseline measurement"
node scripts/autoresearch.mjs state --cwd <project> --report
node scripts/autoresearch.mjs finalize-preview --cwd <project>
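When the happy path is driven from a script rather than typed by hand, building argument arrays avoids shell-quoting mistakes in values such as the session name or benchmark command. The sketch below is a hypothetical wrapper: the function name `happyPathCommands` and every placeholder value are assumptions; only the subcommands and flags are taken from the commands above.

```javascript
// Hypothetical wrapper, not shipped with the plugin.
// Subcommands and flags mirror the happy-path commands listed above.
function happyPathCommands({ project, name, metricName, direction, benchmarkCommand }) {
  const cli = ["node", "scripts/autoresearch.mjs"];
  const cwd = ["--cwd", project];
  return [
    [...cli, "setup", ...cwd, "--name", name, "--metric-name", metricName,
      "--direction", direction, "--benchmark-command", benchmarkCommand],
    [...cli, "doctor", ...cwd, "--check-benchmark", "--explain"],
    [...cli, "next", ...cwd],
    [...cli, "log", ...cwd, "--from-last", "--status", "measure",
      "--description", "Baseline measurement"],
    [...cli, "state", ...cwd, "--report"],
    [...cli, "finalize-preview", ...cwd],
  ];
}

const happyPath = happyPathCommands({
  project: "./my-project",           // placeholder path
  name: "baseline session",          // placeholder session name
  metricName: "latency_ms",          // placeholder metric
  direction: "lower",
  benchmarkCommand: "npm run bench", // placeholder benchmark command
});

// Joined with spaces for display only; quoting is lost in this printout.
for (const argv of happyPath) console.log(argv.join(" "));
```

Each array can be passed to Node's `child_process.spawnSync(argv[0], argv.slice(1))` from the package root. Stop at the first non-zero exit status instead of continuing down the list.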
Active loop contract
After `next`, log the packet. After `log`, read the returned `continuation` object.
- Only `next` writes a reusable last-run packet. `run` remains a raw benchmark probe.
- Use `log --from-last` instead of retyping parsed metrics.
- `keep`, ordinary `discard`, and `measure` require a finite primary metric.
- Use `measure` for non-promotional evidence: baselines, no-change probes, environment checks, and diagnostics.
- `crash` and `checks_failed` can be logged without inventing sentinel metrics.
- Treat `review_required` metrics as provisional until ASI acknowledges the review outcome.
- If `autoresearch.config.json` contains `fixedControl`, treat the named artifact as control truth. Do not rerun commands matching `forbiddenCommandPatterns` unless the user explicitly accepts `--allow-fixed-control-rerun`; prefer `reuseCommandHint`.
- If run numbers duplicate, segments look stale, or manual log entries were edited, run `ledger-doctor --cwd <project> --json` before another packet. Use `ledger-doctor --repair --yes` only after reviewing the JSON health summary; after repair, verify the returned `backupPath`.
- Read parsed metrics and promotion readiness separately. New keeps default to exploratory unless repeat, holdout, breadth, or explicit promotion metadata make the evidence promotable.
- The loop contract is the authority for whether to spend another packet. `sourceCleanliness.blocks.nextPacket=false` only says source dirtiness is not the blocker.
- Control-plane contracts are packet brakes too: goal mismatches, missing scoped approvals, stale process residue, unsupported broad claims, and unsafe finalization runways outrank another packet.
- When the metric improves because the benchmark was steered toward known answers, say so.
- If `continuation.shouldContinue` is true, choose the next hypothesis from ASI, experiment memory, `autoresearch.ideas.md`, or dashboard lane guidance.
- If `continuation.forbidFinalAnswer` is true, continue with progress updates instead of returning a final answer.
- Respect packet and wall-clock budgets.
- If correctness checks fail, run `checks-inspect` before deciding.
- Stop when the user interrupts, the limit or budget is reached, benchmark/checks are blocked, cleanup would be unsafe, a fresh segment is needed, or the goal is genuinely exhausted.
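The status rules in this contract can be read as a small pre-check before calling `log`. This is a sketch under stated assumptions: `checkLogEntry` and its return shape are hypothetical, and the CLI's real validation may differ. Only the status names and the finite-metric rule come from the list above.

```javascript
// Hypothetical pre-check, not part of the CLI.
// Status names and the finite-metric rule come from the loop contract.
// The contract says "ordinary" discard; any special discard case is not modelled here.
const METRIC_REQUIRED = new Set(["keep", "discard", "measure"]);
const METRIC_OPTIONAL = new Set(["crash", "checks_failed"]);

function checkLogEntry(status, primaryMetric) {
  if (METRIC_REQUIRED.has(status)) {
    return Number.isFinite(primaryMetric)
      ? { ok: true }
      : { ok: false, reason: `${status} requires a finite primary metric` };
  }
  if (METRIC_OPTIONAL.has(status)) {
    // No sentinel value needs to be invented for these statuses.
    return { ok: true };
  }
  return { ok: false, reason: `unknown status: ${status}` };
}

console.log(checkLogEntry("keep", 12.5));
console.log(checkLogEntry("measure", NaN));
console.log(checkLogEntry("crash", undefined));
```

The check deliberately says nothing about promotion readiness, which the contract treats as a separate question from whether a metric parsed.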
Codex-only Goal completion
- Use `completionAudit` before a parent agent calls `update_goal(status="complete")`.
- Do not complete a parent Codex Goal while the continuation says the loop is still active.
- Keep Goal state in Codex; Autoresearch only provides `codex-goal-brief` and completion-audit evidence.
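The completion rule can be sketched as a guard a parent agent evaluates before marking a Goal complete. The function `mayCompleteGoal` and the boolean `auditPassed` are hypothetical; the shape of the completion-audit evidence is not described here. Only `shouldContinue` and `forbidFinalAnswer` are field names taken from this document.

```javascript
// Hypothetical guard, not part of the plugin.
// `auditPassed` stands in for whatever the completion audit reports.
function mayCompleteGoal(continuation, auditPassed) {
  // Conservative choice of this sketch: no continuation data means no completion.
  if (!continuation) return false;
  const loopActive = Boolean(
    continuation.shouldContinue || continuation.forbidFinalAnswer
  );
  return !loopActive && auditPassed === true;
}

console.log(mayCompleteGoal({ shouldContinue: true, forbidFinalAnswer: true }, true));
console.log(mayCompleteGoal({ shouldContinue: false, forbidFinalAnswer: false }, true));
```

Treating a missing continuation as "not done" matches the preference for durable session state over chat memory, but it is a design choice of this example.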
CLI fallback:
node scripts/autoresearch.mjs next --cwd <project> --compact
node scripts/autoresearch.mjs log --cwd <project> --from-last --status keep --description "Describe the kept change"
node scripts/autoresearch.mjs state --cwd <project> --report
node scripts/autoresearch.mjs state --cwd <project> --compact
Dashboard
Use the served dashboard when a live readout is useful:
- Use `scripts/autoresearch.mjs serve --cwd <project>`.
- Share the served `http://127.0.0.1:<port>/` URL by default.
- Static exports are read-only snapshots; serve a fresh dashboard when packet freshness matters.
- Readout only. Use the CLI to do the work.
- The live server accepts only loopback Host headers, sends defensive headers, and keeps the raw ledger endpoint disabled unless `--debug-ledger` is explicitly used.
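A loopback-only Host check of the kind described above might look like the following. This is an illustrative sketch, not the server's actual code: `isLoopbackHost` is a hypothetical name, and the accepted set (`localhost`, `127.0.0.0/8`, `[::1]`) is an assumption about what "loopback" covers.

```javascript
// Hypothetical check, not the plugin's implementation.
function isLoopbackHost(hostHeader) {
  if (typeof hostHeader !== "string" || hostHeader === "") return false;
  // Split off an optional :port; bracketed IPv6 literals such as [::1]:8080 are kept whole.
  const match = /^(\[[^\]]+\]|[^:]+)(?::\d+)?$/.exec(hostHeader.trim().toLowerCase());
  if (!match) return false;
  const host = match[1];
  if (host === "localhost" || host === "[::1]") return true;
  // Accept 127.x.y.z only when every part is a valid octet.
  const octets = host.split(".");
  return (
    octets.length === 4 &&
    octets[0] === "127" &&
    octets.every((part) => /^\d{1,3}$/.test(part) && Number(part) <= 255)
  );
}

console.log(isLoopbackHost("127.0.0.1:4173"));
console.log(isLoopbackHost("example.com"));
```

Rejecting non-loopback Host values is a common defense against DNS rebinding for local development servers, which is presumably why the dashboard restricts them.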
Deep research loops
Use a deep-research loop for broad, qualitative, product-study, UX, architecture, or documentation prompts. Study, accept gaps, measure quality_gap, close credible candidates, then start a fresh round when the question is still alive.
- Start with `research-start --cwd <project> --slug <slug> --goal "<goal>"`. It seeds `autoresearch.research/<slug>/`, configures the `quality_gap` benchmark, validates the command, records the first baseline as `measure`, and prints the resume commands. Use `--no-baseline-log` only when that first baseline should not be recorded.
- Keep sources dated and claim-specific in `autoresearch.research/<slug>/sources.md`.
- Write the judgment pass in `synthesis.md`: filter hallucinations, separate evidence from inference.
- Turn accepted findings into `quality-gaps.md`.
- Measure with `quality-gap --cwd <project> --research-slug <slug> --list`.
- Preview candidates with `gap-candidates`; apply only credible high-impact gaps.
- Log implementation or rejection with ASI.
- Start a fresh round before claiming there are no more high-impact gaps.
quality_gap=0 only means the accepted checklist for the current round is closed. Read freshRoundSuggested, researchIntegrity, sourceCleanliness, finalization readiness, and plateau reason fields before deciding next steps.
For crashed or timed-out packets with artifact rows, use partial-results --from-last before rerunning expensive work.
Finalize
Use finalization when noisy loop history has useful kept commits.
- Run `finalize-preview --cwd <project>` before branch creation.
- Keep only accepted/current `status: "keep"` evidence.
- Compare product claim coverage against accepted evidence.
- If coverage is missing, report experimental status. Use "Experimental review branch only: product-grade proof is missing."
- Treat previews and plans as read-only.
- Review dirty tree, stale plan, overlap, semantic safety, and excluded-file warnings.
- Session artifacts are excluded by default. Use `--include-session-artifacts` only when the reviewer explicitly wants them.
- When state reports `current-tree-finalization`, run `finalize-current-tree --cwd <project> --exclude-session-artifacts`. Do not substitute generic `finalize-preview` as the primary command.
- Ask before creating branches unless the user already approved finalization.
- Runway order: preview, approve, create review branches, verify, merge into trunk, verify the merge, cleanup.
- Do not suggest branch cleanup until merge verification has succeeded.
- Classify existing review branches before reuse.
- Report created review branches, files, metric improvement, claim coverage, verification, runway status, and remaining risk.
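The runway order can be treated as a strict sequence, which makes the "no cleanup before merge verification" rule easy to check. The sketch below is hypothetical: the step identifiers, `nextRunwayStep`, and `mayCleanUp` are invented for illustration and are not names the CLI emits.

```javascript
// Hypothetical step identifiers mirroring the runway order above.
const RUNWAY = [
  "preview",
  "approve",
  "create-review-branches",
  "verify",
  "merge",
  "verify-merge",
  "cleanup",
];

// Returns the next step, or null when the runway is finished.
// Throws if the completed steps skip ahead or arrive out of order.
function nextRunwayStep(completed) {
  for (let i = 0; i < RUNWAY.length; i++) {
    if (completed[i] !== RUNWAY[i]) {
      if (i < completed.length) {
        throw new Error(`out-of-order step: ${completed[i]}`);
      }
      return RUNWAY[i];
    }
  }
  return null;
}

// Cleanup may be suggested only once merge verification is the last completed step.
function mayCleanUp(completed) {
  return nextRunwayStep(completed) === "cleanup";
}

console.log(nextRunwayStep(["preview", "approve"]));
console.log(mayCleanUp(["preview", "approve", "create-review-branches", "verify", "merge"]));
```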
Subagent handoffs
When Codex uses subagents to work on Autoresearch itself:
- Each lane states scope, evidence source, decision, handoff artifact, and tests.
- No nested subagents.
- Do not run overlapping write lanes. Split by ownership first, then merge through one parent context.
- Reviewers should check the decision-envelope contract, packet freshness, dashboard read-only behavior, finalization artifact policy, and docs/changelog sync.
Verification
Use the narrowest relevant check while iterating. Before claiming plugin work is done, run from plugins/codex-autoresearch:
npm run check
Targeted checks:
npm test
node scripts/autoresearch.mjs --help
node scripts/autoresearch.mjs doctor --cwd . --check-benchmark --explain
node scripts/autoresearch.mjs benchmark-lint --cwd .
node scripts/autoresearch.mjs checks-inspect --cwd . --command "npm test"
git diff --check