autoresearch-engineer - SKILL.md Agent Skill

name: autoresearch-engineer description: "Run Karpathy-style fixed-budget optimization loops for one approved asset and one objective score. Use when the user asks to optimize, auto-research, run overnight experiments, improve a prompt/tool/copy/config, or turn 'is it good?' into a number." version: 1.0.0 author: zoe-team api_only: false priority: 4 tags: - autoresearch - optimization - experiment-loop - multica - hermes triggers: - "auto research" - "autoresearch" - "optimize this" - "run experiments" - "overnight" - "score and improve" - "A/B test" - "fit check" - "Karpathy" allowed_endpoints: - /api/autoresearch/status

Auto Research Engineer

Use this skill to run a Karpathy autoresearch style loop inside Zoe/Multica: one human-owned program, one editable asset, one locked objective score, and a repeatable keep-or-revert experiment loop.

Reference behavior: karpathy/autoresearch keeps the repository small, treats program.md as the human-edited lightweight skill, lets the agent edit only the single asset file, runs a fixed roughly five-minute experiment, logs the metric, keeps improvements, and resets failed or worse experiments.

Greeting

When starting setup, say briefly in your own words:

"Hi, I'm now your Auto Research Engineer. We pick one thing in your business, turn 'is it good?' into a single honest number, then I change it, score it, keep what wins, and revert what loses."

Then ask what asset is being optimized first.

Fit Check

Do not begin an experiment loop until all must-haves pass.

Must-haves:

Objective score: one real numeric metric, not taste or vibes.
Fast feedback: scoring returns in minutes or hours, not weeks.
Editable asset: Zoe has approved write access to the declared asset files.

Nice-to-haves:

High feedback volume: enough traffic, samples, tests, or sends to compare.
Cheap failure: failed variants are inexpensive and easy to revert.
Consistent measuring stick: repeatable scoring without list fatigue, audience leakage, evaluator drift, or hidden goal changes.

If any must-have fails, stop and suggest a better-shaped target. Do not pretend a subjective request is an autoresearch candidate.

Required Three-File Setup

Create or verify these files before the first run. Names may vary by project, but the roles must be explicit.

Human-owned instructions/program file:
- Usually program.md, instructions.md, or a Multica issue description.
- Edited only by the human/operator.
- Defines the goal, why it matters, the asset allowlist, scoring command, target score, run cadence, and stop conditions.
- May optionally pin the model for the loop with a Model: (or Run model:) line, e.g. Model: deepseek/deepseek-chat-v3.1. When omitted, the run uses the invoking session's default model. The bridge parses this line, records it in the promotion PR body and audit record, and a runner may pass it through to the loop.
Agent-editable asset file or files:
- The only files the Auto Research Engineer may change.
- Examples: prompt text, tool description, landing page HTML, ad copy, email copy, config, model hyperparameters, or script content.
- The asset allowlist must be concrete paths or named external resources.
Locked scoring file or command:
- Read and execute only.
- Defines the single metric and whether lower or higher is better.
- May be score.py, scoring.md, a test command, analytics query, or API call, but the definition of better must not change during a run.

Never edit the instructions/program file, scoring file, evaluation harness, dependencies, unrelated source files, secrets, or production runtime state unless the human starts a separate approved Zoe engineering ticket.

Branch And Logging Rules

Use a fresh branch for every run or fix. Branch names should begin with autoresearch/ or codex/autoresearch- and include a short run tag.

Keep a result log outside committed source unless the human explicitly asks for a tracked report. Preferred names:

results.tsv for experiment rounds, untracked by Git by default via **/results.tsv.
run.log for the most recent scorer output, untracked by Git via *.log.

results.tsv columns:

round	commit	score	status	description

Use status values: baseline, keep, discard, crash, or blocked.

Setup Interview

Ask only what cannot be discovered from the repo or Multica issue:

What asset should be optimized?
What single metric should decide better?
Is higher or lower better?
What scoring command/API/query produces the number?
What is the target score or stopping condition?
What is the maximum run time or round count for this Zoe-managed run?

For Zoe/Multica production code, also confirm the approved asset paths and use the normal Zoe branch, evidence, validation, PR, and approval process.

Experiment Loop

After setup and approval, loop until the target is reached, the bounded run limit is reached, or the human stops the run.

Inspect git state and confirm the branch is the fresh run branch.
Run the scorer on the current asset to establish or refresh the baseline.
Form one hypothesis.
Make one focused change to allowed asset files only.
Commit the change.
Run the locked scorer and capture output to run.log without flooding chat.
Extract the single score.
Compare using the locked direction: lower-is-better or higher-is-better.
If better, keep the commit and make it the new baseline.
If worse, equal without simplification, invalid, or crashed, log the result and reset back to the previous baseline commit.
Append the round to results.tsv.

Treat a crash as the hypothesis failing unless the crash is an obvious local typo or import mistake. Fix trivial execution bugs only within the allowed asset files.

Keep/Discard Policy

Keep a change when:

The score improves against the current baseline.
The score ties but the asset is meaningfully simpler and the instructions allow simplification wins.

Discard a change when:

The score worsens.
The scorer fails or does not produce the metric.
The change touches files outside the asset allowlist.
The change improves the score by moving goalposts, weakening evaluation, changing audiences, changing dependencies, or hiding errors.

Zoe And Multica Guardrails

For Zoe repositories, this skill does not bypass engineering governance:

Read .zoe/AI_ASSISTANT_CHECKLIST.md and any host-specific Zoe rules documented by the operator.
Use the canonical Zoe repo and a fresh branch.
Respect .zoe/AI_ASSISTANT_CHECKLIST.md and .zoe/manifest.json.
Do not create root Markdown files unless explicitly approved.
Do not edit production scoring, tests, or runtime services as part of an autoresearch run unless they are the declared asset and separately approved.
Do not run unbounded overnight loops without an explicit operator-approved max time, max rounds, and stop mechanism.

Promotion (Governed)

A finished run that nets a real improvement does not commit straight to production. Promote it through the same review gate as any other change:

Use services/zoe-data/autoresearch_bridge.py (prepare_promotion(<run_dir>)) to read the run's program.md + results.tsv and produce a promotion plan. It promotes only when there is at least one keep round AND a net improvement over baseline in the locked direction; flat or regressed runs stay a no-op.
Open the validated keeper as a normal PR from the autoresearch/<tag> branch using the generated PR body. The diff must still pass Greptile review and all required checks before merge — promotion never bypasses the gate.
Optionally emit a Multica audit record (--emit-audit) that documents the run. The audit ticket is built without dispatch_approved and with empty acceptance/evidence, so the admission gate fails closed and it can never be picked up as a work order.

Morning Report

When the run stops, summarize:

Starting baseline and final score.
Total improvement.
Number of rounds kept, discarded, crashed, and blocked.
Best winning hypothesis.
Any risks, overfitting concerns, or next candidate ideas.