name: autoloop
description: >
  MUST use when the user wants autonomous, iterative optimization, letting Claude run experiments unattended. Trigger: "autoloop", "autoresearch", "experiment loop", "hill-climbing", "optimize overnight", "karpathy loop", "let Claude optimize while I sleep", "automate trying different approaches", "set up a loop to improve this", "run experiments overnight", or any request for iterative improvement with a scalar metric. Generates program.md + immutable runner script (auto/run.sh) with tiered quality gates and structured METRIC output, ready to run with claude --dangerously-skip-permissions.
effort: high
allowed-tools: [Read, Write, Glob, Grep, Bash, Agent]
Autoloop — Autonomous Experiment Loop Generator
Goal
Turn an LLM coding agent into an autonomous scientist. Generate a self-contained program.md + auto/run.sh that lets the agent loop forever — edit code, run experiment, parse metric, git commit (keep) or git reset (revert) — while the human walks away.
The skill's job is the design thinking: mapping an arbitrary project onto the seven essential components that make this loop work, then generating the files. Getting the components right is the difference between a loop that runs 126 experiments overnight and one that crashes after 3.
Dependencies
Tools
- `autoloop:codebase-scout`: Subagent that explores the project directory to identify build system, test commands, source files, and candidate metrics. Delegate via `Agent(subagent_type="autoloop:codebase-scout", model="haiku")`.
- `git`: Used for checkpoint/rollback (commit to keep, reset to revert). Must be available in the project.
Connectors
- Project build system: Whatever runs the experiments (pytest, cargo, npm, go test, etc.). Detected by the codebase-scout.
- `claude` CLI: The generated loop runs via `claude --dangerously-skip-permissions -p "Read program.md and execute the loop protocol."`.
Context
The Seven Essential Components
Every autoloop maps onto these seven components. There is no orchestration code — program.md IS the entire system.
- Mutable artifact: the one file the agent edits
- Immutable context: files the agent reads but never touches
- Primary metric: a single number that says "better" or "worse"
- Secondary metrics: numbers tracked for tradeoff monitoring (not optimized; they guard against Goodhart's Law)
- Runner script: immutable shell script that runs quality gates and emits structured `METRIC` output
- Quality gates: tiered checks (fast tests → conformance/lint → benchmark) that fail fast
- Checkpoint/rollback: git commit to keep, git reset to revert
Plus a results ledger (results.tsv) and an embedded progress log (in program.md itself) that give the agent full history every iteration.
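The keep-or-revert decision at the heart of the loop can be sketched as a pure comparison. This is a minimal illustration, assuming the direction strings "lowest" and "highest" used later in the design step; the function names `is_improvement` and `decide` are hypothetical and do not come from the generated files.

```python
def is_improvement(candidate: float, best: float, direction: str) -> bool:
    """Return True when `candidate` strictly beats `best` for the given direction."""
    if direction == "lowest":
        return candidate < best
    if direction == "highest":
        return candidate > best
    raise ValueError(f"unknown direction: {direction!r}")


def decide(candidate: float, best: float, direction: str) -> str:
    """Map a metric comparison onto the loop's two outcomes.

    "keep" corresponds to a git commit, "revert" to a git reset.
    A NaN candidate never compares as better, so it is reverted.
    """
    return "keep" if is_improvement(candidate, best, direction) else "revert"
```

Requiring a strict improvement means a tie is reverted, which keeps the commit history limited to changes that actually moved the primary metric.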
Mutable Artifact Selection
| Goal | Likely mutable file |
|---|---|
| ML training improvement | The training script (train.py, train.rs) |
| Test coverage | The source files being tested (pick the lowest-coverage one) |
| Performance | The module containing the hot path |
| Lint score | Source files with the most violations |
| Prompt engineering | The prompt template file |
| Config tuning | The config file being tuned |
The mutable file should be small enough for the agent to read in one pass. If >500 lines, suggest a focused subset or ask the user to extract the relevant section.
Metric Inference
Check what the project already has:
| Project has... | Candidate metric |
|---|---|
| Tests | Test count, coverage %, pass rate |
| Benchmarks | Execution time, throughput, ops/sec |
| Linting | Ruff/pylint issue count (lower is better) |
| ML training | Validation loss, accuracy, perplexity |
| Eval suite | Accuracy, F1, score |
Common secondary metrics by domain (guardrails, NOT optimized):
| Primary metric | Good secondary metrics |
|---|---|
| Execution time (µs) | Allocations, memory usage, code complexity |
| Test coverage (%) | Test count, test execution time |
| Lint score | Lines of code, cyclomatic complexity |
| Validation loss | Training time, GPU memory, inference latency |
| Throughput (req/s) | P99 latency, error rate, CPU usage |
Quality Gate Design
Gates run before the benchmark, ordered fastest-first. Early gate failure → immediate exit → no wasted benchmark time.
| Gate | Purpose | Failure mode | Example |
|---|---|---|---|
| Unit tests (fast) | Correctness | Hard fail (exit 1) | uv run pytest tests/unit -x |
| Conformance/lint | Style + spec | Soft fail with threshold | ruff check --statistics, allow ≤N issues |
| Type check | Type safety | Hard fail | uv run mypy src/ |
Use what the project already has — don't add new tooling.
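The fail-fast ordering and the hard/soft distinction in the table can be sketched as follows. The real runner is a shell script; this Python version only illustrates the control flow, and every name in it (`Gate`, `run_command`, `run_gates`, `count_issues`, `max_issues`) is a hypothetical helper, not part of the skill.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Gate:
    name: str
    command: List[str]
    hard: bool = True  # hard fail: any non-zero exit code stops the run
    # Soft gates only: hypothetical callback that counts issues in the output.
    count_issues: Optional[Callable[[str], int]] = None
    max_issues: int = 0  # soft gates fail only above this threshold


def run_command(command: List[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def run_gates(gates: List[Gate], run=run_command) -> Optional[str]:
    """Run gates in order; return the first failing gate's name, or None."""
    for gate in gates:
        code, output = run(gate.command)
        if gate.hard:
            if code != 0:
                return gate.name  # stop before any slower gate or benchmark
        elif gate.count_issues is not None:
            if gate.count_issues(output) > gate.max_issues:
                return gate.name
    return None
```

Passing the gates sorted fastest-first means a failing unit test returns before the benchmark is ever started. The `run` parameter exists only so the flow can be exercised without real commands.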
Domain Strategies
For detailed allowed change types per domain (ML, test coverage, performance, lint, prompts, config tuning), consult:
→ references/domain-examples.md
Process
Step 0: Load Stored Feedback
Load any stored feedback preferences before starting:
python "${CLAUDE_PLUGIN_ROOT}/scripts/feedback_manager.py" autoloop show-feedback
If feedback entries exist, apply the returned preferences (loop_design, metrics, quality_gates, runner_script, time_budget, change_strategy, general) throughout loop design.
Step 1: Scout the Project
Delegate to the codebase-scout agent:
Agent(
    subagent_type="autoloop:codebase-scout",
    model="haiku",
    prompt="Explore {cwd} and return a structured summary of: project type, language, build/test/bench commands, source files, config files, candidate metrics, and immutable files. See your instructions for the full output format.",
    description="Scout project for autoloop"
)
Tell the user: "I'm exploring your project to understand the build system, test infrastructure, and what metrics we can optimize. This takes about 15 seconds."
When results come back, summarize in 3-5 bullet points. Don't dump the raw output.
Step 2: Design the Loop
Think hard before committing to the design. This is the highest-leverage decision in the skill: a wrong artifact, metric, or gate wastes hours of unattended runtime. Reason explicitly through the trade-offs of each component — and how they interact — before presenting anything. Using the scout results AND the user's stated goal, design all seven components.
2a. Infer the mutable artifact — Use the selection table from Context. If the answer isn't obvious, present 2-3 options with trade-offs.
2b. Infer the metric — Use the metric inference tables from Context. Determine the direction: "lowest" (minimize) or "highest" (maximize). Identify 1-3 secondary metrics as guardrails.
STOP if no metric can be inferred. Do not guess. Ask the user: "I can see how to run experiments, but I can't determine what metric to optimize. What number should I be trying to improve? It needs to be something I can parse from command output."
2c. Infer the execution command — Usually comes directly from scout results. The command should redirect output to a log file: {cmd} > run.log 2>&1.
2d. Design the time budget:
- Fast tests (<30s): budget 1 min, timeout 3 min
- Medium tests (1-5 min): budget to match, timeout 2x
- Slow training (>5 min): budget to match, timeout 3x
- Very slow (>30 min): warn the user that fewer experiments will run overnight
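The budget tiers above can be written as a small lookup. This is a sketch under one stated assumption: the list leaves runs between 30 seconds and 1 minute unassigned, and the code treats them as medium. The function name `time_budget` and the dictionary keys are hypothetical.

```python
def time_budget(typical_seconds: float) -> dict:
    """Return budget and timeout (in seconds) for one experiment.

    Assumption: runs of 30 s to 5 min are all treated as "medium",
    which covers the 30 s to 1 min range the tiers do not mention.
    """
    if typical_seconds < 30:
        # Fast tests: fixed 1 min budget, 3 min timeout.
        budget, timeout = 60.0, 180.0
    elif typical_seconds <= 300:
        # Medium tests: budget matches the run, timeout is 2x.
        budget, timeout = typical_seconds, 2 * typical_seconds
    else:
        # Slow training: budget matches the run, timeout is 3x.
        budget, timeout = typical_seconds, 3 * typical_seconds
    return {
        "budget_s": budget,
        "timeout_s": timeout,
        # Very slow runs (>30 min): warn that few experiments fit overnight.
        "warn_few_runs": typical_seconds > 1800,
    }
```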
2e. Define files in scope and off limits. Be specific with paths. "Don't touch tests" is vague; "`test/**/*.py`: test suite, must continue to pass unchanged" is clear.
2f. Define allowed change types — Read the appropriate domain block from references/domain-examples.md.
2g. Design quality gates — Use the gate design table from Context. For each gate, determine: command, failure mode (hard/soft), threshold (for soft fails).
Human Checkpoint: Present the Design
Present the complete design as a single summary:
## Autoloop Design
**Goal**: {what we're optimizing}
**Mutable file**: `{path}` — {description}
**Primary metric**: {metric_name} ({units}, {direction} is better)
**Secondary metrics**: {name1} ({units}), {name2} ({units}) — tracked, not optimized
**Quality gates**:
1. {gate1_name}: `{command}` — {hard/soft fail}
2. {gate2_name}: `{command}` — {hard/soft fail, threshold if soft}
3. Benchmark: `{bench_command}`
**Time budget**: ~{budget} per experiment (timeout: {timeout})
**Files in scope**: {list}
**Off limits**: {list}
**Strategy**: {domain} — {brief description of change types}
Does this look right? I'll adjust anything before generating.
Wait for user confirmation before proceeding.
Step 3: Generate Runner Script and Verify Baseline
3a. Generate auto/run.sh — Read references/runner-script-template.sh and fill in quality gates + metric extraction from the design.
mkdir -p auto
The runner script structure:
- Shebang + set flags: `#!/usr/bin/env bash` + `set -euo pipefail`
- cd to project root: `cd "$(dirname "$0")/.."`
- Quality gates in order (fastest first)
- Benchmark: runs the metric-producing command
- METRIC output: prints `METRIC key=value` lines
Make it executable: chmod +x auto/run.sh
3b. Verify baseline — Run the script once and check:
./auto/run.sh > run.log 2>&1
echo "Exit code: $?"
grep '^METRIC ' run.log
Verify: exit code 0, METRIC lines present, values reasonable (not NaN, and not 0 when a zero would be implausible).
Do not proceed to generation until the baseline passes. If anything fails, debug it with the user.
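The baseline check can also be done programmatically. The sketch below assumes each metric appears on its own line as `METRIC key=value` with a plain numeric value, matching the `grep '^METRIC '` pattern above; the names `parse_metrics` and `baseline_problems` are hypothetical.

```python
import math
import re

# Assumed line shape: "METRIC key=value", anchored at the start of the line.
METRIC_LINE = re.compile(r"^METRIC (\S+?)=(\S+)\s*$")


def parse_metrics(log_text: str) -> dict:
    """Collect METRIC lines from a run log into {name: float}."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if not match:
            continue
        try:
            metrics[match.group(1)] = float(match.group(2))
        except ValueError:
            # Unparseable values are recorded as NaN so they get flagged.
            metrics[match.group(1)] = float("nan")
    return metrics


def baseline_problems(metrics: dict, primary: str) -> list:
    """Return human-readable problems; an empty list means the baseline is usable."""
    problems = []
    if primary not in metrics:
        problems.append(f"missing METRIC line for {primary}")
    for name, value in metrics.items():
        if math.isnan(value) or math.isinf(value):
            problems.append(f"{name} is not finite")
    return problems
```

Whether a value of 0 is suspicious depends on the metric, so that judgment is left to the person reviewing the baseline.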
Record the baseline commit hash: git rev-parse --short HEAD
Step 4: Generate program.md
4a. Read the template — Read references/program-md-template.md.
4b. Read domain strategy — Read the appropriate section from references/domain-examples.md.
4c. Fill variables — Replace all {VARIABLE} placeholders with values from the design.
For the complete variable mapping, consult:
→ references/program-md-template.md (variables are documented inline)
Human Checkpoint: Preview and Confirm
Show the user the generated program.md content:
"Here's the program.md I'll write to your project root. Review it — once you confirm, I'll create the files."
Wait for user confirmation before writing.
On confirmation, write:
- `program.md` to the project root
- `results.tsv` with just the header row
- Add `results.tsv` and `run.log` to `.gitignore` (append if it exists, create if not, skip if already listed)
Do NOT git commit. Leave that to the user.
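The `.gitignore` rule (append if the file exists, create it if not, skip entries already listed) can be sketched as an idempotent helper. The function name `ensure_ignored` is hypothetical, and the match is a simple exact-line comparison that does not interpret glob patterns already in the file.

```python
from pathlib import Path
from typing import List


def ensure_ignored(gitignore: Path, entries: List[str]) -> List[str]:
    """Add missing entries to .gitignore and return the ones that were added."""
    # Read current lines if the file exists; otherwise start from nothing.
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    listed = {line.strip() for line in existing}
    missing = [entry for entry in entries if entry not in listed]
    if missing:
        # Rewrite with the new entries appended, one per line.
        gitignore.write_text("\n".join(existing + missing) + "\n")
    return missing
```

Calling `ensure_ignored(Path(".gitignore"), ["results.tsv", "run.log"])` a second time returns an empty list and leaves the file untouched.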
Output
Generated Files
| File | Purpose | Mutable by agent? |
|---|---|---|
| `auto/run.sh` | Quality gates + METRIC output | Never |
| `program.md` | Loop instructions + embedded progress log | Progress log only |
| `results.tsv` | Experiment ledger (append-only) | Append only, never committed |
Launch Instructions
Print to the user after file generation:
## Ready to Launch
To start:
1. Review `auto/run.sh` and `program.md`.
2. Start the loop:
claude --dangerously-skip-permissions -p "Read program.md and execute the loop protocol. Do not stop until I interrupt you."
3. Walk away. The agent will:
- Create a branch (autoloop/{tag})
- Establish baseline via ./auto/run.sh
- Loop: edit → run → measure → keep/revert
- Log every experiment to results.tsv
- Update the Progress Log in program.md
4. When you come back:
cat results.tsv # Full experiment trajectory
grep '^- ' program.md | tail -20 # Progress log of kept changes
git log --oneline # Which iterations were kept
git diff main..HEAD # Cumulative changes
5. If you like the results:
git checkout main
git merge autoloop/{tag} # Or cherry-pick specific commits
Troubleshooting
For common issues (agent stops early, every experiment crashes, metric not improving, etc.), consult:
→ references/troubleshooting.md