autoresearch - SKILL.md Agent Skill

name: autoresearch description: Autonomous Goal-directed Iteration. Apply Karpathy's autoresearch principles to ANY task. Loops autonomously — modify, verify, keep/discard, repeat. 9 subcommands: plan, debug, fix, security, ship, scenario, predict, learn. version: 1.8.2 license: MIT metadata: author: uditgoenka/autoresearch version: "1.8.2"

Claude Autoresearch — Autonomous Goal-directed Iteration

Inspired by Karpathy's autoresearch. Applies constraint-driven autonomous iteration to ANY work — not just ML research.

Core idea: You are an autonomous agent. Modify → Verify → Keep/Discard → Repeat.

MANDATORY: Interactive Setup Gate

CRITICAL — READ THIS FIRST BEFORE ANY ACTION:

For ALL commands (/autoresearch, /autoresearch:plan, /autoresearch:debug, /autoresearch:fix, /autoresearch:security, /autoresearch:ship, /autoresearch:scenario, /autoresearch:predict, /autoresearch:learn):

Check if the user provided ALL required context inline (Goal, Scope, Metric, flags, etc.)
If ANY required context is missing → you MUST use AskUserQuestion to collect it BEFORE proceeding to any execution phase. DO NOT skip this step. DO NOT proceed without user input.
Each subcommand's reference file has an "Interactive Setup" section — follow it exactly when context is missing.

Command	Required Context	If Missing → Ask
`/autoresearch`	Goal, Scope, Metric, Direction, Verify	Batch 1 (4 questions) + Batch 2 (3 questions) from Setup Phase below
`/autoresearch:plan`	Goal	Ask via `AskUserQuestion` per `references/plan-workflow.md`
`/autoresearch:debug`	Issue/Symptom, Scope	4 batched questions per `references/debug-workflow.md`
`/autoresearch:fix`	Target, Scope	4 batched questions per `references/fix-workflow.md`
`/autoresearch:security`	Scope, Depth	3 batched questions per `references/security-workflow.md`
`/autoresearch:ship`	What/Type, Mode	3 batched questions per `references/ship-workflow.md`
`/autoresearch:scenario`	Scenario, Domain	4-8 adaptive questions per `references/scenario-workflow.md`
`/autoresearch:predict`	Scope, Goal	3-4 batched questions per `references/predict-workflow.md`
`/autoresearch:learn`	Mode, Scope	4 batched questions per `references/learn-workflow.md`

YOU MUST NOT start any loop, phase, or execution without completing interactive setup when context is missing. This is a BLOCKING prerequisite.

Subcommands

Subcommand	Purpose
`/autoresearch`	Run the autonomous loop (default)
`/autoresearch:plan`	Interactive wizard to build Scope, Metric, Direction & Verify from a Goal
`/autoresearch:security`	Autonomous security audit: STRIDE threat model + OWASP Top 10 + red-team (4 adversarial personas)
`/autoresearch:ship`	Universal shipping workflow: ship code, content, marketing, sales, research, or anything
`/autoresearch:debug`	Autonomous bug-hunting loop: scientific method + iterative investigation until codebase is clean
`/autoresearch:fix`	Autonomous fix loop: iteratively repair errors (tests, types, lint, build) until zero remain
`/autoresearch:scenario`	Scenario-driven use case generator: explore situations, edge cases, and derivative scenarios
`/autoresearch:predict`	Multi-persona swarm prediction: pre-analyze code from multiple expert perspectives before acting
`/autoresearch:learn`	Autonomous codebase documentation engine: scout, learn, generate/update docs with validation-fix loop

/autoresearch:security — Autonomous Security Audit

Runs a comprehensive security audit using the autoresearch loop pattern. Generates a full STRIDE threat model, maps attack surfaces, then iteratively tests each vulnerability vector — logging findings with severity, OWASP category, and code evidence.

Load: references/security-workflow.md for full protocol.

What it does:

Codebase Reconnaissance — scans tech stack, dependencies, configs, API routes
Asset Identification — catalogs data stores, auth systems, external services, user inputs
Trust Boundary Mapping — browser↔server, public↔authenticated, user↔admin, CI/CD↔prod
STRIDE Threat Model — Spoofing, Tampering, Repudiation, Info Disclosure, DoS, Elevation of Privilege
Attack Surface Map — entry points, data flows, abuse paths
Autonomous Loop — iteratively tests each vector, validates with code evidence, logs findings
Final Report — severity-ranked findings with mitigations, coverage matrix, iteration log

Key behaviors:

Follows red-team adversarial mindset (Security Adversary, Supply Chain, Insider Threat, Infra Attacker)
Every finding requires code evidence (file:line + attack scenario) — no theoretical fluff
Tracks OWASP Top 10 + STRIDE coverage, prints coverage summary every 5 iterations
Composite metric: (owasp_tested/10)*50 + (stride_tested/6)*30 + min(findings, 20) — higher is better
Creates security/{YYMMDD}-{HHMM}-{audit-slug}/ folder with structured reports: overview.md, threat-model.md, attack-surface-map.md, findings.md, owasp-coverage.md, dependency-audit.md, recommendations.md, security-audit-results.tsv

Flags:

Flag	Purpose
`--diff`	Delta mode — only audit files changed since last audit
`--fix`	After audit, auto-fix confirmed Critical/High findings using autoresearch loop
`--fail-on {severity}`	Exit non-zero if findings meet threshold (for CI/CD gating)

Usage:

# Unlimited — keep finding vulnerabilities until interrupted
/autoresearch:security

# Bounded — exactly 10 security sweep iterations
/autoresearch:security
Iterations: 10

# With focused scope
/autoresearch:security
Scope: src/api/**/*.ts, src/middleware/**/*.ts
Focus: authentication and authorization flows

# Delta mode — only audit changed files since last audit
/autoresearch:security --diff

# Auto-fix confirmed Critical/High findings after audit
/autoresearch:security --fix
Iterations: 15

# CI/CD gate — fail pipeline if any Critical findings
/autoresearch:security --fail-on critical
Iterations: 10

# Combined — delta audit + fix + gate
/autoresearch:security --diff --fix --fail-on critical
Iterations: 15

Inspired by:

Strix — AI-powered security testing with proof-of-concept validation
/plan red-team — adversarial review with hostile reviewer personas
OWASP Top 10 (2021) — industry-standard vulnerability taxonomy
STRIDE — Microsoft's threat modeling framework

/autoresearch:ship — Universal Shipping Workflow

Ship anything — code, content, marketing, sales, research, or design — through a structured 8-phase workflow that applies autoresearch loop principles to the last mile.

Load: references/ship-workflow.md for full protocol.

What it does:

Identify — auto-detect what you're shipping (code PR, deployment, blog post, email campaign, sales deck, research paper, design assets)
Inventory — assess current state and readiness gaps
Checklist — generate domain-specific pre-ship gates (all mechanically verifiable)
Prepare — autoresearch loop to fix failing checklist items until 100% pass
Dry-run — simulate the ship action without side effects
Ship — execute the actual delivery (merge, deploy, publish, send)
Verify — post-ship health check confirms it landed
Log — record shipment to ship-log.tsv for traceability

Supported shipment types:

Type	Example Ship Actions
`code-pr`	`gh pr create` with full description
`code-release`	Git tag + GitHub release
`deployment`	CI/CD trigger, `kubectl apply`, push to deploy branch
`content`	Publish via CMS, commit to content branch
`marketing-email`	Send via ESP (SendGrid, Mailchimp)
`marketing-campaign`	Activate ads, launch landing page
`sales`	Send proposal, share deck
`research`	Upload to repository, submit paper
`design`	Export assets, share with stakeholders

Flags:

Flag	Purpose
`--dry-run`	Validate everything but don't actually ship (stop at Phase 5)
`--auto`	Auto-approve dry-run gate if no errors
`--force`	Skip non-critical checklist items (blockers still enforced)
`--rollback`	Undo the last ship action (if reversible)
`--monitor N`	Post-ship monitoring for N minutes
`--type <type>`	Override auto-detection with explicit shipment type
`--checklist-only`	Only generate and evaluate checklist (stop at Phase 3)

Usage:

# Auto-detect and ship (interactive)
/autoresearch:ship

# Ship code PR with auto-approve
/autoresearch:ship --auto

# Dry-run a deployment before going live
/autoresearch:ship --type deployment --dry-run

# Ship with post-deployment monitoring
/autoresearch:ship --monitor 10

# Prepare iteratively then ship
/autoresearch:ship
Iterations: 5

# Just check if something is ready to ship
/autoresearch:ship --checklist-only

# Ship a blog post
/autoresearch:ship
Target: content/blog/my-new-post.md
Type: content

# Ship a sales deck
/autoresearch:ship --type sales
Target: decks/q1-proposal.pdf

# Rollback a bad deployment
/autoresearch:ship --rollback

Composite metric (for bounded loops):

ship_score = (checklist_passing / checklist_total) * 80
           + (dry_run_passed ? 15 : 0)
           + (no_blockers ? 5 : 0)

Score of 100 = fully ready. Below 80 = not shippable.

Output directory: Creates ship/{YYMMDD}-{HHMM}-{ship-slug}/ with checklist.md, ship-log.tsv, summary.md.

/autoresearch:scenario — Scenario-Driven Use Case Generator

Autonomous scenario exploration engine that generates, expands, and stress-tests use cases from a seed scenario. Discovers edge cases, failure modes, and derivative scenarios that manual analysis misses.

Load: references/scenario-workflow.md for full protocol.

What it does:

Seed Analysis — parse scenario, identify actors, goals, preconditions, components
Decomposition — break into 12 exploration dimensions (happy path, error, edge case, abuse, scale, concurrent, temporal, data variation, permission, integration, recovery, state transition)
Situation Generation — create one concrete situation per iteration from unexplored dimensions
Classification — deduplicate (new/variant/duplicate/out-of-scope/low-value)
Expansion — derive edge cases, what-ifs, failure modes from each kept situation
Logging — record to scenario-results.tsv with dimension, severity, classification
Repeat — pick next unexplored dimension/combination, iterate

Key behaviors:

Adaptive interactive setup: 4-8 questions based on how much context the user provides
12 exploration dimensions ensure comprehensive coverage
Domain-specific templates (software, product, business, security, marketing)
Every situation requires concrete trigger, flow, and expected outcome — no vague "something goes wrong"
Composite metric: scenarios_generated*10 + edge_cases_found*15 + (dimensions_covered/12)*30 + unique_actors*5
Creates scenario/{YYMMDD}-{HHMM}-{slug}/ with: scenarios.md, use-cases.md, edge-cases.md, scenario-results.tsv, summary.md

Flags:

Flag	Purpose
`--domain <type>`	Set domain (software, product, business, security, marketing)
`--depth <level>`	Exploration depth: shallow (10), standard (25), deep (50+)
`--scope <glob>`	Limit to specific files/features
`--format <type>`	Output: use-cases, user-stories, test-scenarios, threat-scenarios, mixed
`--focus <area>`	Prioritize dimension: edge-cases, failures, security, scale

Usage:

# Unlimited — keep exploring until interrupted
/autoresearch:scenario

# Bounded with context
/autoresearch:scenario
Scenario: User attempts checkout with multiple payment methods
Domain: software
Depth: standard
Iterations: 25

# Quick edge case scan
/autoresearch:scenario --depth shallow --focus edge-cases
Scenario: File upload feature for profile pictures

# Security-focused
/autoresearch:scenario --domain security
Scenario: OAuth2 login flow with third-party providers
Iterations: 30

# Generate test scenarios
/autoresearch:scenario --format test-scenarios --domain software
Scenario: REST API pagination with filtering and sorting

/autoresearch:predict — Multi-Persona Swarm Prediction

Multi-perspective code analysis using swarm intelligence principles. Simulates 3-5 expert personas (Architect, Security Analyst, Performance Engineer, Reliability Engineer, Devil's Advocate) that independently analyze code, debate findings, and reach consensus — all within Claude's native context. Zero external dependencies.

Load: references/predict-workflow.md for full protocol.

What it does:

Codebase Reconnaissance — scan files, extract entities, map dependencies into knowledge .md files
Persona Generation — create 3-5 expert personas from codebase context
Independent Analysis — each persona analyzes code from their unique perspective
Structured Debate — 1-2 rounds of cross-examination with mandatory Devil's Advocate dissent
Consensus — synthesizer aggregates findings with confidence scores + anti-herd check
Knowledge Output — write predict/ folder with codebase-analysis.md, dependency-map.md, component-clusters.md
Report — generate findings.md, hypothesis-queue.md, overview.md
Handoff — write handoff.json for optional --chain to debug/security/fix/ship/scenario

Key behaviors:

File-based knowledge representation: .md files ARE the knowledge graph, zero external deps
Git-hash stamping: every output embeds commit SHA for staleness detection
Incremental updates: only re-analyzes files changed since last run
Anti-herd mechanism: Devil's Advocate mandatory, groupthink detection via flip rate + entropy
Empirical evidence always trumps swarm prediction when chained with autoresearch loop
Composite metric: findings_confirmed*15 + findings_probable*8 + minority_preserved*3 + (personas/total)*20 + (rounds/planned)*10 + anti_herd_passed*5
Creates predict/{YYMMDD}-{HHMM}-{slug}/ folder with: overview.md, codebase-analysis.md, dependency-map.md, component-clusters.md, persona-debates.md, hypothesis-queue.md, findings.md, predict-results.tsv, handoff.json

Flags:

Flag	Purpose
`--chain <targets>`	Chain to tools. Single: `--chain debug`. Multi: `--chain scenario,debug,fix` (sequential)
`--personas N`	Number of personas (default: 5, range: 3-8)
`--rounds N`	Debate rounds (default: 2, range: 1-3)
`--depth <level>`	Depth preset: shallow (3 personas, 1 round), standard (5, 2), deep (8, 3)
`--adversarial`	Use adversarial persona set (Red Team, Blue Team, Insider, Supply Chain, Judge)
`--budget <N>`	Max total findings across all personas (default: 40)
`--fail-on <severity>`	Exit non-zero if findings at or above severity (for CI/CD)
`--scope <glob>`	Limit analysis to specific files

Usage:

# Standard analysis
/autoresearch:predict
Scope: src/**/*.ts
Goal: Find reliability issues

# Quick security scan
/autoresearch:predict --depth shallow --chain security
Scope: src/api/**

# Deep analysis with adversarial debate
/autoresearch:predict --depth deep --adversarial
Goal: Pre-deployment quality audit

# CI/CD gate
/autoresearch:predict --fail-on critical --budget 20
Scope: src/**
Iterations: 1

# Chain to debug for hypothesis-driven investigation
/autoresearch:predict --chain debug
Scope: src/auth/**
Goal: Investigate intermittent 500 errors

# Multi-chain: predict → scenario → debug → fix (sequential pipeline)
/autoresearch:predict --chain scenario,debug,fix
Scope: src/**
Goal: Full quality pipeline for new feature

/autoresearch:learn — Autonomous Codebase Documentation Engine

Scouts codebase structure, learns patterns and architecture, generates/updates comprehensive documentation — then validates and iteratively improves until docs match codebase reality.

Load: references/learn-workflow.md for full protocol.

What it does:

Scout — parallel codebase reconnaissance with scale awareness and monorepo detection
Analyze — project type classification, tech stack detection, staleness measurement
Map — dynamic doc discovery (docs/*.md), gap analysis, conditional doc selection
Generate — spawn docs-manager with structured prompt template and full context
Validate — mechanical verification (code refs, links, completeness, size compliance)
Fix — validation-fix loop: re-generate failed docs with feedback (max 3 retries)
Finalize — inventory check, git diff summary, size compliance
Log — record results to learn-results.tsv

4 Modes:

Mode	Purpose	Autoresearch Loop?
`init`	Learn codebase from scratch, generate all docs	Yes — validate-fix cycle
`update`	Learn what changed, refresh existing docs	Yes — validate-fix cycle
`check`	Read-only health/staleness assessment	No — diagnostic only
`summarize`	Quick codebase summary with file inventory	Minimal — size check only

Key behaviors:

Fully dynamic doc discovery — scans docs/*.md, no hardcoded file lists
State-aware mode detection — auto-selects init/update based on docs/ state
Project-type-adaptive — creates deployment-guide.md only if deployment config exists
Validation-fix loop capped at 3 retries — escalates to user if unresolved
Scale-aware scouting — adjusts parallelism for 5k+ file codebases
Composite metric: learn_score = validation%×0.5 + coverage%×0.3 + size_compliance%×0.2
Creates learn/{YYMMDD}-{HHMM}-{slug}/ with: learn-results.tsv, summary.md, validation-report.md, scout-context.md

Flags:

Flag	Purpose
`--mode <mode>`	Operation: init, update, check, summarize (default: auto-detect)
`--scope <glob>`	Limit codebase learning to specific dirs
`--depth <level>`	Doc comprehensiveness: quick, standard, deep
`--scan`	Force fresh scout in summarize mode
`--topics <list>`	Focus summarize on specific topics
`--file <name>`	Selective update — target single doc
`--no-fix`	Skip validation-fix loop
`--format <fmt>`	Output format: markdown (default). Planned: confluence, rst, html

Usage:

# Auto-detect mode and learn
/autoresearch:learn

# Initialize docs for new project
/autoresearch:learn --mode init --depth deep

# Update docs after changes
/autoresearch:learn --mode update
Iterations: 3

# Read-only health check
/autoresearch:learn --mode check

# Quick summary
/autoresearch:learn --mode summarize --scan

# Selective update of one doc
/autoresearch:learn --mode update --file system-architecture.md

# Scoped learning
/autoresearch:learn --scope src/api/**
Iterations: 5

/autoresearch:plan — Goal → Configuration Wizard

Converts a plain-language goal into a validated, ready-to-execute autoresearch configuration.

Load: references/plan-workflow.md for full protocol.

Quick summary:

Capture Goal — ask what the user wants to improve (or accept inline text)
Analyze Context — scan codebase for tooling, test runners, build scripts
Define Scope — suggest file globs, validate they resolve to real files
Define Metric — suggest mechanical metrics, validate they output a number
Define Direction — higher or lower is better
Define Verify — construct the shell command, dry-run it, confirm it works
Confirm & Launch — present the complete config, offer to launch immediately

Critical gates:

Metric MUST be mechanical (outputs a parseable number, not subjective)
Verify command MUST pass a dry run on the current codebase before accepting
Scope MUST resolve to ≥1 file

Usage:

/autoresearch:plan
Goal: Make the API respond faster

/autoresearch:plan Increase test coverage to 95%

/autoresearch:plan Reduce bundle size below 200KB

After the wizard completes, the user gets a ready-to-paste /autoresearch invocation — or can launch it directly.

When to Activate

User invokes /autoresearch → run the loop
User invokes /autoresearch:plan → run the planning wizard
User invokes /autoresearch:security → run the security audit
User says "help me set up autoresearch", "plan an autoresearch run" → run the planning wizard
User says "security audit", "threat model", "OWASP", "STRIDE", "find vulnerabilities", "red-team" → run the security audit
User invokes /autoresearch:ship → run the ship workflow
User says "ship it", "deploy this", "publish this", "launch this", "get this out the door" → run the ship workflow
User invokes /autoresearch:debug → run the debug loop
User says "find all bugs", "hunt bugs", "debug this", "why is this failing", "investigate" → run the debug loop
User invokes /autoresearch:fix → run the fix loop
User says "fix all errors", "make tests pass", "fix the build", "clean up errors" → run the fix loop
User invokes /autoresearch:scenario → run the scenario loop
User says "explore scenarios", "generate use cases", "what could go wrong", "stress test this feature", "edge cases for" → run the scenario loop
User invokes /autoresearch:learn → run the learn workflow
User says "learn this codebase", "generate docs", "document this project", "create documentation", "update docs", "check docs", "docs health" → run the learn workflow
User invokes /autoresearch:predict → run the predict workflow
User says "predict", "multi-perspective", "swarm analysis", "what do multiple experts think", "analyze from different angles" → run the predict workflow
User says "work autonomously", "iterate until done", "keep improving", "run overnight" → run the loop
Any task requiring repeated iteration cycles with measurable outcomes → run the loop

Bounded Iterations

By default, autoresearch loops forever until manually interrupted. To run exactly N iterations, add Iterations: N to your inline config.

Unlimited (default):

/autoresearch
Goal: Increase test coverage to 90%

Bounded (N iterations):

/autoresearch
Goal: Increase test coverage to 90%
Iterations: 25

After N iterations Claude stops and prints a final summary with baseline → current best, keeps/discards/crashes. If the goal is achieved before N iterations, Claude prints early completion and stops.

When to Use Bounded Iterations

Scenario	Recommendation
Run overnight, review in morning	Unlimited (default)
Quick 30-min improvement session	`Iterations: 10`
Targeted fix with known scope	`Iterations: 5`
Exploratory — see if approach works	`Iterations: 15`
CI/CD pipeline integration	`--iterations N` flag (set N based on time budget)

Setup Phase (Do Once)

If the user provides Goal, Scope, Metric, and Verify inline → extract them and proceed to step 5.

CRITICAL: If ANY critical field is missing (Goal, Scope, Metric, Direction, or Verify), you MUST use AskUserQuestion to collect them interactively. DO NOT proceed to The Loop or any execution phase without completing this setup. This is a BLOCKING prerequisite.

Interactive Setup (when invoked without full config)

Scan the codebase first for smart defaults, then ask ALL questions in batched AskUserQuestion calls (max 4 per call). This gives users full clarity upfront.

Batch 1 — Core config (4 questions in one call):

Use a SINGLE AskUserQuestion call with these 4 questions:

#	Header	Question	Options (smart defaults from codebase scan)
1	`Goal`	"What do you want to improve?"	"Test coverage (higher)", "Bundle size (lower)", "Performance (faster)", "Code quality (fewer errors)"
2	`Scope`	"Which files can autoresearch modify?"	Suggested globs from project structure (e.g. "src/*/.ts", "content/*/.md")
3	`Metric`	"What number tells you if it got better? (must be a command output, not subjective)"	Detected options: "coverage % (higher)", "bundle size KB (lower)", "error count (lower)", "test pass count (higher)"
4	`Direction`	"Higher or lower is better?"	"Higher is better", "Lower is better"

Batch 2 — Verify + Guard + Launch (3 questions in one call):

#	Header	Question	Options
5	`Verify`	"What command produces the metric? (I'll dry-run it to confirm)"	Suggested commands from detected tooling
6	`Guard`	"Any command that must ALWAYS pass? (prevents regressions)"	"npm test", "tsc --noEmit", "npm run build", "Skip — no guard"
7	`Launch`	"Ready to go?"	"Launch (unlimited)", "Launch with iteration limit", "Edit config", "Cancel"

After Batch 2: Dry-run the verify command. If it fails, ask user to fix or choose a different command. If it passes, proceed with launch choice.

IMPORTANT: You MUST call AskUserQuestion with batched questions — never ask one at a time, and never skip this step. Users should see all config choices together for full context. DO NOT proceed to Setup Steps or The Loop without completing interactive setup.

Setup Steps (after config is complete)

Read all in-scope files for full context before any modification
Define the goal — extracted from user input or inline config
Define scope constraints — validated file globs
Define guard (optional) — regression prevention command
Create a results log — Track every iteration (see references/results-logging.md)
Establish baseline — Run verification on current state AND guard (if set). Record as iteration #0
Confirm and go — Show user the setup, get confirmation, then BEGIN THE LOOP

The Loop

Read references/autonomous-loop-protocol.md for full protocol details.

LOOP (FOREVER or N times):
  1. Review: Read current state + git history + results log
  2. Ideate: Pick next change based on goal, past results, what hasn't been tried
  3. Modify: Make ONE focused change to in-scope files
  4. Commit: Git commit the change (before verification)
  5. Verify: Run the mechanical metric (tests, build, benchmark, etc.)
  6. Guard: If guard is set, run the guard command
  7. Decide:
     - IMPROVED + guard passed (or no guard) → Keep commit, log "keep", advance
     - IMPROVED + guard FAILED → Revert, then try to rework the optimization
       (max 2 attempts) so it improves the metric WITHOUT breaking the guard.
       Never modify guard/test files — adapt the implementation instead.
       If still failing → log "discard (guard failed)" and move on
     - SAME/WORSE → Git revert, log "discard"
     - CRASHED → Try to fix (max 3 attempts), else log "crash" and move on
  8. Log: Record result in results log
  9. Repeat: Go to step 1.
     - If unbounded: NEVER STOP. NEVER ASK "should I continue?"
     - If bounded (N): Stop after N iterations, print final summary

Critical Rules

Loop until done — Unbounded: loop until interrupted. Bounded: loop N times then summarize.
Read before write — Always understand full context before modifying
One change per iteration — Atomic changes. If it breaks, you know exactly why
Mechanical verification only — No subjective "looks good". Use metrics
Automatic rollback — Failed changes revert instantly. No debates
Simplicity wins — Equal results + less code = KEEP. Tiny improvement + ugly complexity = DISCARD
Git is memory — Every experiment committed with experiment: prefix. Use git revert (not git reset --hard) for rollbacks so failed experiments remain visible in history. Agent MUST read git log and git diff of kept commits to learn patterns before each iteration
When stuck, think harder — Re-read files, re-read goal, combine near-misses, try radical changes. Don't ask for help unless truly blocked by missing access/permissions

Principles Reference

See references/core-principles.md for the 7 generalizable principles from autoresearch.

Adapting to Different Domains

Domain	Metric	Scope	Verify Command	Guard
Backend code	Tests pass + coverage %	`src/*/.ts`	`npm test`	—
Frontend UI	Lighthouse score	`src/components/**`	`npx lighthouse`	`npm test`
ML training	val_bpb / loss	`train.py`	`uv run train.py`	—
Blog/content	Word count + readability	`content/*.md`	Custom script	—
Performance	Benchmark time (ms)	Target files	`npm run bench`	`npm test`
Refactoring	Tests pass + LOC reduced	Target module	`npm test && wc -l`	`npm run typecheck`
Security	OWASP + STRIDE coverage + findings	API/auth/middleware	`/autoresearch:security`	—
Shipping	Checklist pass rate (%)	Any artifact	`/autoresearch:ship`	Domain-specific
Debugging	Bugs found + coverage	Target files	`/autoresearch:debug`	—
Fixing	Error count (lower)	Target files	`/autoresearch:fix`	`npm test`
Scenario analysis	Scenario coverage score (higher)	Feature/domain files	`/autoresearch:scenario`	—
Scenarios	Use cases + edge cases + dimension coverage	Target feature/files	`/autoresearch:scenario`	—
Prediction	Findings + hypotheses (higher)	Target files	`/autoresearch:predict`	—
Documentation	Validation pass rate (higher)	`docs/*.md`	`/autoresearch:learn`	`npm test`

Adapt the loop to your domain. The PRINCIPLES are universal; the METRICS are domain-specific.