name: skill-optimizer
version: 0.1.0
description: Self-evolving skill optimization via SkillOpt-paper-grounded text-space optimizer.
triggers:
  - "optimize this skill"
  - "tune the skill against the benchmark"
  - "make the skill better"
  - "run skillopt"
  - "skillopt for"
mutating: true
brain_first: exempt
Skill Optimizer
Self-evolving skill optimization. Treats SKILL.md as the trainable parameters of a frozen agent. Validation-gated, budget-capped, atomic-versioned.
Based on SkillOpt (arXiv 2605.23904, Microsoft Research, May 2026).
When to invoke this skill
The user wants to:
- Improve an existing skill's execution quality against a benchmark
- Bootstrap a benchmark file for a new skill
- Re-tune a skill after switching target models
Iron Law
- Validation gating is MANDATORY. Every candidate must clear the median-of-3 + epsilon=0.05 margin against the sel-set before SKILL.md gets rewritten.
- Frontmatter mutation is FORBIDDEN. The optimizer only edits the body. The routing surface (triggers:, brain_first:) stays invariant.
- Bundled skills require explicit opt-in AND an independent held-out set. Skills shipping with gbrain cannot be auto-mutated. To rewrite one in place the user passes BOTH --allow-mutate-bundled AND --held-out <path> with at least 5 benchmark-disjoint tasks; without the held-out set the run hard-refuses (exit 2). Drop --allow-mutate-bundled (or pass --no-mutate, the default for the dream-cycle phase) to write proposed.md for review instead; no held-out set is needed for review-only output.
- Bootstrap output requires human review. Both --bootstrap-from-skill and --bootstrap-from-routing write a sentinel; you must review + STRENGTHEN the generated judges, delete the sentinel, and re-run with --bootstrap-reviewed before optimization can use the file.
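The validation gate in the first rule can be pictured with a minimal sketch. The function names, the strict ">" comparison, and the choice to throw on anything other than three evaluations are assumptions made for illustration; only the median-of-3 and epsilon=0.05 values come from this document.

```typescript
// Hypothetical sketch of the validation gate; not the gbrain implementation.
const EPSILON = 0.05; // margin stated in this document

// Median of a non-empty list of scores.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Accept a candidate only if the median of three sel-set evaluations
// beats the current best by more than epsilon (strictness is assumed).
function passesValidationGate(
  selScores: number[],
  bestSelScore: number,
  epsilon: number = EPSILON
): boolean {
  if (selScores.length !== 3) {
    throw new Error("expected exactly 3 sel-set evaluations");
  }
  return median(selScores) > bestSelScore + epsilon;
}
```

Taking the median rather than the mean means a single lucky rollout cannot carry a candidate over the margin.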
The pipeline
gbrain skillopt <skill-name> [flags]
│
├── Pre-flight gates
│ ├── working tree clean (or --force)
│ ├── benchmark valid + D_sel >= 5 (D17)
│ ├── cost preflight (D3) — refuses over --max-cost-usd
│ └── per-skill DB lock (D14)
│
├── Baseline eval on D_sel (sets best_sel_score)
│
├── for epoch in 1..N:
│ for step in 1..steps_per_epoch:
│ ├── forward pass: rollouts on D_train batch
│ ├── backward pass: reflect × 2 (failures + successes per D7)
│ ├── rank + clip via LR cosine schedule
│ ├── apply edits (body-only per D5, tagged result per D9)
│ ├── validation gate: median-of-3 + epsilon=0.05 (D12)
│ └── if accept: commit via D8 history-intent-first
│ │
│ └── slow update (D6) if no improvement this epoch
│
└── Final test eval on D_test → run receipt
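The "apply edits (body-only)" step implies a check that the frontmatter survives every edit byte for byte. Below is a minimal sketch of such a guard. It assumes SKILL.md opens with a frontmatter block delimited by --- lines, which this document does not state explicitly, and all function names are hypothetical.

```typescript
// Hypothetical sketch of a body-only edit guard; not the gbrain implementation.
interface SkillParts {
  frontmatter: string; // includes both --- delimiter lines, or "" if absent
  body: string;
}

// Split a skill file into frontmatter and body.
// Assumption: frontmatter is a leading block fenced by --- lines.
function splitSkill(text: string): SkillParts {
  const match = /^---\r?\n[\s\S]*?\r?\n---\r?\n?/.exec(text);
  if (match === null) return { frontmatter: "", body: text };
  return { frontmatter: match[0], body: text.slice(match[0].length) };
}

// True when a candidate file carries exactly the original frontmatter.
function frontmatterUnchanged(original: string, candidate: string): boolean {
  return splitSkill(original).frontmatter === splitSkill(candidate).frontmatter;
}

// Apply an edit to the body only and re-attach the untouched frontmatter.
function applyBodyEdit(original: string, edit: (body: string) => string): string {
  const parts = splitSkill(original);
  const candidate = parts.frontmatter + edit(parts.body);
  if (!frontmatterUnchanged(original, candidate)) {
    throw new Error("frontmatter changed; edit rejected");
  }
  return candidate;
}
```

Comparing the raw frontmatter text, rather than parsed YAML, also catches whitespace or ordering changes in triggers: that a parser would treat as equivalent.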
Starting a benchmark from the skill itself (the common case)
The user will NOT hand-write a benchmark, and you shouldn't start from a blank
file either. When the user says "make skill X better" and
skills/X/skillopt-benchmark.jsonl doesn't exist, generate a starter from the
SKILL.md directly:
- Generate the starter. Run:
    gbrain skillopt X --bootstrap-from-skill
  One LLM call reads skills/X/SKILL.md, infers what the skill produces and what "good" looks like, and writes ~15 tasks (each with rule judges) to skills/X/skillopt-benchmark.jsonl plus a # BOOTSTRAP_PENDING_REVIEW sentinel. No routing-eval.jsonl is needed. Tune the count with --bootstrap-tasks N (max 50).
- Review AND STRENGTHEN the judges. This is YOUR job and it is load-bearing. The generated rule checks are weak drafts: the model tends to emit generic contains, loose max_chars, or invented headings. Read each task, fix soft checks, and add the must-haves the skill actually requires (real section names, real length ceilings, min_citations where sources are expected, tool_called / tool_not_called for tools the skill genuinely uses). A thin benchmark optimizes for a thin definition of quality; do not rubber-stamp.
- Delete the sentinel line (# BOOTSTRAP_PENDING_REVIEW, the last line).
- Run the optimizer with --split 1:1:1:
    gbrain skillopt X --bootstrap-reviewed --split 1:1:1
  The 1:1:1 split is REQUIRED for a 15-task starter: the default 4:1:5 makes the validation set floor(15/10)=1, below the D_sel >= 5 floor, and the optimizer refuses with d_sel_too_small. (4:1:5 needs ~50 tasks.) Add --dry-run first to preview cost.
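The split arithmetic above can be checked with a short sketch. It assumes the ratio is ordered train:sel:test (consistent with the floor(15/10)=1 figure for 4:1:5) and that the sel-set size is the floored share of the task count; the function names are hypothetical.

```typescript
// Hypothetical sketch of the D_sel size check; not the gbrain implementation.
const MIN_SEL_TASKS = 5; // the D_sel >= 5 floor stated in this document

// Size of the validation (sel) set for a "train:sel:test" ratio string.
// Assumption: the middle component is the sel share and the result is floored.
function selSetSize(taskCount: number, ratio: string): number {
  const parts = ratio.split(":").map(Number);
  if (parts.length !== 3 || parts.some((p) => !Number.isInteger(p) || p <= 0)) {
    throw new Error("ratio must look like 4:1:5");
  }
  const total = parts[0] + parts[1] + parts[2];
  return Math.floor((taskCount * parts[1]) / total);
}

// True when the split leaves enough validation tasks to run.
function meetsSelFloor(taskCount: number, ratio: string): boolean {
  return selSetSize(taskCount, ratio) >= MIN_SEL_TASKS;
}

console.log(selSetSize(15, "4:1:5")); // 1, below the floor
console.log(selSetSize(15, "1:1:1")); // 5, meets the floor
console.log(selSetSize(50, "4:1:5")); // 5, meets the floor
```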
Benchmark line shape (what the generator writes, one per line):
{"task_id":"x-001","task":"<user prompt>","judge":{"kind":"rule","checks":[{"op":"max_chars","arg":1800},{"op":"contains","arg":"agenda"}]}}
Rule-check vocabulary you'll strengthen with: contains, regex,
section_present, max_chars, min_citations, tool_called, tool_not_called.
Rule judges are deterministic and free, but shallow for skills whose quality is
sequencing, privacy, refusal boundaries, or file placement — for those, hand-add
richer checks (or an llm judge) during review.
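To make the rule-judge idea concrete, here is a minimal sketch that evaluates three of the listed ops against a model output. The semantics are assumptions: contains is treated as a case-sensitive substring test, regex as an unanchored pattern, and max_chars as an inclusive ceiling. The remaining ops are left out because their exact behavior is not described here.

```typescript
// Hypothetical sketch of a rule judge; op semantics are assumed, not confirmed.
type RuleCheck =
  | { op: "contains"; arg: string }
  | { op: "regex"; arg: string }
  | { op: "max_chars"; arg: number };

interface BenchmarkTask {
  task_id: string;
  task: string;
  judge: { kind: "rule"; checks: RuleCheck[] };
}

// Evaluate one check against the skill's output text.
function runCheck(check: RuleCheck, output: string): boolean {
  switch (check.op) {
    case "contains":
      return output.includes(check.arg); // case-sensitive (assumption)
    case "regex":
      return new RegExp(check.arg).test(output);
    case "max_chars":
      return output.length <= check.arg; // inclusive ceiling (assumption)
  }
}

// A task passes only when every check passes.
function judgeOutput(task: BenchmarkTask, output: string): boolean {
  return task.judge.checks.every((check) => runCheck(check, output));
}

// The benchmark line shape shown in this document, parsed from one JSONL line.
const exampleLine =
  '{"task_id":"x-001","task":"<user prompt>","judge":{"kind":"rule","checks":[{"op":"max_chars","arg":1800},{"op":"contains","arg":"agenda"}]}}';
const exampleTask = JSON.parse(exampleLine) as BenchmarkTask;

console.log(judgeOutput(exampleTask, "Proposed agenda: intro, demo, Q&A")); // true
```

The example also shows why a lone contains check is weak: any output that mentions "agenda" within the length limit passes, whatever its quality.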
Fallback — author freehand. If the generated starter is poor (rare, but
possible for very behavior-shaped skills), discard it and write the JSONL
yourself: read the SKILL.md, write ~15 realistic tasks covering the boring middle,
attach >=2 rule checks each, save to skills/X/skillopt-benchmark.jsonl, run with
--split 1:1:1. The human walkthrough lives at
docs/tutorials/improving-skills-with-skillopt.md.
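Before handing a reviewed or hand-written file to the optimizer, a quick lint can catch the two mistakes this section warns about: a leftover sentinel and tasks with fewer than two rule checks. The sketch below is a hypothetical helper, not part of gbrain, and it assumes the sentinel occupies its own line.

```typescript
// Hypothetical pre-run lint for a benchmark JSONL file; not part of gbrain.
const SENTINEL = "# BOOTSTRAP_PENDING_REVIEW";

interface ParsedLine {
  task_id?: string;
  judge?: { checks?: unknown[] };
}

// Parse one JSONL line, returning null when it is not a JSON object.
function parseLine(text: string): ParsedLine | null {
  try {
    const value: unknown = JSON.parse(text);
    if (value === null || typeof value !== "object") return null;
    return value as ParsedLine;
  } catch {
    return null;
  }
}

// Return a list of human-readable problems; an empty list means the file looks ready.
function lintBenchmark(jsonl: string): string[] {
  const problems: string[] = [];
  const lines = jsonl.split("\n").map((l) => l.trim()).filter((l) => l.length > 0);
  lines.forEach((text, index) => {
    if (text.startsWith(SENTINEL)) {
      problems.push("sentinel still present: review the judges, then delete it");
      return;
    }
    const parsed = parseLine(text);
    if (parsed === null) {
      problems.push("line " + (index + 1) + ": not a JSON object");
      return;
    }
    const checks = parsed.judge?.checks ?? [];
    if (checks.length < 2) {
      problems.push("line " + (index + 1) + ": fewer than 2 rule checks");
    }
  });
  return problems;
}
```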
Decision tree
| Situation | Action |
|---|---|
| Skill has no benchmark | gbrain skillopt foo --bootstrap-from-skill → review + strengthen the judges → delete sentinel → gbrain skillopt foo --bootstrap-reviewed --split 1:1:1 (see section above) |
| Skill has a routing-eval.jsonl and you want a head start | gbrain skillopt foo --bootstrap-from-routing → review the generated tasks → --bootstrap-reviewed (routing tasks test dispatch; tighten them into quality tasks before trusting) |
| Iterating on an existing skill | gbrain skillopt foo --benchmark skills/foo/skillopt-benchmark.jsonl |
| Costly run, want preview | Add --dry-run |
| Bundled skill (skills/ in gbrain repo) | Default writes proposed.md; to commit in place add --allow-mutate-bundled AND --held-out <path> (>=5 benchmark-disjoint tasks) — else it hard-refuses |
| Want to review changes before applying | Add --no-mutate (writes proposed.md, no held-out needed) |
| Guard against benchmark overfitting | Add --held-out <path> — a candidate that beats the benchmark but regresses on the held-out set is refused |
| Mid-run crash | gbrain skillopt foo --resume <run-id> |
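The bundled-skill rows of the table reduce to a small decision function. This is a sketch under stated assumptions: --no-mutate is taken to win when combined with --allow-mutate-bundled, and a held-out set with fewer than 5 tasks is treated the same as a missing one. Neither case is spelled out in this document, and the type and function names are hypothetical.

```typescript
// Hypothetical sketch of the bundled-skill decision; not the gbrain implementation.
type BundledDecision =
  | { action: "mutate_in_place" }
  | { action: "write_proposed" }
  | { action: "refuse"; exitCode: 2 };

interface BundledFlags {
  allowMutateBundled: boolean; // --allow-mutate-bundled
  noMutate: boolean; // --no-mutate
  heldOutTaskCount: number | null; // tasks in --held-out <path>, null if not passed
}

const MIN_HELD_OUT_TASKS = 5; // "at least 5 benchmark-disjoint tasks"

function decideBundled(flags: BundledFlags): BundledDecision {
  // Review-only output needs no held-out set.
  if (flags.noMutate || !flags.allowMutateBundled) {
    return { action: "write_proposed" };
  }
  // In-place mutation was requested: the held-out set is mandatory.
  if (flags.heldOutTaskCount === null || flags.heldOutTaskCount < MIN_HELD_OUT_TASKS) {
    return { action: "refuse", exitCode: 2 };
  }
  return { action: "mutate_in_place" };
}
```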
Output Format
When invoked, this skill produces:
- Updated skills/<name>/SKILL.md (when mutation is allowed)
- skills/<name>/skillopt/best.md: pointer copy of current best
- skills/<name>/skillopt/versions/vNNNN_eN_sN.md: per-step snapshots
- skills/<name>/skillopt/history.json: append-only run record
- skills/<name>/skillopt/rejected.json: bounded LRU of rejected edits
- ~/.gbrain/audit/skillopt-YYYY-Www.jsonl: ISO-week-rotated audit trail
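The two patterned file names can be illustrated with a short sketch. It assumes NNNN is a zero-padded four-digit version counter, that e and s carry unpadded epoch and step numbers, and that the audit week is computed in UTC under ISO 8601 rules; the document gives only the patterns, so these details and the function names are hypothetical.

```typescript
// Hypothetical sketch of the output file naming; padding and UTC are assumptions.
function zeroPad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Per-step snapshot name following the vNNNN_eN_sN.md pattern.
function snapshotName(version: number, epoch: number, step: number): string {
  return "v" + zeroPad(version, 4) + "_e" + epoch + "_s" + step + ".md";
}

// ISO 8601 week stamp (YYYY-Www): a week belongs to the year of its Thursday.
function isoWeekStamp(date: Date): string {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const dayNum = d.getUTCDay() || 7; // Monday=1 ... Sunday=7
  d.setUTCDate(d.getUTCDate() + 4 - dayNum); // move to the Thursday of this week
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86400000 + 1) / 7);
  return d.getUTCFullYear() + "-W" + zeroPad(week, 2);
}

// Audit trail file name following the skillopt-YYYY-Www.jsonl pattern.
function auditFileName(date: Date): string {
  return "skillopt-" + isoWeekStamp(date) + ".jsonl";
}

console.log(snapshotName(3, 1, 2)); // v0003_e1_s2.md
console.log(auditFileName(new Date(Date.UTC(2021, 0, 1)))); // skillopt-2020-W53.jsonl
```

The second example shows the ISO-week edge case: 1 January 2021 falls in the last week of 2020, so the year in the file name can differ from the calendar year.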
Anti-Patterns
- Don't bypass the validation gate. The median-of-3 + epsilon=0.05 is load-bearing; without it, the optimizer accepts noise as improvement.
- Don't optimize bundled skills without --allow-mutate-bundled AND --held-out. They ship with gbrain and are load-bearing for downstream agents. In-place mutation requires both flags (held-out >=5 benchmark-disjoint tasks); without the held-out set the run hard-refuses and points you at proposed.md.
- Don't use bootstrap output without strengthening it. Both --bootstrap-from-skill and --bootstrap-from-routing have the optimizer model invent success criteria, which are generic and weak by default. Review and tighten the judges before SkillOpt optimizes against them, or it trains the skill toward benchmark artifacts instead of real quality.
- Don't skip --split 1:1:1 on a ~15-task starter. The default 4:1:5 split drops the validation set below the D_sel >= 5 floor and the run aborts with d_sel_too_small.
Contract
runSkillOpt(opts) returns:
{
outcome: 'accepted' | 'no_improvement' | 'aborted' | 'errored',
receipt: {
run_id, skill_sha8, benchmark_sha8, models, cost,
baseline_sel_score, best_sel_score, // real measured baseline (no longer hardcoded 0)
baseline_test_score, test_score, // final held-out test-split eval
},
finalText: string,
mutatedSkillFile: boolean,
proposedPath?: string
}
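A caller might consume this result as sketched below. The field types are assumptions (the contract above names the fields but not their types), only a subset of the receipt is modeled, and summarizeRun is a hypothetical helper rather than part of the skill.

```typescript
// Hypothetical typing of the contract; field types are assumed.
type Outcome = "accepted" | "no_improvement" | "aborted" | "errored";

interface SkillOptResult {
  outcome: Outcome;
  receipt: {
    run_id: string;
    baseline_sel_score: number;
    best_sel_score: number;
    baseline_test_score: number;
    test_score: number;
    // skill_sha8, benchmark_sha8, models, cost omitted in this sketch
  };
  finalText: string;
  mutatedSkillFile: boolean;
  proposedPath?: string;
}

// Turn a run result into a one-line summary for the user.
function summarizeRun(result: SkillOptResult): string {
  const r = result.receipt;
  const gain = r.baseline_sel_score.toFixed(2) + " -> " + r.best_sel_score.toFixed(2);
  switch (result.outcome) {
    case "accepted":
      if (result.mutatedSkillFile) return "accepted: SKILL.md updated (sel " + gain + ")";
      return "accepted: review " + (result.proposedPath ?? "the proposed file") + " (sel " + gain + ")";
    case "no_improvement":
      return "no_improvement: no candidate cleared the validation gate";
    case "aborted":
      return "aborted: run " + r.run_id + " stopped before completion";
    case "errored":
      return "errored: run " + r.run_id + " failed";
  }
}
```

Branching on mutatedSkillFile before proposedPath keeps the summary correct for review-only runs, where an accepted candidate is written to a proposed file instead of SKILL.md.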
Related skills
- skillify: scaffolds a new skill (use BEFORE skillopt)
- skillpack-check: audits skill conformance (item 13 surfaces skillopt status)
- conventions/quality.md: output quality standards skillopt enforces via judges