name: skill-optimizer description: Improve, debug, benchmark, or refactor an existing Agent Skill from conversation evidence, execution traces, user corrections, eval failures, or target skill files. Use this skill whenever the user asks to optimize, harden, generalize, validate, benchmark, package, or turn observed behavior into durable skill changes. Produces evidence-based diagnosis, reviewable patches, trigger evals, validation cases, and safe next-run behavior; do not use it to perform the target skill's normal task.
Skill Optimizer
Mission
Turn real usage evidence into safer, more reliable, easier-to-trigger, and easier-to-evaluate Agent Skills.
This is a meta-skill. It does not merely rewrite prose. It analyzes the fit between a skill's purpose, trigger description, inputs, procedure, tools, outputs, risks, examples, and evaluations, then proposes small reviewable changes.
Use when
Use this skill when the user asks to improve, optimize, debug, refactor, benchmark, validate, generalize, harden, package, or document another skill. Also use it when the user says the current conversation should be captured into a skill, or that a previous skill run exposed something that should happen differently next time.
Do not use this skill for ordinary task execution. If the user asks to run the release skill, do not optimize the release skill unless they ask to improve it.
Modes
- Diagnose: identify what should change without writing a patch.
- Patch: produce a reviewable diff, replacement section, or revised
SKILL.md. - Validate: create validation cases and failure-mode checks.
- Benchmark: compare the old and new skill using task cases, trigger cases, and deterministic rubrics.
- Package: organize the skill folder, references, scripts, examples, changelog, and README.
Required inputs
Infer these from the conversation before asking follow-up questions:
- target skill name and purpose
- current
SKILL.mdand supporting files, if available - execution evidence: user request, tool use, output, corrections, mistakes, delays, or surprises
- environment: chat, code agent, workspace agent, API, or another harness
- intended optimization mode
- risk level and write-action authority
- sequencing constraints when optimization is requested after another live task
If the target skill file is unavailable, do not fabricate an exact diff. Produce an inferred improvement plan, draft replacement sections, validation cases, and assumptions.
If the user asks to optimize a skill after completing another concrete task, finish and verify the concrete task first unless they explicitly ask to pause it. Then optimize from the observed evidence. Do not interrupt the user's primary workflow just because this skill is mentioned.
Universal optimization lens
Analyze every target skill through these lenses:
- Purpose and scope: what job the skill owns, who it serves, and what it must not do.
- Triggering and boundaries: description quality, should-trigger cases, near-negative cases, competing skills, and under/over-triggering risks.
- Inputs and assumptions: required inputs, source of truth, missing-data behavior, units, locale, time horizon, and user preferences.
- Workflow and decision rules: ordered steps, branch conditions, heuristics, stop conditions, escalation paths, and exception handling.
- Tools and authority: required tools, permissions, external writes, approvals, dry runs, and exact operations.
- Outputs and interfaces: templates, file formats, citations, links, machine-readable fields, handoff artifacts, and user-facing summaries.
- Quality bar and evaluation: success criteria, deterministic verifiers, examples, regression tests, trigger evals, and human review points.
- Safety, privacy, and policy: sensitive data, regulated advice, consent, audit trail, access control, retention, and harmful misuse.
- Failure and recovery: blocked states, retries, rollback, partial completion, cleanup, and user-visible status.
- Maintainability: concise instructions, bundled resources, changelog, version notes, known limitations, and portability across harnesses.
Use references/universal-optimization-lens.md when a deeper diagnosis is needed.
Evidence rules
Treat the current conversation as evidence, not as a script to memorize.
For each important observation, capture:
- Evidence: what happened or what the user corrected.
- Root cause: why the current skill allowed it.
- Durable change: what should be added, removed, or clarified.
- Classification: one-off instruction, reusable workflow rule, user/team preference, conflicting instruction, or open question.
Do not overfit one unusual task into a permanent rule. Convert it into a reusable rule only when it improves future behavior.
Treat strong user corrections as high-signal evidence. Phrases like "no", "not this", "wrong direction", "first principles", or a correction from a component-level answer to a user-scenario answer usually indicate a framing failure, not just a missing detail. Capture:
- the wrong abstraction level the previous run optimized for
- the user's intended source of truth
- what should have been downstream evidence rather than the starting point
- how the target skill should avoid the same drift next time
When the evidence comes from a multi-step execution, include the verification results, not only the final prose. A durable skill change should be grounded in what was requested, what was attempted, what was corrected, and what was ultimately validated.
Framing and abstraction checks
Before proposing a patch, ask whether the target skill optimized the right object:
- User outcome vs. UI surface: a feature or component may be only a view over a deeper workflow.
- Scenario spine vs. fixture rows: realistic data should come from causal user activity, not isolated screen states.
- Source of truth vs. derivative signal: logs, runs, issues, costs, or decisions may be primary; dashboards, calendars, and summaries may be downstream.
- Product intent vs. local convenience: read available product, requirement, or reference docs before encoding a domain rule from one conversation.
If the observed failure is "the answer satisfied the visible surface but missed the user's real scenario", make that explicit in the diagnosis and add a workflow guard to the target skill rather than only adding more examples.
Patch rules
Prefer small, auditable patches over broad rewrites. Preserve the target skill's identity, useful examples, and safety constraints.
A patch may change:
- frontmatter description and trigger boundaries
- required inputs and assumptions
- workflow steps and decision rules
- output templates
- safety and approval requirements
- failure handling
- examples and references
- validation cases and benchmark tasks
Never silently weaken safety requirements. For write actions, publishing, financial actions, medical or legal consequences, hiring decisions, external communication, deletion, deployment, migration, or permissions changes, require explicit authority unless the target skill already has a clear safe policy.
Trigger optimization
The frontmatter description is the primary discovery signal. After improving a skill, evaluate whether the description should change.
Create trigger evals with:
- realistic should-trigger queries
- realistic should-not-trigger near misses
- ambiguous cases where another skill might be more appropriate
- casual phrasing, typos, file paths, role context, and domain language
Optimize for accurate triggering, not maximum triggering.
Domain adaptation
This skill is domain-general. Do not bake one domain's checklist into the core instructions.
When the domain matters, attach or consult a short domain adapter. A good adapter names:
- source of truth
- required inputs
- review owner
- consequential actions and approval gates
- privacy, confidentiality, or consent constraints
- output template
- validation cases and deterministic checks
- must-not behaviors
Use the transcript to extract observed domain markers, but do not encode hidden rubric terms or unrelated best practices as mandatory rules. Keep adapters modular so the optimizer can handle software, healthcare operations, law, finance, education, research, HR, customer support, operations, creative work, personal productivity, and other workflows.
Use references/domain-adapter-patterns.md when building or selecting an adapter. If a matching file exists under references/adapters/, consult it as a compact checklist rather than copying it wholesale into the target skill.
Validation format
Every meaningful behavior change needs at least one validation case.
Use this format:
### Case: <name>
Input:
...
Expected behavior:
...
Must not:
...
Include at least one normal case, one edge case, and one regression case when the change could break prior behavior.
Benchmark reporting
When running evals, separate three scores:
- Trigger accuracy: whether Skill Optimizer should activate.
- Patch-quality coverage: whether the proposed change includes evidence, scope classification, patch, safety, outputs, and validation.
- Downstream transfer: whether the optimized target skill actually improves on its own task suite.
Label synthetic verifier scores as synthetic. Do not report them as official benchmark or leaderboard results.
Packaging expectations
When packaging a skill or skill optimizer project, include:
SKILL.mdand supporting references- README with purpose, installation, usage, eval, limitations, and license
- changelog and version note
- examples of target skills and optimization outputs
- eval cases or a lightweight benchmark harness when available
- a distributable zip that contains exactly one skill folder for installation
Final response contract
Return these sections unless the user asks for a narrower result:
- Target skill and optimization mode
- Diagnosis summary
- Evidence ledger
- Improvement categories
- Proposed patch or revised skill draft
- Trigger eval suggestions when discovery may change
- Validation cases
- Benchmark or eval result, if run
- Assumptions, conflicts, and unresolved questions
When direct file editing is available and the user explicitly requested edits, apply the patch. Otherwise present a reviewable patch.
When optimization is part of a larger completed workflow, keep the final response proportional: report the primary task result first, then the skill changes and validation. Do not force the full nine-section contract if it would obscure the work the user was actually trying to finish.
Quality bar
A successful optimization makes the next run of the target skill more predictable, easier to trigger correctly, safer around irreversible actions, clearer in output, and easier to verify.