iterate-skill



By hyhmrright · Updated 6/1/2026

name: iterate-skill
description: Run the Logic-Lens skill-improvement loop end to end: baseline → diagnose failures → edit → sync cache → re-eval → verify net gain → iterate until clean. Use whenever the goal is to RAISE a skill's eval score or fix a failing eval mode: "improve logic-review", "the format compliance is failing, fix it", "iterate on this skill until the evals pass", "raise the score", "re-run the loop on the latest failures", "run another iteration", "tune the skill description / disambiguation table against the evals". Also use to RESUME a prior loop ("continue improving from where we left off", "do another pass", "iterate further"). Do NOT use for: a one-off question about a skill, shipping a release (use bump-version), or scaffolding a brand-new skill (use new-skill).
disable-model-invocation: true

iterate-skill — the Logic-Lens improvement loop

This is the harness orchestrator. It coordinates three agents and two support skills into one deterministic loop that raises a skill's eval score without overfitting or grader-gaming.

Execution mode: sub-agent pipeline (generate → test → verify). Each step's output is the next step's input, handed off through the filesystem (skills-workspace/iteration-<TAG>/) and agent return values. There is no peer-to-peer team chatter, so agents are spawned via the Agent tool, always with model: "opus" and with subagent_type set to the agent's own definition name (e.g. subagent_type: "skill-editor"), never "general-purpose". Passing "general-purpose" would discard the role and principles defined in .claude/agents/<name>.md, which defeats the harness. Agents return their results to this orchestrator, which owns state and makes the ship/rollback decisions.

Agents (who) — all in .claude/agents/, spawn by these exact subagent_type names:

| subagent_type | Role |
| --- | --- |
| eval-failure-analyzer | Read-only: cluster failures, map to eval IDs, propose minimal edits |
| skill-editor | Apply one minimal, generalized edit; refuse grader edits |
| iteration-guard | Verify net gain vs variance; recommend SHIP / ROLLBACK / RERUN (orchestrator executes any revert) |
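
As a rough illustration of the spawn rule above, the following sketch builds the arguments for one Agent-tool call and rejects any name that is not one of the three defined agents. The function name `build_spawn_request`, the dict shape, and the `prompt` key are assumptions made for illustration; only the `subagent_type` values and `model: "opus"` come from this document.

```python
# Agent names come from the table above; the request shape is an assumption.
AGENT_NAMES = ("eval-failure-analyzer", "skill-editor", "iteration-guard")


def build_spawn_request(agent_name: str, prompt: str) -> dict:
    """Build the arguments for one Agent-tool spawn (hypothetical dict shape)."""
    if agent_name not in AGENT_NAMES:
        # Covers "general-purpose", which would drop the agent's role definition.
        raise ValueError(
            f"unknown agent {agent_name!r}; expected one of {AGENT_NAMES}"
        )
    return {"subagent_type": agent_name, "model": "opus", "prompt": prompt}
```

Validating the name up front keeps a typo or a fallback to "general-purpose" from silently running an agent without its principles.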

Support skills (how): sync-skill-cache (mandatory pre-eval gate), run-iteration-eval (run + grade).

Phase 0 — context check (initial / resume / partial)

Determine the run mode before doing anything:

  • ls -dt skills-workspace/iteration-*/ — if recent iterations exist and the user asks to "continue" or "another pass" → resume: use the latest as baseline, skip re-baselining.
  • User provides a fresh target skill / new failure → initial: establish a baseline first (Phase 1).
  • User asks to redo just one mode or one skill → partial: scope the eval to the affected case IDs.

Confirm the target skill (which of the six logic-*) and the failing mode with the user if ambiguous — do not guess which skill to mutate.
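
The run-mode decision above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual logic: the phrase list, the `scoped` flag, and both function names are hypothetical, and a real orchestrator would judge the user's intent rather than match substrings.

```python
from pathlib import Path
from typing import Optional

# Hypothetical phrases that signal a resume request.
RESUME_PHRASES = ("continue", "another pass", "iterate further")


def latest_iteration(workspace: Path) -> Optional[Path]:
    """Return the most recently modified iteration-* dir, or None if there is none."""
    dirs = [p for p in workspace.glob("iteration-*") if p.is_dir()]
    return max(dirs, key=lambda p: p.stat().st_mtime, default=None)


def run_mode(workspace: Path, request: str, scoped: bool = False) -> str:
    """Classify a request as 'partial', 'resume', or 'initial'."""
    if scoped:
        # The user asked to redo just one mode or one skill.
        return "partial"
    wants_resume = any(phrase in request.lower() for phrase in RESUME_PHRASES)
    if wants_resume and latest_iteration(workspace) is not None:
        return "resume"
    # No prior iteration to build on, or a fresh target: baseline first.
    return "initial"
```

Note that a "continue" request with no existing iteration dir falls back to `initial`, since there is no baseline to resume from.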

Phase 1 — baseline

If no usable baseline exists for the target: run the run-iteration-eval skill (sync cache, then a full or mode-scoped run) to get summary.json. This is the number every later iteration is judged against. Record its TAG.

Phase 2 — diagnose

Spawn eval-failure-analyzer (Agent, model: "opus") pointed at the baseline iteration dir. It returns the prioritized failure modes, the exact failing eval IDs, and concrete edit proposals. Pick the single mode with the highest failure-count × ease-of-fix score for this iteration. Fix one mode per iteration: batching edits leaves the verify step unable to attribute a regression to a specific edit.
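
One way to make the failure-count × ease-of-fix ranking concrete is sketched below. The `FailureMode` record and the 0-to-1 `ease_of_fix` scale are assumptions for illustration; the analyzer's real report format is not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class FailureMode:
    """Hypothetical shape of one entry in the analyzer's report."""
    name: str
    failing_eval_ids: Tuple[str, ...]
    ease_of_fix: float  # assumed scale: 0.0 (hard) to 1.0 (trivial)


def pick_mode(modes: Sequence[FailureMode]) -> Optional[FailureMode]:
    """Return the single mode with the highest failure-count x ease-of-fix score."""
    if not modes:
        return None
    return max(modes, key=lambda m: len(m.failing_eval_ids) * m.ease_of_fix)
```

For example, a mode with three failing evals at ease 0.9 (score 2.7) outranks a mode with four failing evals at ease 0.3 (score 1.2), so the cheaper fix is attempted first.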

Phase 3 — edit

Spawn skill-editor (Agent, model: "opus") with the chosen proposal. It applies one minimal, generalized edit and reports what it touched + its risk note. If it refuses (the proposal needs a grader/assertion change), drop that proposal and pick another mode — never relax the grader.

Phase 4 — sync cache (gate)

Run sync-skill-cache. If it reports DRIFT or a missing cache, stop the loop and surface it — an unsynced eval grades stale content and wastes the run. Do not proceed to Phase 5 until it prints OK.
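
The gate can be pictured as a check that raises unless the sync step clearly succeeded. The sketch below assumes the sync output is plain text whose last non-empty line is `OK` on success and which contains the word `DRIFT` on drift; that output format, along with the names `CacheNotSynced` and `require_synced`, is an assumption and not the documented behavior of sync-skill-cache.

```python
class CacheNotSynced(RuntimeError):
    """Raised when the eval must not run because the cache is not confirmed synced."""


def require_synced(sync_output: str) -> None:
    """Hard-stop gate: return only if the (assumed) sync output ends with OK."""
    lines = [ln.strip() for ln in sync_output.splitlines() if ln.strip()]
    if any("DRIFT" in ln for ln in lines):
        raise CacheNotSynced("cache drift reported; fix the cache before evaluating")
    if not lines or lines[-1] != "OK":
        # Empty output or any other status is treated as a missing cache.
        raise CacheNotSynced("sync did not end with OK; refusing to run the eval")
```

Raising instead of returning a boolean makes it harder for a caller to ignore the result and proceed to Phase 5 anyway.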

Phase 5 — re-eval

Run run-iteration-eval scoped to the affected mode's case IDs (cheap) for a fast read; widen to a full run before a final SHIP decision. New summary.json, new TAG.

Phase 6 — verify

Spawn iteration-guard (Agent, model: "opus") with the baseline and candidate iteration dirs + the editor's risk note. Act on its verdict:

  • SHIP → keep the edit; the candidate becomes the new baseline.
  • ROLLBACK → revert the edit (it named which one); baseline unchanged.
  • RERUN → the move is inside variance; rerun the affected cases 2–3× (Phase 5) and re-verify before deciding.
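
The three verdicts above amount to a small state transition on the orchestrator's baseline. The sketch below is one possible reading: `LoopState`, the returned action strings, and the `max_reruns` cap of 3 are hypothetical choices, with the cap chosen to match the "2–3×" guidance.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SHIP = "SHIP"
    ROLLBACK = "ROLLBACK"
    RERUN = "RERUN"


@dataclass
class LoopState:
    """Hypothetical orchestrator state: the current baseline TAG and a rerun counter."""
    baseline_tag: str
    reruns: int = 0


def apply_verdict(state: LoopState, verdict: Verdict, candidate_tag: str,
                  max_reruns: int = 3) -> str:
    """Update state for a guard verdict and return the next action to take."""
    if verdict is Verdict.SHIP:
        state.baseline_tag = candidate_tag  # candidate becomes the new baseline
        state.reruns = 0
        return "keep"
    if verdict is Verdict.ROLLBACK:
        state.reruns = 0  # baseline unchanged; the orchestrator reverts the edit
        return "revert"
    if state.reruns >= max_reruns:
        return "inconclusive"  # persistent variance: report it, do not force a decision
    state.reruns += 1
    return "rerun"
```

Keeping the prior baseline TAG untouched on ROLLBACK and RERUN means the earlier iteration dir stays the reference point for any later comparison.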

Phase 7 — iterate or report

If the verdict is SHIP, more modes remain, and the score isn't at target → loop back to Phase 2 on the next mode. Otherwise produce the iteration report (迭代报告, written in Simplified Chinese): baseline→final overall score plus logic/format subscores, the per-iteration Fix Log (mode, edit, verdict), and any mode left unresolved, with the reason.

After reporting, offer Phase 7 evolution (harness skill): if the same failure mode recurs across loops, or the editor keeps refusing the same proposal, propose a harness change (a new disambiguation rule in the editor's principles, a new agent) and log it in CLAUDE.md's change history (변경 이력).

Data-passing protocol

  • File-based (durable handoff): all run artifacts live in skills-workspace/iteration-<TAG>/; agents read these dirs directly. Never delete a prior iteration dir — it is the rollback reference and the audit trail.
  • Return-value based (control flow): each agent returns its report to this orchestrator, which decides the next step. Agents do not call each other.

Error handling

1-retry then proceed-with-note. Specifically:

  • Cache sync fails → hard stop (never eval stale content). Report and fix the cache, don't skip.
  • An eval case errors (claude call fails) → the runner isolates it; re-run just that case once, then proceed and note the missing case in the report rather than blocking the whole loop.
  • Guard says RERUN repeatedly (persistent variance) → report the move as "within noise floor, inconclusive" rather than forcing a SHIP/ROLLBACK; widen the case set instead of trusting one run.
  • Conflicting signals (logic up, format down) → do not average them away; report both subscores with their sources and let the user weigh, per the project's "logic = primary / format = gate" split.
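
The "1-retry then proceed-with-note" rule for a failing eval case might look like the sketch below. `run_case` stands in for whatever actually executes one case, and `notes` for the list that later feeds the report; both are hypothetical, as is the function name.

```python
from typing import Any, Callable, List, Optional


def run_case_with_retry(case_id: str,
                        run_case: Callable[[str], Any],
                        notes: List[str]) -> Optional[Any]:
    """Run one eval case, retrying once; on a second failure, note it and move on."""
    last_error: Optional[Exception] = None
    for _attempt in range(2):  # the first try plus exactly one retry
        try:
            return run_case(case_id)
        except Exception as exc:  # the runner isolates the failing case
            last_error = exc
    notes.append(f"case {case_id} missing after one retry: {last_error}")
    return None
```

This applies only to individual eval cases. A failed cache sync stays a hard stop and should not be routed through a retry-and-continue helper like this one.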

Test scenarios

Normal flow: User: "logic-review's format compliance keeps failing; iterate and fix it." → Phase 0 initial → Phase 1 baseline (format subscore low) → analyzer flags the format-label mode + eval IDs → editor sharpens the literal label in the SKILL.md skeleton → sync OK → re-eval affected cases → guard: format up, logic flat, no collateral → SHIP → report with before/after subscores.

Error flow: The editor's proposed fix requires loosening a grader assertion → editor refuses in Phase 3 → orchestrator drops that proposal, picks the next mode from the analyzer's list, and continues; the grader is never touched. If instead the cache sync prints DRIFT in Phase 4, the loop halts and reports the drift before spending any eval tokens.

Install via CLI
npx skills add https://github.com/hyhmrright/logic-lens --skill iterate-skill