name: agent-rules
description: Use when you operate coding-agent infrastructure with a feedback signal (eval suites, run-level telemetry, A/B baselines, or a skill catalog under test) and want to capture observed agent failures in a per-file reflection log, promote recurring patterns into AGENTS.md rules, hooks, or CI gates via the three-entry floor, and score Level 4-5 (Specification Architecture, Sovereign Engineering) maturity. Triggers on 'reflection log', 'agent failure log', 'promote this pattern into a rule', 'post-incident eval case', 'harness maturity Level 4 or 5', 'set up a feedback loop for agent failures'. Do NOT use for first-pass agent-readiness scaffolding when no feedback signal exists yet (use agent-readiness), or to instrument an AI product's eval and optimization loops (use agent-evals).
license: MIT
Evidence-Driven Agent Rules
Capture observed agent failures, promote the recurring ones into rules, and score
advanced (Level 4–5) agent-readiness. For repos with a feedback signal: eval suites,
run-level telemetry, A/B baselines, or a skill catalog under test. No signal yet — no
benchmark, no telemetry baseline, no "we watched the agent do X" stream? Use
agent-readiness alone; promotion here needs evidence to drive it.
Produces: capture writes the reflection-log scaffold (docs/reflection-log/README.md,
_template.md, and a repo-root README.md §Agents pointer); promote proposes the
smallest rule / hook / CI-gate diff plus the entries it cites; assess-l4l5 scores
Level 4–5 maturity with a ceiling and prioritized gaps. Tracked assess-l4l5 runs also
emit agent-rules-findings-ledger-<date>-<slug>.md and agent-rules-workflow-state-<date>-<slug>.json.
Core principle
Single observations are cheap to record; rules cost trust to ship. Recording is
low-friction and always worth it once you can write a non-trivial What to do
differently line. Promotion — turning logged entries into an AGENTS.md rule, hook, or
CI gate — is gated on three or more entries describing the same gap, because
scaffolding from one observation overfits to plausible boilerplate (W1, Mündler et al.,
arXiv:2602.11988, Feb 2026: autogenerated context files drop task success ~3% and inflate
cost >20%). The recording bar and the promotion bar are deliberately different; the
earlier single-file log conflated them and agents self-filtered entries worth keeping.
When in doubt, record; promote searches the log later.
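The difference between the two bars can be sketched in a few lines. This is a minimal illustration, not the skill's implementation: the function name `promotable_groups`, the entry filenames, and the gap labels are all hypothetical, and entries are assumed to be already reduced to (filename, gap) pairs.

```python
from collections import defaultdict

PROMOTION_FLOOR = 3  # "three or more entries describing the same gap"


def promotable_groups(entries, floor=PROMOTION_FLOOR):
    """Group (filename, gap) pairs by gap; keep only groups that meet the floor."""
    groups = defaultdict(list)
    for filename, gap in entries:
        groups[gap].append(filename)
    return {gap: sorted(files) for gap, files in groups.items() if len(files) >= floor}


# Hypothetical log: every entry is recorded, but only one gap clears the floor.
entries = [
    ("0001-skipped-tests.md", "test-discipline"),
    ("0002-wrong-branch.md", "git-hygiene"),
    ("0003-skipped-tests-again.md", "test-discipline"),
    ("0004-tests-not-run.md", "test-discipline"),
]
ready = promotable_groups(entries)
```

Here the single git-hygiene entry stays in the log, where a later promote run can find it, but it does not yet justify a rule.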
Quickstart
Begin with capture to scaffold the log:
Run capture: scaffold the reflection log for this repo.
It loads references/playbooks/reflection-log.md, writes docs/reflection-log/README.md
and _template.md, and adds a repo-root README.md §Agents pointer so the log is
discoverable from an always-loaded surface. Precondition: if AGENTS.md exists but
does not link the log, capture refuses. Add one line under its references section and
re-run:
- [docs/reflection-log/](./docs/reflection-log/) — per-failure entries; rules in this file may cite them.
promote and assess-l4l5 come later: they assume Stage 0 (this log) and Stage 1
(AGENTS.md from agent-readiness) are already in place. Full staging in
references/bootstrap-order.md.
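The capture precondition can be pictured as a small check. This is a sketch under stated assumptions: `capture_precondition` is a hypothetical name, and "links the log" is approximated as a plain substring match on the log directory path, which may be looser or stricter than what capture actually does.

```python
from pathlib import Path

LOG_DIR = "docs/reflection-log"  # log location named in the text above


def capture_precondition(repo_root):
    """Return (ok, reason). Refuse when AGENTS.md exists but never mentions the log."""
    agents = Path(repo_root) / "AGENTS.md"
    if not agents.exists():
        return True, "no AGENTS.md yet; nothing to link"
    # Assumption: a substring match stands in for "AGENTS.md links the log".
    if LOG_DIR in agents.read_text(encoding="utf-8"):
        return True, "AGENTS.md links the log"
    return False, f"AGENTS.md exists but does not link {LOG_DIR}/; add the pointer and re-run"
```

Returning a reason alongside the boolean keeps the refusal actionable: the caller can print exactly which line to add before re-running.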
Activation
- Bare invocation ("set up reflection log", "use agent-rules"): show the intent menu (capture / promote / assess-l4l5) inline and wait. No file inspection, network calls, or writes. This skill's scope is narrow enough that the SKILL.md body is the router; there is no separate CSV.
- Concrete invocation with intent inferable: skip to Workflow step 2.
- Concrete invocation with ambiguous scope: ask one blocker question to fix the intent first; do not inspect private systems before then.
Workflow
1. Pick intent. capture (scaffold the log + README pointer), promote (find three same-gap entries and propose the rule/hook/gate that closes them), or assess-l4l5 (score Level 4–5 maturity). Ambiguous → ask once.
2. Load context. Always: references/empirical-warnings-w1.md and references/playbooks/reflection-log.md. For promote, read the whole log (docs/reflection-log/[0-9]*.md); if the closing change is a hook, CI check, static validator, or branch-protection rule, also load references/playbooks/gate-hardening.md. For assess-l4l5, also load references/core/maturity-rubric.md.
3. capture. Scaffold docs/reflection-log/README.md and _template.md from templates/artifacts/reflection-log/; add the README.md §Agents pointer if absent; refuse if AGENTS.md exists without a link to the log.
4. promote. Group entries by their sub-surface: field (grep -l 'sub-surface: <name>' docs/reflection-log/[0-9]*.md). For any group of three or more, present the entries, propose the smallest closing change, and confirm before writing. Refuse below the floor. For a gate, require the gate-hardening variant matrix and a regression fixture before calling it done.
5. assess-l4l5. Assume Levels 1–3 are already scored by agent-readiness (confirm with the user); score Levels 4–5 against the rubric; report ceiling and gaps with stable IDs (ED-L4L5-NNN) per references/trackable-findings.md.
6. Emit, citing the entries you drew on (filenames plus their What to do differently lines):
   - capture → files written, the §Agents pointer diff, a validation checklist.
   - promote → the proposed rule/hook/gate, the three-plus entries that justify it, and the AGENTS.md/hook diff for confirmation.
   - assess-l4l5 → maturity score per layer, ceiling, prioritized gaps.
7. Create tracking state. For assess-l4l5 with 7+ gaps, a level-ceiling blocker, or a save/track request, write both: a ledger at docs/audits/agent-rules-findings-ledger-<YYYY-MM-DD>-<scope-slug>.md and workflow state at docs/audits/agent-rules-workflow-state-<YYYY-MM-DD>-<scope-slug>.json (fall back to audit-artifacts/agent-rules-{findings-ledger|workflow-state}-<YYYY-MM-DD>-<scope-slug>.{md|json} if docs/audits/ is unwritable). Report both paths. Roadmaps, issues, and promotion changes still need confirmation.
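The tracking-state trigger and filenames can be sketched as follows. The function names `should_track` and `tracking_paths`, the `audits_writable` flag, and the example slug are hypothetical; how the skill actually detects an unwritable docs/audits/ is not specified here, so the sketch takes it as an input.

```python
from datetime import date
from pathlib import Path


def should_track(gap_count, ceiling_blocker=False, save_requested=False):
    """Trigger: 7+ gaps, a level-ceiling blocker, or an explicit save/track request."""
    return bool(gap_count >= 7 or ceiling_blocker or save_requested)


def tracking_paths(scope_slug, run_date, audits_writable=True):
    """Return (ledger, state) paths, falling back to audit-artifacts/ when needed."""
    # Assumption: the caller has already determined whether docs/audits/ is writable.
    base = Path("docs/audits") if audits_writable else Path("audit-artifacts")
    stamp = f"{run_date.isoformat()}-{scope_slug}"  # <YYYY-MM-DD>-<scope-slug>
    ledger = base / f"agent-rules-findings-ledger-{stamp}.md"
    state = base / f"agent-rules-workflow-state-{stamp}.json"
    return ledger, state


# Hypothetical run: nine gaps on an illustrative scope.
if should_track(gap_count=9):
    ledger, state = tracking_paths("payments-api", date(2026, 3, 1))
```

Both files share one date-and-slug stamp, so a ledger and its workflow state can always be matched by name.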
Modes
Three modes: Guided Draft (default), Autopilot, and Grill Me; the contract is in
references/modes.md. For promote, default to Grill Me:
shipping a rule costs trust, so surface the trade-offs with the user rather than infer them.
Output requirements
Every output cites the reflection-log entries it draws on — filenames plus their
What to do differently lines — names the intent, and carries that intent's load-bearing
section (Workflow step 6). For promote, that includes the three-entry floor and, for a
gate, the gate-hardening variant matrix and a regression fixture; for assess-l4l5,
stable ED-L4L5-NNN IDs.
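A quick sanity check on finding IDs before emitting an assess-l4l5 report might look like this. It is a sketch only: the authoritative ID rules live in references/trackable-findings.md, and reading NNN as exactly three digits is an assumption, as is the helper name `invalid_ids`.

```python
import re

# Assumption: NNN means exactly three digits, e.g. ED-L4L5-007.
FINDING_ID = re.compile(r"ED-L4L5-\d{3}")


def invalid_ids(ids):
    """Return IDs that are malformed or repeated, in input order."""
    seen, bad = set(), []
    for finding_id in ids:
        if not FINDING_ID.fullmatch(finding_id) or finding_id in seen:
            bad.append(finding_id)
        seen.add(finding_id)
    return bad


# Hypothetical report: one malformed ID and one duplicate.
problems = invalid_ids(["ED-L4L5-001", "ED-L4L5-002", "ED-L4L5-2", "ED-L4L5-001"])
```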
Reference map
- references/empirical-warnings-w1.md: the W1 three-entry floor and Mündler citation (owned here).
- references/empirical-warnings.md, references/lenses.md, references/modes.md, references/trackable-findings.md: symlinks into skills/_shared/.
- references/core/maturity-rubric.md: Levels 4–5, extending agent-readiness's 1–3.
- references/playbooks/reflection-log.md: the one sub-surface this skill owns.
- references/playbooks/gate-hardening.md: adversarial variant matrix and fixture requirements for promoted hooks, checks, and CI gates.
- references/bootstrap-order.md: the Stage 0/1 staging dependency.
- templates/artifacts/reflection-log/: README (with the recording-vs-promotion callout) and the per-entry _template.md.
- templates/findings-ledger.md, templates/workflow-state.json: saved tracking for assess-l4l5.
- evals/: static checks, trigger evals, activation cases.
See also
- agent-readiness: scaffolds the project-context AGENTS.md that this skill's promote adds rules to. Pair them.
- agent-experience: the umbrella AX discipline; this is its evidence-and-feedback-loop arm.