agents-improvement-audit - SKILL.md Agent Skill

name: agents-improvement-audit description: "Observes the health of the auto-research loop across all ai-squad agents over time. Reads run logs, eval score trends, accept/reject patterns, source quality, and drift indicators. Produces a digest of operational findings and applies a narrow set of T1 tuning changes (only purely additive — never loosens guards). Escalates everything else as recommendations. Use when triggered by sdlc-orchestrator heuristics, when the user types /agents-improvement-audit, or when the user wants a periodic check on whether the self-improvement mechanism is producing value or noise."

You are running a structured audit of the auto-research loop's health across all ai-squad agents.

This is meta-observation: you do not critique any agent's domain output. You observe whether the SELF-IMPROVEMENT MECHANISM is producing value, regressing, drifting, or wasting cycles. The skill is intentionally conservative — most findings escalate to the user. Only purely additive query tuning is auto-applied.

Inputs

The user invokes with:

/agents-improvement-audit — full audit (default window: last 30 days)
/agents-improvement-audit --since=YYYY-MM-DD — audit since a specific date

Pre-flight

Stop and report if any check fails.

~/.claude/agents/ is a git repo (required for the auto-research mechanism)
~/.claude/logs/auto-research/ exists with at least 5 total run logs across all agents — auditing too little data produces noise, not signal. If under 5, return "insufficient data — wait for more auto-research runs" and exit cleanly
Compute window start (--since arg or default 30 days ago)
The working tree of ~/.claude/agents/ is clean enough to make focused commits — if dirty with unrelated changes, ask before proceeding

What you read

For each agent in ~/.claude/agents/*.md whose Auto-Research Scope > enabled is true:

Log files in ~/.claude/logs/auto-research/ matching the agent within the window
Git history of the agent file: git -C ~/.claude/agents log --since={start} --pretty=format:'%H %s' -- {agent}.md
Current state of the agent file (size, section structure, current Auto-Research Scope topics)

What you analyze

A. Run statistics per agent

Total runs, commits (auto-commit), proposes (uncommitted left for user), reverts, no-ops
Average baseline score, average post-edit score
Score trend: rising / flat / falling
Time since last run (gaps in scheduling are themselves a signal)

B. Topic productivity

For each topic in the agent's Auto-Research Scope > topics:

Percentage of runs that produced eligible findings from this topic
Topics empty for 5+ consecutive runs → tuning candidate (T1)
Topics empty for 15+ consecutive runs → disable candidate (T3, escalate)

C. Source quality

Domain-frequency of cited sources across the window
Authoritative sources (OWASP, NIST, MITRE, official vendor docs, RFCs, accepted academic venues) vs casual (blog posts, social media, marketing)
Flag any agent where >40% of recent sources are non-authoritative (T3, escalate)

D. Drift indicators

Agent file size growth rate (lines added per week)
Constraint violations attempted per log (rejected by the auto-research skill itself)
Editable section size growth — proxy for mandate scope creep (T3, escalate if any section grew > 50% in window)

E. Cross-agent signals

Same source URL cited convergently by 2+ agents → convergence (good — surface as positive finding)
Edits in domains where agent mandates overlap producing contradictory guidance → ALERT (T3, escalate with file:line refs)

F. Eval health

Eval cases that fail reproducibly across runs → stale or actually broken (T3, escalate for review)
Cases where post-edit score consistently >= baseline + 0.1 across 5+ runs → opportunity to raise threshold (T3, escalate)
Cases where baseline score occasionally drops below the agent's pass_threshold → instability signal (T3, escalate)

T1 actions (auto-apply, narrow scope)

Only these mutations. They never loosen any guard.

T1.A — Broaden empty topic queries

If a topic has been empty for 5+ consecutive runs but its declared why is still relevant to the agent's domain, add a more general query alongside the existing narrow ones. Do not remove queries. Cite the rationale in the commit message.

T1.B — Add a complementary query

If your audit surfaced a query pattern that's been productive in adjacent topics or other agents' related topics, add it to the underperforming topic. Additive only.

Application rules

For every T1 change:

Make the edit in the agent file via the Edit tool
Commit immediately: git -C ~/.claude/agents add {agent}.md && git commit -m "agents-improvement-audit: T1 — {what} ({why-1-line})"
Hard cap: 3 T1 changes per run — the rest queue for next audit

T3 escalation list (write to digest, do NOT apply)

Collect these into the digest. Each entry includes file:line refs (when applicable) and a rationale.

Threshold change candidates (eval scores justify raising pass_threshold)
Policy promotion candidates (agent ran propose for 5+ runs without user rollback)
Topic disable candidates (empty for 15+ runs)
Topic split candidates (one topic produces too many overlapping findings)
Eval case review candidates (failing reproducibly with a consistent pattern)
Cross-agent contradictions
Source quality concerns
Mandate scope creep alerts

Never auto-apply any of these. They affect the safety mechanism itself; bad autonomous changes here would silently degrade quality across all agents.

Versioning

Before any T1 mutation:

git -C ~/.claude/agents tag pre-agents-improvement-audit-{YYYYMMDD-HHMM}

After all T1 mutations (even if 0):

git -C ~/.claude/agents tag post-agents-improvement-audit-{YYYYMMDD-HHMM}

Document rollback in the digest output, exact command included.

Output

Write a digest to ~/.claude/logs/agents-improvement-audit/{YYYY-MM-DD}.md:

---
date: YYYY-MM-DD
window_start: YYYY-MM-DD
agents_audited: N
total_runs_in_window: N
t1_applied: M
t3_escalated: K
checkpoint_pre: pre-agents-improvement-audit-{ts}
checkpoint_post: post-agents-improvement-audit-{ts}
---

## Per-agent summary
| agent | runs | commits | proposes | reverts | no-ops | baseline avg | post-edit avg | trend |
|---|---|---|---|---|---|---|---|---|
| ... | | | | | | | | |

## Topics needing attention
| agent | topic | runs empty | recommendation | tier |
|---|---|---|---|---|

## Source quality concerns
[per-agent breakdown when authoritative-source ratio < 60%]

## Drift indicators
[agents with file size or section growth above threshold]

## Cross-agent signals
### Convergence (positive)
- [source URL] cited by [agent A, agent B] in [topic A, topic B] — converging on [observation]

### Contradiction (alert)
- [agent A {file:line}] vs [agent B {file:line}] — [contradiction description]

## Eval health
[per-agent eval observations]

## T1 applied (auto-committed)
- {agent}: {what} — {commit sha}

## T3 escalations (require your decision)
### {kind} — {agent}
**Detail:** [refs]
**Recommendation:** [specific action]
**Rationale:** [why now]

## Rollback
git -C ~/.claude/agents reset --hard pre-agents-improvement-audit-{ts}

After writing the digest, surface a 5-line summary in the conversation:

agents-improvement-audit complete:
- N agents audited, M T1 changes auto-applied, K T3 items escalated
- {brief health summary in one sentence}
- Top escalation: [the most important one in 1 line]
- Full digest: ~/.claude/logs/agents-improvement-audit/{date}.md
- Rollback if needed: {git command}

Hard rules

Never edit anything other than topics[].queries in T1 mode
Never disable a topic, change a threshold, promote a policy, or edit an eval case in T1 — these always escalate
Never edit any section other than the agent's Auto-Research Scope block
Never remove existing queries — only add to them
Always include rationale (why this tuning) in the commit message
Always cite file:line references for every T3 escalation that points at specific content
If insufficient data (< 5 total runs across all agents), exit cleanly without creating a digest — auditing 1-2 runs produces noise

Failure modes to watch

All agents trending flat at perfect score: likely the eval cases are too easy, not that the agents are perfect. Surface as T3.
Consistent reverts on one agent: the eval is doing its job, but research is producing low-quality findings. Surface the topic-level rejection patterns as T3.
No findings ever from a topic that gets queries every day: queries may be over-specified or the domain is genuinely stable; T1 broaden once, then if still empty, T3 disable.