name: harness-audit
description: >-
  Analyze review agent effectiveness, model routing, and orchestration complexity
  against actual usage data. Produces a report of harness components that may be
  candidates for simplification or removal. Use periodically to prevent harness
  staleness as model capabilities improve. Audits the dev-team plugin's OWN
  harness from runtime metrics, not your project repo's readiness (for that,
  use /agent-readiness).
argument-hint: "[--output <path>]"
user-invocable: true
allowed-tools: Read, Glob, Grep, Bash(date *), Write
Harness Audit
Role: orchestrator. This command analyzes harness effectiveness — it does not modify agents or configuration.
You have been invoked with the /harness-audit command.
Not /agent-readiness. This audits the dev-team plugin's own harness (review-agent effectiveness, model tiers, orchestration) from accumulated runtime metrics in metrics/. By contrast, /agent-readiness scores the subject repository's readiness for AI-assisted development from a static checkout. Different subject, different input, different output. The inward-facing companion here is /session-review, whose session-digest.jsonl this command consumes (Step 1).
Orchestrator constraints
- Do not modify agents or configuration. Produce a report only. All remediation requires human action.
- Write the report to a file. Present only the summary table and next-steps in chat — do not repeat the full report.
- Be concise. Use tables and short sentences. No preambles, no filler.
Parse Arguments
Arguments: $ARGUMENTS
--output <path>: Write the report to a specific path. Default: reports/harness-audit-<date>.md
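The default can be illustrated with a minimal sketch. The helper name `resolve_output_path` is hypothetical, and rendering `<date>` as YYYY-MM-DD is an assumption; the command does not state the date format.

```python
import shlex
from datetime import date


def resolve_output_path(raw_args: str, today: date) -> str:
    """Return the --output value if present, else the dated default path."""
    tokens = shlex.split(raw_args)
    if "--output" in tokens:
        idx = tokens.index("--output")
        if idx + 1 < len(tokens):
            return tokens[idx + 1]
    # Assumption: <date> is rendered in ISO form (YYYY-MM-DD).
    return f"reports/harness-audit-{today.isoformat()}.md"
```

A bare `--output` with no path falls back to the default here; rejecting it with an error would be an equally reasonable choice.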
Steps
1. Check for metrics data
Read metrics JSONL files from metrics/. Three complementary streams exist:
- metrics/*-task-log.jsonl: self-reported task logs (whatever the model chose to record about itself).
- metrics/session-digest.jsonl: ground-truth real-session digests from /session-review (#129): token/cost trends, rework/accuracy counts, and utilization.never_observed_*. Prefer this where it disagrees with the self-reports, and use never_observed_* to corroborate stale-component flags. Schema + join: see docs/eval-system.md → "Session-review trend digest".
- metrics/artifact-usage.json: per-artifact usage index written by the telemetry hook on each Skill invocation. Use last_used_at to identify artifacts that have never been observed (absent from the index) or are stale (absent or last_used_at > 30 days ago). Cross-reference with never_observed_* in session-digest.jsonl for corroboration. See knowledge/artifact-lifecycle.md for the lifecycle threshold definitions.
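The staleness check on the usage index might look like the sketch below. It assumes the index is a JSON object mapping artifact name to an entry with an ISO-8601 `last_used_at` string; that shape, the `known` set of artifact names, and the function name are assumptions, not the documented schema. Never-observed artifacts are listed separately here, although the rule above also counts them as stale.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)  # threshold quoted in the text above


def classify_artifacts(known, usage_index, now):
    """Split known artifacts into (never_observed, stale) name lists."""
    never_observed, stale = [], []
    for name in sorted(known):
        entry = usage_index.get(name)
        if entry is None:
            never_observed.append(name)  # absent from the index
            continue
        # Assumption: timestamps carry an explicit UTC offset.
        last_used = datetime.fromisoformat(entry["last_used_at"])
        if now - last_used > STALE_AFTER:
            stale.append(name)
    return never_observed, stale
```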
If no metrics data exists or insufficient data is available (fewer than 10 review runs logged), report:
"Insufficient metrics data for a meaningful audit. Run the system for a period to accumulate review data, then re-run /harness-audit. Minimum: 10 logged review runs."
List what data is missing and exit.
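A minimal sketch of the data gate, assuming each JSONL record that survives loading counts as one review run (the text does not define what a "review run" record looks like). The helper names are hypothetical.

```python
import json
from pathlib import Path

MIN_REVIEW_RUNS = 10  # minimum stated above


def load_jsonl(path):
    """Parse one JSON object per non-empty line; a missing file yields []."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def has_enough_data(review_runs):
    """True when the audit has at least the minimum number of review runs."""
    return len(review_runs) >= MIN_REVIEW_RUNS
```

Returning an empty list for a missing file lets the caller report every absent stream in one pass instead of stopping at the first.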
2. Analyze review agent effectiveness
For each review agent in the registry (knowledge/agent-registry.md):
- Finding rate: How often does this agent produce findings (fail or warn) vs. pass?
- Zero-fail agents: Flag agents that have never returned fail across all logged reviews. These are removal candidates: they may not be catching real issues.
- False positive rate: If correction data exists (from /apply-fixes), check how often findings were dismissed vs. applied. Agents with >50% dismissed findings have a high false positive rate.
- Finding severity distribution: Is the agent producing mostly minor findings? If >80% of findings are minor severity, consider whether the agent justifies its token cost.
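The four checks above can be sketched as one pass over review records. The record shape (`agent`, `verdict`, and a `findings` list with `severity` and `resolution`) is an assumption made for illustration; the real task-log schema may differ.

```python
from collections import defaultdict


def agent_effectiveness(reviews):
    """Compute per-agent finding rate and the three flags described above."""
    stats = defaultdict(lambda: {"reviews": 0, "with_findings": 0, "fails": 0,
                                 "findings": 0, "minor": 0,
                                 "resolved": 0, "dismissed": 0})
    for r in reviews:
        s = stats[r["agent"]]
        s["reviews"] += 1
        if r["verdict"] in ("fail", "warn"):
            s["with_findings"] += 1
        if r["verdict"] == "fail":
            s["fails"] += 1
        for f in r.get("findings", []):
            s["findings"] += 1
            s["minor"] += f.get("severity") == "minor"
            # Only findings with a recorded resolution count toward the rate.
            if f.get("resolution") in ("applied", "dismissed"):
                s["resolved"] += 1
                s["dismissed"] += f["resolution"] == "dismissed"
    return {
        agent: {
            "finding_rate": s["with_findings"] / s["reviews"],
            "zero_fail": s["fails"] == 0,
            "high_false_positive": s["resolved"] > 0
                                   and s["dismissed"] / s["resolved"] > 0.5,
            "mostly_minor": s["findings"] > 0
                            and s["minor"] / s["findings"] > 0.8,
        }
        for agent, s in stats.items()
    }
```

Agents with no correction data get `high_false_positive: False` rather than an error, which matches the "if correction data exists" condition.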
3. Analyze review-value fix rates
Read metrics/review-value.jsonl (written by /build per #348, schema in performance-metrics). If the file is absent, note it and continue — this section is skippable.
For each checkpoint type (the checkpoint field: step or slice) and each agent combination (agents_run list, treated as a set-key), compute:
# Group runs by checkpoint type plus the sorted agent set, then summarize each group.
[ -f metrics/review-value.jsonl ] && jq -s '
  group_by(.checkpoint + "|" + (.agents_run | sort | join(",")))
  | map({
      checkpoint:     .[0].checkpoint,
      agents:         (.[0].agents_run | sort | join(", ")),
      total:          length,
      no_op:          (map(select(.outcome=="no-op")) | length),
      fixed:          (map(select(.outcome=="fixed")) | length),
      escalated:      (map(select(.outcome=="escalated")) | length),
      fix_rate:       ((map(select(.outcome=="fixed")) | length) / length * 100 | round),
      issues_found:   (map(.issues_found) | add // 0),
      issues_fixed:   (map(.issues_fixed) | add // 0),
      fix_iterations: (map(.fix_iterations) | add // 0)
    })' \
  metrics/review-value.jsonl
Flag drop candidates: any checkpoint+agents combination with fix_rate == 0 across N ≥ 5 logged runs is a drop candidate — it consistently adds overhead without catching defects.
Flag high-value checkpoints: fix_rate ≥ 50% — these are earning their cost and should be retained.
Drop-candidate recommendations (P2-S3): For each drop candidate emit a recommendation in this form:
<checkpoint>/<agents> fixed 0/runs (fix rate 0%): candidate to drop. To act: remove this checkpoint type from the relevant /build step-complexity tier or exclude these agents from the checkpoint's dispatch list. Do not auto-edit skills; present for human decision.
Do not modify any skill or agent file. The report is the only artifact.
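The flagging rules can be sketched in Python as a companion to the jq aggregation. It uses only the `checkpoint`, `agents_run`, and `outcome` fields named above; the function name and label format are hypothetical. One difference worth noting: this sketch tests the raw fixed count for zero, whereas a rounded `fix_rate` could read 0 for a group with a very small non-zero rate.

```python
from collections import defaultdict

MIN_RUNS = 5          # N >= 5 logged runs
HIGH_VALUE_RATE = 50  # percent


def classify_checkpoints(records):
    """Return (drop_candidates, high_value) labels per checkpoint+agent set."""
    groups = defaultdict(list)
    for rec in records:
        # Treat agents_run as a set-key: order in the log does not matter.
        key = (rec["checkpoint"], tuple(sorted(rec["agents_run"])))
        groups[key].append(rec)

    drop, high_value = [], []
    for (checkpoint, agents), runs in sorted(groups.items()):
        fixed = sum(1 for r in runs if r["outcome"] == "fixed")
        fix_rate = fixed / len(runs) * 100
        label = f"{checkpoint}/{','.join(agents)}"
        if fixed == 0 and len(runs) >= MIN_RUNS:
            drop.append(label)
        elif fix_rate >= HIGH_VALUE_RATE:
            high_value.append(label)
    return drop, high_value
```

Groups with a zero fix rate but fewer than five runs land in neither list, since the sample is too small to act on.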
4. Analyze model routing
For each agent listed in knowledge/agent-registry.md (with model tier from its model: frontmatter, resolved via the PreToolUse hook per agents/orchestrator.md → Resolution Procedure):
- Over-tiered agents: Agents assigned to opus that consistently produce simple pattern-match findings may work equally well on sonnet or haiku.
- Under-tiered agents: Agents on haiku that frequently miss issues caught by human review may need a higher tier.
- Cost distribution: Which agents consume the most tokens? Are the most expensive agents also the most valuable?
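One possible way to approach the cost question is to rank agents by tokens spent per applied finding. Using applied findings as the measure of value is an assumption, as are the `tokens` and `applied_findings` field names; the ranking is a prompt for human judgment, not a tier decision.

```python
def rank_by_cost_per_applied_finding(agents):
    """Sort agents from most to least expensive per applied finding."""
    def cost(agent):
        if agent["applied_findings"] == 0:
            return float("inf")  # tokens spent, nothing applied
        return agent["tokens"] / agent["applied_findings"]

    return sorted(agents, key=cost, reverse=True)
```

Agents at the top of the ranking that also sit on the highest tier are the natural first candidates for an over-tiering review.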
5. Analyze orchestration complexity
Review the current pipeline for components that may be unnecessary overhead:
- Phase count: Are all three phases (Research, Plan, Implement) needed for the types of tasks being run? If most tasks are simple, suggest a fast path.
- Review checkpoint frequency: Are inline reviews running on every step? If most steps are trivial, the complexity classification (see skills/plan/SKILL.md § Complexity Classification) should be catching this.
- Unused skills: Skills loaded but never applied in logged sessions.
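The unused-skills check reduces to a set difference across sessions. The `skills_loaded` and `skills_applied` fields below are hypothetical; substitute whatever the session logs actually record.

```python
def unused_skills(sessions):
    """Skills loaded in at least one session but never applied in any."""
    loaded, applied = set(), set()
    for session in sessions:
        loaded.update(session.get("skills_loaded", []))
        applied.update(session.get("skills_applied", []))
    return sorted(loaded - applied)
```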
6. Produce report
Write the report to the output path using this structure:
# Harness Audit Report
**Date**: <date>
**Metrics period**: <earliest to latest logged review>
**Review runs analyzed**: <count>
## Review Agent Effectiveness
### Removal Candidates (zero fail findings)
| Agent | Reviews | Pass rate | Recommendation |
|-------|---------|-----------|----------------|
### High False Positive Rate (>50% dismissed)
| Agent | Findings | Dismissed | Rate | Recommendation |
|-------|----------|-----------|------|----------------|
### Low-Value Agents (>80% minor severity)
| Agent | Findings | Minor % | Recommendation |
|-------|----------|---------|----------------|
## Review-Value Fix Rates (inline checkpoint ROI)
> Source: `metrics/review-value.jsonl`. Absent = no `/build` runs logged yet.
### Per-Checkpoint-Type Fix Rates
| Checkpoint | Agents | Runs | No-op | Fixed | Escalated | Fix rate |
|------------|--------|------|-------|-------|-----------|----------|
### Drop Candidates (fix rate 0%, N ≥ 5 runs)
| Checkpoint | Agents | Runs | Recommendation |
|------------|--------|------|----------------|
> To act on a drop candidate: remove the checkpoint type from the relevant `/build`
> step-complexity tier or exclude the agents from that checkpoint's dispatch list.
> Requires human decision — do not auto-edit skills.
### High-Value Checkpoints (fix rate ≥ 50%)
| Checkpoint | Agents | Runs | Fix rate | Issues fixed |
|------------|--------|------|----------|--------------|
## Model Routing Recommendations
| Agent | Current tier | Suggested tier | Rationale |
|-------|-------------|----------------|-----------|
## Orchestration Simplification Opportunities
- <Finding and recommendation>
## Summary
- Agents to consider removing: <count>
- Model tier changes suggested: <count>
- Orchestration simplifications: <count>
- Review-value drop candidates: <count>
- Review-value high-value checkpoints: <count>
## Next Steps
<Actionable recommendations prioritized by impact>
7. Present results
Display a summary of the report and the file path. Do not repeat the full report in chat — the file is the artifact.
Error Handling
- Missing metrics files: Report what's missing, suggest how to generate data
- Incomplete agent registry: Flag agents found in metrics but missing from the registry
- No actionable findings: Report that the harness appears well-calibrated — this is a valid outcome