name: performance-metrics description: Log task completion data to metrics/. Use at the end of every task to record tokens, cost, agents used, rework cycles, and hallucination events. Also use for periodic reporting to identify efficiency and quality trends. role: worker user-invocable: true
Performance Metrics
Overview
Schema and procedures for capturing performance data in metrics/. Metrics enable evidence-based evaluation of agent effectiveness, cost efficiency, and quality outcomes.
Constraints
- Never log credentials, API keys, or PII in metric entries
- Log entries are append-only; do not modify or delete existing JSONL records
- Log at task completion, not mid-task; mid-task state belongs in
memory/progress files - Use the defined JSONL schema; do not invent new top-level fields without updating the reference
Metric Categories
Targets discipline. Per CLAUDE.md → "Claims discipline", a numeric target may only ship if an instrument measures it. The instrumented metrics below cite their sensor; the rest read "Aspirational — no sensor yet" until one exists (tracked by #102 cost metering and #106 telemetry). Do not reintroduce bare numeric targets —
tests/docs/prose_honesty_test.batsenforces this across all shipped prose.
Efficiency Metrics
| Metric | Description | Target |
|---|---|---|
| Task completion time | Wall-clock time from request to delivery | Track trend, no fixed target |
| Token usage per task | Total input + output tokens consumed | Minimize for comparable quality |
| Agent loading overhead | Tokens spent on agent/skill file reads | Aspirational — no sensor yet |
| Context summarization frequency | How often summarization triggers per task | Aspirational — no sensor yet |
Quality Metrics
| Metric | Description | Target |
|---|---|---|
| First-pass acceptance rate | Tasks accepted without rework | Aspirational — no sensor yet |
| Rework count | Number of revision cycles per task | Aspirational — no sensor yet |
| Hallucination incidents | Outputs containing fabricated information | Aspirational — no sensor yet |
| Accuracy score | Correctness of structured data extraction | Aspirational — no sensor yet |
| Test coverage | Percentage of code covered by generated tests | Track per project |
Cost Metrics
| Metric | Description | Target |
|---|---|---|
| Cost per task | Total API cost (input + output tokens at rate) | Track trend |
| LLM routing ratio | Percentage of tasks routed to each LLM | Track distribution |
| Selective loading savings | Tokens saved vs. loading all agents | Aspirational — no sensor yet |
Log Format
Metrics are stored in metrics/ as JSONL files (one JSON object per line).
File Naming
metrics/{date}-task-log.jsonl
Example: metrics/2026-02-20-task-log.jsonl
Task Completion Entry
Logged at the end of each task:
{
"timestamp": "2026-02-20T14:30:00Z",
"task_id": "unique-id",
"task_type": "implementation",
"task_description": "Build REST API for user authentication",
"agents_used": ["software-engineer", "architect"],
"skills_used": ["hexagonal-architecture"],
"tokens": {
"input": 12500,
"output": 3200,
"total": 15700
},
"cost_usd": 0.043,
"llm": "opus",
"context_summarizations": 0,
"phases": 2,
"rework_cycles": 1,
"accepted": true,
"hallucination_detected": false,
"duration_seconds": 180
}
Field Reference
| Field | Type | Description |
|---|---|---|
timestamp |
string | ISO 8601 completion time |
task_id |
string | Unique identifier for this task |
task_type |
string | implementation, design, bugfix, testing, documentation, analysis |
task_description |
string | Brief description of the task |
agents_used |
string[] | Agent names that were loaded |
skills_used |
string[] | Skill names that were loaded |
tokens.input |
number | Input tokens from API usage field |
tokens.output |
number | Output tokens from API usage field |
tokens.total |
number | Sum of input + output |
cost_usd |
number | Estimated cost based on token rates |
llm |
string | Model ID used |
context_summarizations |
number | Times summarization was triggered |
phases |
number | Number of loading phases |
rework_cycles |
number | Number of revision cycles |
accepted |
boolean | Whether the user accepted the output |
hallucination_detected |
boolean | Whether a hallucination was flagged |
duration_seconds |
number | Wall-clock seconds from start to delivery |
Review Value Entry (#348)
/build appends one entry per inline review checkpoint to
metrics/review-value.jsonl so the pipeline's review overhead becomes
measurable — distinguishing a build where review caught and fixed a real defect
from one where every loop passed no-op. This is the sensor that lets the plan/step
tiering (the /plan plan-tier and /build per-step complexity routing) be
right-sized with evidence rather than guessed.
{
"timestamp": "2026-06-22T14:30:00Z",
"plan": "plans/add-auth.md",
"slice": "2",
"step": "all",
"checkpoint": "slice",
"complexity": "standard",
"agents_run": ["spec-compliance-review", "security-review"],
"issues_found": 1,
"issues_fixed": 1,
"fix_iterations": 1,
"outcome": "fixed"
}
| Field | Type | Description |
|---|---|---|
checkpoint |
string | step (per-step complex review) or slice (batched slice-boundary review) |
step |
string | N.M for a per-step checkpoint; all for a batched slice checkpoint |
agents_run |
string[] | Review agents dispatched at this checkpoint |
issues_found |
number | Actionable issues the checkpoint surfaced |
issues_fixed |
number | Of those, how many were auto-fixed |
fix_iterations |
number | Review-fix loop iterations consumed |
outcome |
string | no-op (passed clean), fixed (found + fixed), escalated (loop didn't converge) |
Privacy: counts and outcomes only — never prompt text, code, or file content,
consistent with the cost meter's privacy boundary. Disable with
DEV_TEAM_REVIEW_VALUE=off. Report it with /cost-report (its "review value"
section).
When to Log
| Event | Action |
|---|---|
| Task completed | Log full task completion entry |
/build inline review checkpoint |
Append a Review Value entry to metrics/review-value.jsonl (#348) |
| Configuration change | Log in metrics/config-changelog.jsonl (see Feedback & Learning skill) |
| Hallucination detected | Flag in task entry + log separately if correction applied |
| Context summarization triggered | Increment counter in current task entry |
Output
JSONL log entries written to metrics/ and/or a summary report of metric trends. Be concise — report anomalies and trend signals; omit entries within normal range.
Reporting
Periodically review metrics to identify patterns:
- Weekly: Review task completion entries for rework trends and hallucination rate
- Monthly: Aggregate cost metrics and LLM routing distribution
- Per-project: Compare first-pass acceptance rate across task types
Summaries can be written to metrics/reports/ for historical reference.