performance-metrics - SKILL.md Agent Skill

name: performance-metrics
description: Log task completion data to metrics/. Use at the end of every task to record tokens, cost, agents used, rework cycles, and hallucination events. Also use for periodic reporting to identify efficiency and quality trends.
role: worker
user-invocable: true

Performance Metrics

Overview

Schema and procedures for capturing performance data in metrics/. Metrics enable evidence-based evaluation of agent effectiveness, cost efficiency, and quality outcomes.

Constraints

Never log credentials, API keys, or PII in metric entries
Log entries are append-only; do not modify or delete existing JSONL records
Log at task completion, not mid-task; mid-task state belongs in memory/ progress files
Use the defined JSONL schema; do not invent new top-level fields without updating the reference

Metric Categories

Targets discipline. Per CLAUDE.md → "Claims discipline", a numeric target may only ship if an instrument measures it. The instrumented metrics below cite their sensor; the rest read "Aspirational — no sensor yet" until one exists (tracked by #102 cost metering and #106 telemetry). Do not reintroduce bare numeric targets — tests/docs/prose_honesty_test.bats enforces this across all shipped prose.

Efficiency Metrics

Metric	Description	Target
Task completion time	Wall-clock time from request to delivery	Track trend, no fixed target
Token usage per task	Total input + output tokens consumed	Minimize for comparable quality
Agent loading overhead	Tokens spent on agent/skill file reads	Aspirational — no sensor yet
Context summarization frequency	How often summarization triggers per task	Aspirational — no sensor yet

Quality Metrics

Metric	Description	Target
First-pass acceptance rate	Tasks accepted without rework	Aspirational — no sensor yet
Rework count	Number of revision cycles per task	Aspirational — no sensor yet
Hallucination incidents	Outputs containing fabricated information	Aspirational — no sensor yet
Accuracy score	Correctness of structured data extraction	Aspirational — no sensor yet
Test coverage	Percentage of code covered by generated tests	Track per project

Cost Metrics

Metric	Description	Target
Cost per task	Total API cost (input + output tokens at rate)	Track trend
LLM routing ratio	Percentage of tasks routed to each LLM	Track distribution
Selective loading savings	Tokens saved vs. loading all agents	Aspirational — no sensor yet
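
The two instrumented cost metrics can be derived from logged token counts and the `llm` field of task entries. The sketch below is one possible way to compute them. The function names are hypothetical, and the per-million-token rates are parameters because this skill does not define any prices.

```python
from collections import Counter


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_rate_per_mtok: float, output_rate_per_mtok: float) -> float:
    """Estimate cost in USD from token counts and per-million-token rates.

    The rates are caller-supplied assumptions, not real prices.
    """
    cost = (input_tokens * input_rate_per_mtok
            + output_tokens * output_rate_per_mtok) / 1_000_000
    return round(cost, 6)


def routing_ratio(llms: list[str]) -> dict[str, float]:
    """Return the percentage of tasks routed to each LLM, keyed by `llm` value."""
    counts = Counter(llms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {llm: 100.0 * n / total for llm, n in counts.items()}
```

Passing the rates in explicitly keeps the calculation reproducible when prices change, since old entries can be re-costed without editing the append-only log.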

Log Format

Metrics are stored in metrics/ as JSONL files (one JSON object per line).

File Naming

metrics/{date}-task-log.jsonl

Example: metrics/2026-02-20-task-log.jsonl
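
A minimal sketch of the naming pattern combined with the append-only constraint. It assumes `{date}` is the UTC calendar date of task completion in `YYYY-MM-DD` form, which matches the example above but is not stated explicitly. `task_log_path` and `append_entry` are hypothetical helper names.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def task_log_path(completed_at: datetime, metrics_dir: Path = Path("metrics")) -> Path:
    """Build metrics/{date}-task-log.jsonl from a timezone-aware completion time."""
    # Assumption: the file date is the UTC date of completion.
    day = completed_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return metrics_dir / f"{day}-task-log.jsonl"


def append_entry(path: Path, entry: dict) -> None:
    """Append one JSON object as a single line; existing records are never rewritten."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, separators=(",", ":")) + "\n")
```

Opening the file in `"a"` mode means the helper can only add records at the end, which is the simplest way to honor the append-only rule.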

Task Completion Entry

Logged at the end of each task:

{
  "timestamp": "2026-02-20T14:30:00Z",
  "task_id": "unique-id",
  "task_type": "implementation",
  "task_description": "Build REST API for user authentication",
  "agents_used": ["software-engineer", "architect"],
  "skills_used": ["hexagonal-architecture"],
  "tokens": {
    "input": 12500,
    "output": 3200,
    "total": 15700
  },
  "cost_usd": 0.043,
  "llm": "opus",
  "context_summarizations": 0,
  "phases": 2,
  "rework_cycles": 1,
  "accepted": true,
  "hallucination_detected": false,
  "duration_seconds": 180
}

Field Reference

Field	Type	Description
`timestamp`	string	ISO 8601 completion time
`task_id`	string	Unique identifier for this task
`task_type`	string	`implementation`, `design`, `bugfix`, `testing`, `documentation`, `analysis`
`task_description`	string	Brief description of the task
`agents_used`	string[]	Agent names that were loaded
`skills_used`	string[]	Skill names that were loaded
`tokens.input`	number	Input tokens from API usage field
`tokens.output`	number	Output tokens from API usage field
`tokens.total`	number	Sum of input + output
`cost_usd`	number	Estimated cost based on token rates
`llm`	string	Model ID used
`context_summarizations`	number	Times summarization was triggered
`phases`	number	Number of loading phases
`rework_cycles`	number	Number of revision cycles
`accepted`	boolean	Whether the user accepted the output
`hallucination_detected`	boolean	Whether a hallucination was flagged
`duration_seconds`	number	Wall-clock seconds from start to delivery
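
The field reference can be checked mechanically before an entry is written. The following is a sketch only: it assumes every listed field is required, which the table does not state, and `validate_task_entry` is a hypothetical name, not part of this skill.

```python
NUMBER = (int, float)

# Top-level fields and expected Python types, taken from the Field Reference table.
TASK_ENTRY_FIELDS = {
    "timestamp": str, "task_id": str, "task_type": str, "task_description": str,
    "agents_used": list, "skills_used": list, "tokens": dict,
    "cost_usd": NUMBER, "llm": str, "context_summarizations": NUMBER,
    "phases": NUMBER, "rework_cycles": NUMBER, "accepted": bool,
    "hallucination_detected": bool, "duration_seconds": NUMBER,
}
TASK_TYPES = {"implementation", "design", "bugfix", "testing", "documentation", "analysis"}


def _type_ok(value, expected) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly for number fields.
    if expected is NUMBER and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def validate_task_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry matches the schema."""
    problems = []
    for name, expected in TASK_ENTRY_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not _type_ok(entry[name], expected):
            problems.append(f"wrong type for field: {name}")
    # New top-level fields are not allowed without updating the reference.
    for name in sorted(set(entry) - set(TASK_ENTRY_FIELDS)):
        problems.append(f"unknown top-level field: {name}")
    task_type = entry.get("task_type")
    if isinstance(task_type, str) and task_type not in TASK_TYPES:
        problems.append(f"unknown task_type: {task_type}")
    tokens = entry.get("tokens")
    if isinstance(tokens, dict):
        parts = [tokens.get(k) for k in ("input", "output", "total")]
        if not all(_type_ok(p, NUMBER) for p in parts):
            problems.append("tokens.input, tokens.output, tokens.total must be numbers")
        elif parts[0] + parts[1] != parts[2]:
            problems.append("tokens.total must equal tokens.input + tokens.output")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every schema violation at once.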

Review Value Entry (#348)

/build appends one entry per inline review checkpoint to metrics/review-value.jsonl so that the pipeline's review overhead becomes measurable. The entries distinguish a build where review caught and fixed a real defect from one where every review loop passed as a no-op. This is the sensor that lets the plan/step tiering (the /plan plan-tier and /build per-step complexity routing) be right-sized from evidence rather than guesswork.

{
  "timestamp": "2026-06-22T14:30:00Z",
  "plan": "plans/add-auth.md",
  "slice": "2",
  "step": "all",
  "checkpoint": "slice",
  "complexity": "standard",
  "agents_run": ["spec-compliance-review", "security-review"],
  "issues_found": 1,
  "issues_fixed": 1,
  "fix_iterations": 1,
  "outcome": "fixed"
}

Field	Type	Description
`checkpoint`	string	`step` (per-step `complex` review) or `slice` (batched slice-boundary review)
`step`	string	`N.M` for a per-step checkpoint; `all` for a batched slice checkpoint
`agents_run`	string[]	Review agents dispatched at this checkpoint
`issues_found`	number	Actionable issues the checkpoint surfaced
`issues_fixed`	number	Of those, how many were auto-fixed
`fix_iterations`	number	Review-fix loop iterations consumed
`outcome`	string	`no-op` (passed clean), `fixed` (found + fixed), `escalated` (loop didn't converge)

Privacy: entries record counts and outcomes only, never prompt text, code, or file content, consistent with the cost meter's privacy boundary. Disable logging with DEV_TEAM_REVIEW_VALUE=off. View the results in the "review value" section of /cost-report.
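
A sketch of how a Review Value entry might be classified and appended. Two points are assumptions rather than documented behavior: the mapping from issue counts to `outcome` is inferred from the field descriptions, and the opt-out compares the variable to `off` case-insensitively. `classify_outcome` and `record_review_value` are hypothetical names and do not describe how /build is implemented.

```python
import json
import os
from pathlib import Path


def classify_outcome(issues_found: int, issues_fixed: int) -> str:
    """Map issue counts to an outcome label (inferred mapping, see lead-in)."""
    if issues_found == 0:
        return "no-op"       # passed clean
    if issues_fixed >= issues_found:
        return "fixed"       # found and fixed
    return "escalated"       # loop did not converge


def record_review_value(entry: dict,
                        path: Path = Path("metrics/review-value.jsonl"),
                        env=None) -> bool:
    """Append the entry unless DEV_TEAM_REVIEW_VALUE=off; return True if written."""
    env = os.environ if env is None else env
    # Assumption: the opt-out value is matched case-insensitively.
    if env.get("DEV_TEAM_REVIEW_VALUE", "").strip().lower() == "off":
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return True
```

Taking `env` as a parameter keeps the opt-out testable without mutating the process environment.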

When to Log

Event	Action
Task completed	Log full task completion entry
`/build` inline review checkpoint	Append a Review Value entry to `metrics/review-value.jsonl` (#348)
Configuration change	Log in `metrics/config-changelog.jsonl` (see Feedback & Learning skill)
Hallucination detected	Flag in task entry + log separately if correction applied
Context summarization triggered	Increment counter in current task entry

Output

This skill produces JSONL log entries written to metrics/, a summary report of metric trends, or both. Keep reports concise: report anomalies and trend signals, and omit entries within normal range.

Reporting

Periodically review metrics to identify patterns:

Weekly: Review task completion entries for rework trends and hallucination rate
Monthly: Aggregate cost metrics and LLM routing distribution
Per-project: Compare first-pass acceptance rate across task types

Summaries can be written to metrics/reports/ for historical reference.
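
The review cadences above reduce to a few aggregations over task completion entries. The sketch below is one possible shape. It treats a task as first-pass accepted when `accepted` is true and `rework_cycles` is 0, which is a reading of "accepted without rework" and not a documented formula. `load_entries` and `summarize` are hypothetical names.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_entries(paths: list[Path]) -> list[dict]:
    """Read task completion entries from one or more JSONL files."""
    entries = []
    for path in paths:
        with path.open(encoding="utf-8") as fh:
            entries.extend(json.loads(line) for line in fh if line.strip())
    return entries


def summarize(entries: list[dict]) -> dict:
    """Aggregate rework, hallucination, cost, and first-pass acceptance figures."""
    if not entries:
        return {"tasks": 0}
    n = len(entries)
    per_type = defaultdict(lambda: [0, 0])  # task_type -> [first_pass, total]
    for e in entries:
        bucket = per_type[e["task_type"]]
        bucket[1] += 1
        # Assumption: first-pass means accepted with zero rework cycles.
        if e["accepted"] and e["rework_cycles"] == 0:
            bucket[0] += 1
    return {
        "tasks": n,
        "mean_rework_cycles": sum(e["rework_cycles"] for e in entries) / n,
        "hallucination_rate": sum(1 for e in entries if e["hallucination_detected"]) / n,
        "total_cost_usd": round(sum(e["cost_usd"] for e in entries), 6),
        "first_pass_acceptance": {t: fp / total for t, (fp, total) in per_type.items()},
    }
```

For a weekly or monthly review, the input could be gathered with `load_entries(sorted(Path("metrics").glob("*-task-log.jsonl")))` and filtered by `timestamp` before calling `summarize`.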