name: session-review
description: >-
Mine real Claude Code session transcripts to suggest plugin improvements that
cut token spend, reduce re-work, and improve accuracy. Use when the user asks
to "review my sessions", "where am I wasting tokens", "why does this keep
re-doing work", or "/session-review".
argument-hint: "[--cwd ] [--transcript ] [--out ]"
user-invocable: true
allowed-tools: >-
Read, Glob, Bash(python3 *, date *, mkdir *), Write, Agent
Session Review (#131)
Role: orchestrator. Mines ground-truth session transcripts and routes suggestions into existing machinery — it suggests, never auto-applies, and preserves every human gate.
You have been invoked with the /session-review command.
Orchestrator constraints
- Never read raw transcripts yourself. All heavy parsing is the deterministic extractor's job; you read only its KB-sized digest. Spending model tokens to study token spend defeats the purpose. The one exception is the gated raw-log tier (Step 2b): even there you never read a raw log — you dispatch one isolated sub-agent per flagged log and keep only its metrics-only findings.
- Suggest, never apply. Output a ranked report and hand off; do not edit agents, skills, or config. Human gates stay intact.
- Metrics only. The digest and report contain counts/ratios/names — never prompt or code content.
Argument: $ARGUMENTS
--cwd <path>: project whose transcripts to mine (default: current project).--transcript <file>: analyze a specific transcript instead of auto-resolving.--out <report>: report path (default:reports/session-review-<date>.md).
Steps
0. Queued Findings — surface pending-review queue before fresh analysis
Before running fresh analysis, check whether background analysis has produced findings that are waiting for review.
Check
metrics/pending-review.jsonlfor entries wherereviewed_atis absent. Count them and note theirqueued_attimestamps.If one or more unreviewed entries exist, display a Queued Findings section:
## Queued Findings (N unreviewed) | # | queued_at | source | findings | |---|-----------|--------|----------| | 1 | <ts> | <src> | <count> |Then offer: "Run
/feedback-learningto approve or reject each queued finding before proceeding with fresh analysis."If
metrics/pending-review.jsonldoes not exist, or all entries havereviewed_at, skip this section silently and proceed.If any JSONL lines are malformed, skip those lines, emit a one-line warning (
WARN: skipped N malformed line(s) in pending-review.jsonl), and continue with the valid entries.
pending-review.jsonl Schema
Each line is a JSON object:
{
"queued_at": "2026-06-01T12:00:00Z",
"source": "session-learning-trigger",
"session_id": "abc-123",
"findings": [
{
"lever": "instruction-rule",
"evidence": "3 occurrences in last 5 sessions",
"target_artifact": "agents/orchestrator.md",
"proposed_change": "Add constraint: always load context-loading-protocol first",
"route": "feedback-learning"
}
]
}
Optional fields added by /feedback-learning after disposition:
reviewed_at(ISO-8601 UTC) +approved_by— written on approvalrejected_at(ISO-8601 UTC) +rejected_by— written on rejection
1. Cross-machine Telemetry — validate config, then sync (#178)
Before analysing, check whether a telemetry repository (the cross-machine "database", Delta D) is configured, so the digest reflects every machine, not just this one:
bash ${CLAUDE_PLUGIN_ROOT}/../../scripts/telemetry-sync.sh --check
Exit 0 → a repo is configured. Run the sync to push this machine's digest and pull the others, then continue:
bash ${CLAUDE_PLUGIN_ROOT}/../../scripts/telemetry-sync.shExit 3 → no repo configured. Ask the user for the telemetry repo location (a git URL), e.g. "Where should cross-machine telemetry be stored? Paste a private git repo URL, or say 'skip' to review this machine only."
- If they give a URL, write it to
~/.claude/.dev-team/telemetry.jsonas{ "remote": "<url>" }(create the dir if needed), confirm, then run the sync command above. Point them attelemetry-repo-security.mdfor the one-time deploy-key/token setup. - If they say skip, proceed local-only — do not block the review.
- If they give a URL, write it to
Never invent a URL or enable anything without the user's explicit location.
2. Extract (deterministic, zero model tokens)
Run the extractor to produce the digest:
python3 ${CLAUDE_PLUGIN_ROOT}/../../scripts/session_extract.py \
--plugin-root ${CLAUDE_PLUGIN_ROOT} -o memory/session-digest.json
(Pass --transcript <file> or --cwd <path> through from $ARGUMENTS.) If the
extractor finds no transcripts, tell the user and stop — nothing to review.
If a telemetry repo synced in Step 0, also build the cross-machine rollup (the union of every host's digest, #178) and prefer it for analysis — it sees all machines and projects, not just this one:
CLONE="${DEV_TEAM_TELEMETRY_CLONE:-$HOME/.claude/.dev-team/agent-telemetry}"
python3 ${CLAUDE_PLUGIN_ROOT}/../../scripts/session_extract.py \
--plugin-root ${CLAUDE_PLUGIN_ROOT} --rollup "$CLONE/digests" \
-o memory/telemetry-rollup.json
The rollup is metrics-only (telemetry-rollup/v1): per-host and per-project
token/cost, summed rework/accuracy, and skills/agents never invoked on any
machine. Hand the analysis agent the rollup when present, the local digest
otherwise.
Then compute the frequency → lever escalation (Delta C, #179) — recurrence decides how strong a response each friction earns:
python3 ${CLAUDE_PLUGIN_ROOT}/../../scripts/session_extract.py \
--plugin-root ${CLAUDE_PLUGIN_ROOT} --escalate "$CLONE/digests" \
-o memory/telemetry-escalation.json
Each recommendation carries a lever: hint (rare — surface only),
instruction-rule (recurring; hand to /feedback-learning), or hook
(frequent and deterministically matchable; validate via /agent-eval before
shipping). "Matchable" is the deterministic side of the rules-vs-prompts ≤10% FP
policy. Use the escalation lever to set the hand-off in Step 3.
Optionally compute the gate correlation (process eval, #111) — does bypassing the pre-commit review gate correlate with more rework across sessions?
python3 ${CLAUDE_PLUGIN_ROOT}/../../scripts/session_extract.py \
--plugin-root ${CLAUDE_PLUGIN_ROOT} --correlate "$CLONE/digests" \
-o memory/gate-correlation.json
This compares mean rework between bypass and non-bypass committing sessions. It is correlational, not causal — surface it as evidence for whether the review gate earns its place (feeds the ADR-0006 decision, #112), never as proof.
3. Analyze (digest-only)
Dispatch the session-analysis agent with the digest path as its sole input.
The agent maps aggregated patterns to probable plugin causes and returns ranked
suggestions, each tagged {token | rework | accuracy} with a named target
artifact and a hand-off destination. The agent reads only the digest.
3b. Raw-log semantic tier — only the worst sessions (#214, Delta A/B)
The digest is quantitative, so it is blind to frictions with no count-signature
— a hallucinated citation (the AI cited a skill/source that does not exist), or an
operator habit (deferring decisions with no owner). Surface those with a bounded
second tier: the deterministic digest decides where it is worth spending tokens,
then you read only those few raw logs. This tier needs the cross-machine digests
(the $CLONE/digests from Step 0); if no telemetry repo synced, skip it.
Flag the worst sessions (deterministic, zero model tokens):
python3 ${CLAUDE_PLUGIN_ROOT}/../../scripts/eval_rawlog.py \ --flag "$CLONE/digests" --top 5 -o memory/worst-sessions.jsonIf it flags nothing, skip this tier — there is nothing expensive to read.
For each flagged
session_idonly, locate its raw transcript (~/.claude/projects/**/<session_id>.jsonl) and dispatch one sub-agent per transcript (one ~1MB log = one agent = one context boundary; fan out in parallel). Each sub-agent reads its single raw log and returns semantic frictions the digest cannot see. You never read the raw log yourself (constraint 1); the raw log never leaves the machine. Require each finding in the metrics-only shape the validator enforces — only these keys:{lens, friction_type, target_artifact, confidence, count}(plus optionalsession_id/project/host), withlens∈{methodology, harness, devex, accuracy}. A finding may say that a friction occurred and which artifact to fix, never a prompt, code, path, or quote.Mechanically verify the privacy boundary before keeping anything — drop (do not report) any finding with violations:
python3 ${CLAUDE_PLUGIN_ROOT}/../../scripts/eval_rawlog.py \ --validate memory/tier2-findings.json
4. Suggest (write the report)
Write reports/session-review-<date>.md (or --out). Rank the suggestions and,
for each, record: the tag ({token|rework|accuracy} from the digest, or a Tier-2
lens {methodology|harness|devex}), the evidence (metrics only), the concrete
target artifact, the proposed change, and the hand-off destination from the table
below. Nothing is auto-applied.
The methodology lens (from Step 2b only) observes the operator's own habits
(e.g. deferring decisions with no owner). These have no target artifact and no
hook — their hand-off is "to the human, as an observation." Put them under their
own report heading; never route them to a gate.
| Suggestion kind | Hand off to |
|---|---|
| Config / prompt / convention fix | /feedback-learning |
| Effort re-banding | /harness-audit + .claude/model-ladder.json |
| New / changed detection rule | /agent-eval (validate before shipping) |
| Token-heavy skill / agent | token-efficiency-review |
Operator methodology observation (methodology) |
the human, as an observation — no artifact, no hook |
5. Persist the trend (#129)
Append one metrics-only summary record to the trend stream so /harness-audit
can consume real-session data over time:
python3 ${CLAUDE_PLUGIN_ROOT}/../../scripts/session_extract.py \
--plugin-root ${CLAUDE_PLUGIN_ROOT} --append metrics/session-digest.jsonl >/dev/null
The appended record holds aggregate counts only — no file names, prompts, or code (see the schema in the eval-system docs).
6. Report
Print the report path and the top-ranked suggestions. Do not invent numbers — cite exactly what the digest and the analysis agent emit.
OSS complements
For continuous quantitative monitoring, recommend (don't replace) ccusage,
native OpenTelemetry, and claude-code-log. This skill covers the
plugin-specific qualitative suggestions those tools cannot — they don't know
this plugin's agents/skills. See the eval-system docs for details.