pr-review-coach - SKILL.md Agent Skill

name: pr-review-coach
description: Companion to the `pr-atom-reviewer` skill. Reads the review history stored in AllSource Prime (the `prime_*` MCP tools) and turns it into actionable improvements to how the reviewer works. It has three jobs. (1) Ingest post-merge outcomes for past reviews (was the split followed? did the PR cause a revert?). (2) Cluster the repo's actual atom vocabulary so the calibration table reflects this team's work instead of generic examples. (3) Surface contradictions across past reviews so the user can decide what to harden into the SKILL.md. Use this skill when the user asks "what have we learned from reviews", "audit my past reviews", "what should I tighten in the reviewer skill", "did my recommended splits get followed", "how is the reviewer doing", "weekly review retro", or runs a periodic skill-improvement pass. Trigger also on phrases like "PR retro", "review postmortem", "calibrate my reviewer", or when the user mentions wanting to update the calibration table or the SKILL.md for the reviewer.

PR Review Coach

A meta-skill that improves the pr-atom-reviewer skill over time by reading from AllSource Prime and producing concrete edits the user can apply.

This skill runs in Claude Code in the terminal. It expects:

The prime_* MCP tools (from allsource-prime) to be available.
The pr-atom-reviewer skill to be installed and to have recorded at least a few reviews via its Step 7.
git, gh (GitHub CLI), and jq available for outcome-tracking jobs.
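
A preflight check for the command-line tools listed above could look like the following minimal sketch. The function name `missing_tools` and the injectable `which` parameter are illustrative assumptions, not part of the skill. It does not cover the prime_* MCP tools, which are not PATH executables and have to be checked through the MCP client instead.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# CLI tools the outcome-tracking jobs rely on.
REQUIRED_TOOLS = ("git", "gh", "jq")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Outcome tracking will be limited; missing: " + ", ".join(missing))
    else:
        print("git, gh, and jq are all available.")
```

Passing `which` as a parameter keeps the check testable without touching the real PATH.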

If Prime isn't available, stop and tell the user: this skill is useless without the memory store the reviewer writes to. Direct them to install allsource-prime and configure it as an MCP server in Claude Code:

cargo install allsource-prime
claude mcp add prime -s user -- allsource-prime --data-dir ~/.prime/memory

Then re-run the reviewer on a few PRs to build up a history before invoking this skill.

The three jobs (Loops 2–4)

This skill has three distinct jobs. Always ask the user which one they want first; don't try to run all three in one go — each produces a focused output worth reviewing on its own.

1. ingest-outcomes — Loop 2: tag past reviews with what happened after
2. cluster-vocab   — Loop 3: build a repo-specific calibration table
3. find-drift      — Loop 4: surface contradictions across past reviews

If the user says "do all of them", run them in that order — outcomes first (it's the most valuable signal), vocab second (it depends on having outcome-tagged reviews), drift last (it depends on both).

Job 1: ingest-outcomes (Loop 2)

The point: every review the pr-atom-reviewer made is a prediction (this PR will cause problems if merged as-is / this split is the right shape). Real life either confirms or refutes it. Without ingesting outcomes, the reviewer is making predictions into a void.

What to gather

For each pr_review node in Prime, find out:

Was the PR merged as-was, or split? If split: how many of the recommended atoms ended up as separately-merged PRs? Did any sit unmerged for more than 14 days (the staleness threshold used when checking split PRs)?
Did the merged PR cause a revert, hotfix, or follow-up "fix" PR within 7 days? This is the strongest "the reviewer should have blocked harder" signal.
Did the team actually attach the e2e proof the reviewer demanded? Or did they merge anyway?

How to gather it

The user's setup may not have a fully-automated pipeline. Be pragmatic — use what's available:

1. List pending review nodes. Call prime_search (or whichever Prime tool lists nodes by type) for nodes of type pr_review that don't yet have an outcome property set. Limit each batch to a manageable size (say, 20 at a time).
2. For each review node, reconstruct the PR identity from its properties: repo_key, branch, reviewed_at. Then try, in order:
   - GitHub CLI: if repo_key is a GitHub URL, run gh pr list --repo <owner/repo> --head <branch> --state all --json number,state,mergedAt,closedAt,title to find the PR. If multiple match (rebases, retries), take the most recent.
   - Git log: if gh isn't available or the repo isn't on GitHub, run git log origin/<base> --merges --since=<reviewed_at> --grep="<branch>" to find when (if ever) the branch merged. This relies on the merge commit message naming the branch, so squash or rebase merges may need a broader search such as git log --all --grep="<branch>".
   - Ask the user as a last resort, batched: "Here are 12 reviews I can't auto-resolve. For each, tell me: merged as-was / split / abandoned." Don't ask one at a time; that's hostile.
3. For merged PRs, check for follow-up trouble: gh pr list --repo <owner/repo> --base <base> --state all --search "fix OR revert OR hotfix in:title created:<merge_date>..<merge_date + 7d>". The date window goes inside the search query because gh pr list has no --created flag. Any matches that mention the original branch name or touched the same files are likely "the reviewer should have caught this" signals.
4. For split PRs, check whether each recommended atom got a corresponding merged PR. Match by the branch-name suggestions the reviewer made (these are stored on the atom nodes). Atoms with no matching merged PR within 14 days are stale; flag them for the user.
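
The resolution steps above can be sketched as three small helpers. This is a sketch under stated assumptions, not the skill's implementation: the function names are hypothetical, "most recent" is approximated by the highest PR number, and the follow-up window is folded into the search string with GitHub's `created:` range qualifier.

```python
import json
from datetime import date, timedelta
from typing import Any, Dict, Optional


def latest_pr(gh_json: str) -> Optional[Dict[str, Any]]:
    """Pick one PR from `gh pr list --json number,state,mergedAt,closedAt,title` output."""
    prs = json.loads(gh_json)
    if not prs:
        return None
    # PR numbers increase over time within a repo, so the highest is the latest retry.
    return max(prs, key=lambda pr: pr["number"])


def followup_search(merge_date: date, window_days: int = 7) -> str:
    """Build a search string for fix/revert/hotfix PRs created inside the window."""
    end = merge_date + timedelta(days=window_days)
    return (
        "fix OR revert OR hotfix in:title "
        f"created:{merge_date.isoformat()}..{end.isoformat()}"
    )


def atom_status(
    reviewed_on: date,
    merged_on: Optional[date],
    today: date,
    stale_after_days: int = 14,
) -> str:
    """Classify a recommended atom as merged, stale, or unknown."""
    if merged_on is not None:
        return "merged"
    if today - reviewed_on > timedelta(days=stale_after_days):
        return "stale"
    # Not merged yet, but still inside the grace period.
    return "unknown"
```

The string returned by `followup_search` is meant to be passed as the value of `--search`; branch-name and touched-file matching still has to happen on the results.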

What to write back to Prime

For each review node, set or add:

outcome — one of: merged_as_was, split_as_recommended, split_partially, abandoned, unknown
caused_followup_within_7d — boolean (with a followup_pr_url if true)
proof_actually_attached — boolean (only if you can verify from the PR description)
outcome_observed_at — timestamp when this skill made the determination

Use whatever Prime tool updates node properties — check the tool's schema first; don't guess.

For atoms recorded under reviews whose outcome is split_partially, add an atom_status property (one of: merged, stale, unknown).
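
One way to keep the written-back properties consistent is to validate them before calling whichever Prime update tool applies. The sketch below assumes a plain dictionary payload. The `ReviewOutcome` class and its `to_properties` method are hypothetical helpers, and the real payload shape must come from the Prime tool's schema.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional

# Allowed values for the outcome property, as listed above.
OUTCOMES = {
    "merged_as_was",
    "split_as_recommended",
    "split_partially",
    "abandoned",
    "unknown",
}


@dataclass
class ReviewOutcome:
    outcome: str
    outcome_observed_at: str  # ISO 8601 timestamp of this determination
    caused_followup_within_7d: Optional[bool] = None
    followup_pr_url: Optional[str] = None
    proof_actually_attached: Optional[bool] = None  # leave None if unverifiable

    def to_properties(self) -> Dict[str, Any]:
        """Validate and return only the properties that were actually determined."""
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if self.caused_followup_within_7d and not self.followup_pr_url:
            raise ValueError("a followup needs a followup_pr_url")
        # Unset fields are dropped so nothing unverified is written as a fact.
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Dropping unset fields means proof_actually_attached stays absent unless it was verified from the PR description.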

The output to the user

A short table:

Ingested outcomes for N reviews:

  Verdict given       | Outcome              | Caused 7-day followup
  --------------------|----------------------|----------------------
  NEEDS SPLIT (12)    | split (8)            | 1
                      | merged as-was (3)    | 2  ⚠
                      | abandoned (1)        | -
  REQUEST CHANGES (5) | merged after fix (4) | 0
                      | merged as-was (1)    | 1  ⚠
  APPROVE (23)        | merged (22)          | 1
                      | abandoned (1)        | -

Signals worth your attention:
  - 2 of 3 NEEDS-SPLIT verdicts that were merged as-was caused followups.
    The reviewer was right to push back; the team overrode it.
  - 1 APPROVE caused a followup — review #47 (branch `add-csv-export`).
    Worth re-reading what the reviewer missed there.

Don't editorialise beyond the signals. The numbers are the message.
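
The counts in a table like the one above can be derived mechanically from the tagged reviews. A minimal sketch, assuming each review is available as a dictionary with verdict, outcome, and caused_followup_within_7d keys (the dictionary shape and the name `tally_outcomes` are illustrative assumptions):

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

Row = Tuple[str, str, int, int]  # verdict, outcome, reviews, followups


def tally_outcomes(reviews: Iterable[Dict[str, Any]]) -> List[Row]:
    """Count reviews and 7-day followups per (verdict, outcome) pair."""
    counts: Counter = Counter()
    followups: Counter = Counter()
    for review in reviews:
        key = (review["verdict"], review.get("outcome", "unknown"))
        counts[key] += 1
        if review.get("caused_followup_within_7d"):
            followups[key] += 1
    # Sorted so the same input always produces the same row order.
    return [(v, o, counts[(v, o)], followups[(v, o)]) for v, o in sorted(counts)]


if __name__ == "__main__":
    demo = [
        {"verdict": "NEEDS SPLIT", "outcome": "merged_as_was",
         "caused_followup_within_7d": True},
        {"verdict": "APPROVE", "outcome": "merged_as_was"},
    ]
    for verdict, outcome, n, bad in tally_outcomes(demo):
        print(f"{verdict:<19} | {outcome} ({n}) | {bad}")
```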

Job 2: cluster-vocab (Loop 3)

The point: the pr-atom-reviewer ships with a generic calibration table ("Add a priority column" = right-sized, "Build the priority feature" = too big). That's a placeholder. After 30+ atoms have been recorded for a given repo, the repo's own history is a far better calibration table.

What to do

1. Pull atoms for the target repo. Ask the user which repo to cluster (or default to the repo at the current working directory if it has any recorded reviews). Call prime_neighbors on the repo node with type pr_review, then prime_neighbors on each review with type atom. Collect everything.
2. Filter to atoms whose review had a good outcome: (verdict = APPROVE, or verdict = NEEDS SPLIT with outcome = split_as_recommended) and caused_followup_within_7d = false. These are atoms that worked. Don't cluster atoms from reviews that caused followups; those are bad examples.
3. Group by independence_technique (pure_refactor, flag_off, dead_code_unused_export, storybook_only, etc.). Within each technique, list the actual behaviour strings.
4. Hand it to Prime's vector index for grouping. Call prime_similar on the atom embeddings (or, if there are fewer than ~50 atoms, just list them all; clustering 30 items by hand is fine). Group atoms whose embeddings cluster together; label each cluster with a short common phrase (e.g. "schema migration", "new dormant service", "additive UI component").
5. Compare against the reviewer's generic calibration table. Note where this repo's vocabulary differs:
   - Categories the generic table has that this repo never uses
   - Categories this repo uses heavily that the generic table doesn't mention
   - Behaviours this repo treats as "one atom" that the generic table would split (or vice versa)
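
The filter and grouping steps above can be sketched as follows. The sketch assumes each review has already been fetched into a dictionary carrying its verdict, outcome, followup flag, and a nested list of atoms. That shape and the function names are hypothetical; the real data arrives through prime_neighbors.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def is_good_outcome(review: Dict[str, Any]) -> bool:
    """True for reviews whose atoms are safe to use as calibration examples."""
    if review.get("caused_followup_within_7d"):
        return False  # a followup makes the review a bad example
    if review["verdict"] == "APPROVE":
        return True
    return (
        review["verdict"] == "NEEDS SPLIT"
        and review.get("outcome") == "split_as_recommended"
    )


def group_by_technique(reviews: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each independence_technique to the behaviour strings that used it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for review in reviews:
        if not is_good_outcome(review):
            continue
        for atom in review.get("atoms", []):
            groups[atom["independence_technique"]].append(atom["behaviour"])
    return dict(groups)
```

Each technique's share of atoms (for example, the proportion using `flag_off`) follows directly from the list lengths in the returned dictionary.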

The output to the user

A drop-in replacement calibration table the user can paste into the reviewer's SKILL.md, plus a diff against the generic one:

## Repo-specific calibration table for <repo_key>

Based on N atoms across M reviews (outcome-good only):

| Right-sized (one atom) | Too big (multiple atoms) |
|---|---|
| <real example from this repo> | <real example from this repo> |
| <real example>                | <real example>                |
| ...                            | ...                            |

Differences from the generic table:
  - This repo uses `flag_off` for 40% of atoms — the generic table doesn't emphasize this enough.
  - "Add a Storybook entry" is consistently treated as part of the component atom here,
    not its own. The generic table implies they could be separate.
  - The generic example "Add a Stripe webhook handler with no business logic" never
    appears here — this team doesn't use that pattern.

To apply: copy the table above into the reviewer's SKILL.md, replacing the
generic calibration table in the "Identify the atoms" section.

Don't apply the change yourself. Show it; let the user decide.

Job 3: find-drift (Loop 4)

The point: the reviewer's judgement should be consistent across reviews. If review #12 called a Storybook-only component an independent atom and review #47 demanded it be wired up before merging, the reviewer drifted — or one of the two reviews was wrong. Either way, the user needs to know.

What to do

1. Pull every atom from every review (ask the user whether to cover this repo or all repos). For each, you have: behaviour, independence_technique, the verdict of the parent review, and the embedding.
2. Find near-duplicate atom pairs. Use prime_similar to find atoms whose embeddings are within a tight similarity threshold but whose reviews reached different conclusions. Concretely, look for pairs where the behaviour embeddings are close and at least one of the following holds:
   - The independence_technique differs (e.g. one was flag_off, the other was conceded_single_atom)
   - The verdicts differ given similar disruption stats
3. Look for shifts over time. For atoms grouped by similarity, sort by reviewed_at. If the older ones all use one technique and the newer ones use a different one, that's drift: either the reviewer learned something or it lost the plot. Either way, flag it.
4. Look for criteria drift. Pull all verifiable_criterion strings. Cluster them. If the same kind of behaviour ("user can reset password") got criteria as varied as "Works correctly" (bad) and "Reset email arrives within 10 seconds" (good) across different reviews, the reviewer is being inconsistent about how it demands criteria.
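
The near-duplicate check above can be illustrated locally with cosine similarity. This is a stand-in for prime_similar, not a description of how Prime works. The 0.9 threshold, the atom dictionary shape, and the function names are all assumptions, and the sketch ignores disruption stats when comparing verdicts.

```python
import math
from itertools import combinations
from typing import Any, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contradictions(
    atoms: List[Dict[str, Any]], threshold: float = 0.9
) -> List[Tuple[Any, Any, List[str]]]:
    """Return (review_id, review_id, reasons) for similar atoms judged differently."""
    found = []
    for a, b in combinations(atoms, 2):
        if a["review_id"] == b["review_id"]:
            continue  # atoms of one review share a verdict by construction
        if cosine(a["embedding"], b["embedding"]) < threshold:
            continue
        reasons = []
        if a["independence_technique"] != b["independence_technique"]:
            reasons.append("technique")
        if a["verdict"] != b["verdict"]:
            reasons.append("verdict")
        if reasons:
            found.append((a["review_id"], b["review_id"], reasons))
    return found
```

Comparing every pair is quadratic in the number of atoms, which is acceptable for a few hundred atoms; larger histories are the case for leaning on Prime's vector index.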

The output to the user

A list of named contradictions, each with the two review IDs, what they each concluded, and a brief speculation about which was right:

3 contradictions found:

(1) Reviews #12 and #47 — similar atom, different conclusions
    #12 (2026-03-14): "Add NotificationBell component (Storybook only)"
        → called this an independent atom (storybook_only)
        → verdict: APPROVE
    #47 (2026-05-02): "Add SearchBar component (Storybook only)"
        → required it be wired into a page before merging
        → verdict: REQUEST CHANGES
    Both PRs had similar shape (one new component, Storybook entry, no consumers).
    The earlier review's call was probably right; the later one is over-strict.
    The team merged both eventually with no followups.

(2) Drift in lockfile handling
    Pre-2026-04-15: lockfile churn over 1000 lines was flagged as a scope leak.
    Post-2026-04-15: it's silently ignored.
    Five reviews on either side. The threshold shifted with no rule change.

(3) Criteria quality is regressing
    Earliest 10 reviews: 8/10 atoms had specific verifiable criteria.
    Latest 10 reviews: 4/10 do; the rest say things like "feature works".
    The reviewer is hedging more. Worth tightening the rule in SKILL.md.

Then, ask the user: "Want me to draft SKILL.md edits to address these?" Don't draft them unsolicited — drift findings sometimes mean the reviewer is right and the user's prior intuition was wrong. Let them decide.

If the user says yes, propose specific str_replace-style edits to the pr-atom-reviewer/SKILL.md file — show before/after, explain which contradiction each edit addresses. Don't apply them; the user applies them.
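
A proposed edit can be shown as a before/after diff without touching the file. The sketch below uses Python's standard difflib. The function name `propose_edit` and the exactly-one-match rule are assumptions modelled on str_replace-style tools.

```python
import difflib


def propose_edit(
    text: str,
    before: str,
    after: str,
    path: str = "pr-atom-reviewer/SKILL.md",
) -> str:
    """Return a unified diff for replacing `before` with `after`; nothing is written."""
    if text.count(before) != 1:
        # An ambiguous or missing anchor would make the proposal unsafe to apply.
        raise ValueError("`before` must match exactly once")
    updated = text.replace(before, after)
    diff = difflib.unified_diff(
        text.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=path,
        tofile=path + " (proposed)",
    )
    return "".join(diff)
```

Because the function only returns a string, applying the change stays with the user.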

What this skill is NOT for

It is not for reviewing a single PR — that's pr-atom-reviewer's job. If the user asks "review my branch", redirect to that skill.
It is not for auto-updating SKILL.md — that's the user's call, always. This skill proposes, the user disposes.
It is not for "general code review retrospectives" — it's specifically about the pr-atom-reviewer skill's judgement, fed by the data it recorded.

Privacy and scope

This skill reads from Prime. Prime contains review metadata, verdicts, file paths, and atom behaviour sentences, but should NOT contain proprietary source code or diff contents (the reviewer's Step 7 enforces this). If you find code or diff content in Prime nodes, flag it to the user: "I see code content in Prime. This shouldn't be there. The reviewer skill should only be writing metadata."
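
A rough heuristic for spotting diff content in node properties is sketched below. It only recognises unified-diff markers, so it will miss pasted source code that carries no diff headers. The node shape (an id plus a properties dictionary) and the function names are hypothetical.

```python
import re
from typing import Any, Dict, Iterable, List

# Line prefixes typical of unified diffs: file headers and hunk headers.
DIFF_LINE = re.compile(
    r"^(diff --git |@@ -\d+(,\d+)? \+\d+(,\d+)? @@|\+\+\+ [ab]/|--- [ab]/)",
    re.MULTILINE,
)


def looks_like_diff(text: str) -> bool:
    """True if any line of the text starts with a unified-diff marker."""
    return DIFF_LINE.search(text) is not None


def nodes_with_diff_content(nodes: Iterable[Dict[str, Any]]) -> List[Any]:
    """Return the ids of nodes with a string property that looks like a diff."""
    flagged = []
    for node in nodes:
        values = node.get("properties", {}).values()
        if any(isinstance(v, str) and looks_like_diff(v) for v in values):
            flagged.append(node["id"])
    return flagged
```

Plain file paths and behaviour sentences do not match the markers, so expected metadata should pass untouched.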

Cadence suggestion

Run ingest-outcomes weekly (or whenever the user remembers — automated would be nicer but is out of scope here). Run cluster-vocab after every ~30 new reviews for a repo, or when joining a new repo to get a calibration baseline. Run find-drift monthly, or whenever the user feels the reviewer's calls are getting strange.

These are suggestions, not requirements. Don't push the user to run jobs they didn't ask for.