pge-review - SKILL.md Agent Skill

name: pge-review description: > Review-stage gate for changes since a fixed point. Standards axis checks repo conventions; Semantic Alignment axis checks against the originating plan/issue and user intent; Simplicity axis flags unnecessary complexity in new code; Verification checks whether evidence is strong enough to proceed. Use anytime — not limited to pge-exec. argument-hint: "<fixed-point: branch, commit, tag, or main>" allowed-tools:

Read
Write
Bash
Grep
Glob
Agent

inspired by: https://github.com/mattpocock/skills/blob/main/skills/in-progress/review/SKILL.md

PGE Review

Review-stage gate for the diff between HEAD and a fixed point:

Standards — does the code conform to this repo's documented conventions?
Semantic Alignment — does the code faithfully implement what was asked for, through the plan contract, without scope drift?
Simplicity — is the new code as simple as it can be without losing behavior?
Verification — is there enough evidence to proceed to prove-it / ship?

The standards, semantic alignment, and simplicity axes run as parallel sub-agents when the Review Depth Gate selects STANDARD or DEEP depth. For FAST reviews, main handles all review directly. Verification is aggregated by the main review at all depths.

pge-review is the Review stage in the Research → Plan → Execute → Review → Ship arc. It must return a gate route, not just a list of observations.

When review can resolve a matching .pge/tasks-<slug>/ task directory, it must write its durable output to .pge/tasks-<slug>/review.md. That task artifact is the default repair handoff back into pge-exec.

Review is responsible for checking:

diff still aligns with original user intent

The plan is the strongest implementation contract, with referenced issue files as supporting issue-local execution contracts when the plan uses issue-file shape. Research/user intent remains relevant when judging whether the plan was narrowed, expanded, or conceptually replaced during execution.

Process

1. Pin the fixed point

Use the argument as the fixed point (commit, branch, tag, main). If not provided, ask once: "Review against what?"

Capture: git diff <fixed-point>...HEAD and git log <fixed-point>..HEAD --oneline.

2. Review Depth Gate

After pinning the fixed point and before expensive source expansion or sub-agent spawning, classify the review depth. This gate uses changed files, diff shape, and obvious risk triggers — it does not require reading every alignment source up front.

Inputs:

changed files (from the diff)
diff size (file count, line count)
changed surface type (docs, implementation, contract, schema, API, config)
task/plan availability
risk triggers (see below)

Depth selection is priority ordered and conservative:

if any DEEP trigger appears -> DEEP
else if any STANDARD trigger appears -> STANDARD
else if all FAST conditions hold -> FAST
else -> STANDARD

If the risk level is unclear, upgrade rather than downgrade:

uncertain FAST vs STANDARD -> STANDARD
uncertain STANDARD vs DEEP -> DEEP

File count is only a weak hint. A one-file route/state/schema/API/manifest/skill-contract change is still DEEP.

FAST

Use only when ALL are true:

small diff, roughly 1-3 files
no public API, route/state/verdict, schema, manifest/config, artifact layout, security, persistence, or cross-module behavior
no meaningful downstream consumers
no explicit user request for deep review
no uncertainty about downstream consumers or contract impact
no plan/spec/contract source that needs semantic alignment review

Behavior:

main-thread review only, no sub-agent axes
still pin fixed point
still inspect the diff
still produce a Review Gate route
run Verification Story
if a Coherence trigger surface is touched, this is not FAST; upgrade to DEEP

STANDARD

Use for normal bounded work that is not FAST and does not hit a DEEP trigger:

multi-file but bounded change
plan/spec exists or semantic alignment matters, but the diff does not hit a DEEP trigger
one or two risk surfaces
no security or broad cross-module coupling
no route/state/schema/artifact-layout/public-API/manifest/handoff high-risk trigger

Behavior:

run only relevant axes (see Axis Selection below)
Standards axis if standards-sensitive files changed
Semantic Alignment axis if behavior/contract/spec alignment matters
Simplicity axis if implementation code changed
Coherence Pass if triggered

If there is any live plan/spec/contract source and the diff is not clearly trivial, prefer STANDARD over FAST unless the change is purely mechanical and carries no meaningful semantic alignment risk.

If Coherence Pass is triggered by a high-risk semantic surface, upgrade to DEEP. STANDARD may still run a small coherence-style check for low-risk shared behavior when producer/consumer/validator are obvious and local.

DEEP

Use when any high-risk trigger appears:

route/state/verdict vocabulary
schema or artifact layout
public API or CLI behavior
manifest/config consumed by tooling
handoff or skill contract
security/auth/data access/secrets/destructive behavior
persistence/migration/recovery semantics
cross-module shared behavior
large diff or unclear ownership

Behavior:

full axes (all sub-agents)
Coherence Pass required
stronger Verification Story
explicit producer / consumer / validator / evidence mapping
artifact output required when task directory exists

Axis Selection

Axes are conditional, not automatic:

Axis	Trigger
Standards	Standards-sensitive files, resident rules, config, manifest, style/convention surfaces
Semantic Alignment	Plan/spec/contract exists, behavior changed, acceptance/evidence changed, user intent could drift
Simplicity	Implementation code changed
Verification Story	Always (all depths)
Coherence Pass	Contract/API/route/state/schema/manifest/config/handoff/artifact/shared behavior changed

Skipped axes must be recorded with a short reason.

FAST has no sub-agent axes. If review needs a sub-agent axis, the selected depth is at least STANDARD.

Security review is layered:

deterministic patterns that can be checked mechanically belong in script/hook/CI candidates, not long prose rules
changed trust boundaries, secrets, auth, permissions, data access, external input, or destructive behavior trigger pge-review security scrutiny
broad threat modeling or recurring security workflow gaps should be reported as pge-learn candidates before adding new agents or checks

Output (recorded as the first section of review output, before Standards/Semantic Alignment/Simplicity/Verification Story):

## Review Depth
- review_depth: FAST | STANDARD | DEEP
- reason: <why this depth was selected>
- triggered_axes: <standards / semantic_alignment / simplicity / verification / coherence>
- skipped_axes: <axis + reason>
- coherence_required: yes | no

After depth classification, resolve only the alignment/standards sources needed by the selected depth and axes.

3. Identify the alignment source

Look for the originating intent/contract source in this order:

.pge/tasks-*/plan.md matching the current branch, task slug, or recent PGE work. Treat plan.md as the strongest implementation contract.
A path the user passed as argument, including a task directory or a specific plan/spec document.
Issue references in commit messages — fetch via gh when available.
Repo-local spec-like docs that clearly match the branch or task name: docs/, specs/, .scratch/, or .pge/tasks-*/research.md. Use research or handoff notes only to recover intent when no plan/spec exists.
If nothing specific is found, ask once. If no spec/intent source exists, skip the Semantic Alignment axis and say so.

Treat the selected source as the strongest available contract, not just background context. A contract may be a PGE plan, spec, issue, design document, current prompt, or other structured source that defines goal, scope, acceptance, verification, evidence, route, or downstream handoff semantics.

When the selected source is a PGE plan, extract its contract-bearing fields before review:

schema_version, source_contract_check, selected_approach, rejected_approaches, goal, non_goals, issues, target_areas, forbidden_areas, acceptance, verification, evidence_required, risks, terminal_conditions, plan_gate, stop_conditions, and route/state vocabulary.
Referenced issues/Ixxx.md files when plan.md ## issues is an Execution Index; use them as issue-local support, not as a replacement for plan-level acceptance.
Plan Engineering Review record or compact rationale when present, especially selected/rejected approach, scope drift check, issue slicing/coupling, and verification strategy.
Any cross-issue verification commands, grep scopes, evidence tables, or manual walkthrough checks.
Any templates, examples, evals, references, handoffs, or final response/output formats named by the plan or touched by the diff.

When a matching .pge/tasks-<slug>/research.md exists beside the plan, read the current research.v3 fields (goal, success_shape, scope, non_goals, constraints, relevant context, assumptions, candidate_direction, rejected_framings, open questions, conditional Implementation Friction, conditional Progressive Feasibility, and route) only as needed to detect semantic drift between original intent, plan, and diff. Do not recover review intent from obsolete Research-era field names; current research.v3, the current plan, and the current prompt outrank historical Research artifacts.

If the selected contract declares verification scope or evidence requirements, review must treat them as binding review inputs, not optional author notes. Recursive directory scopes include active calibration surfaces under that directory — templates/, examples, evals/, references/, handoffs/, and final response/output formats — unless the contract explicitly excludes them.

If the diff changes a contract-bearing surface, also inspect active calibration surfaces that could preserve old behavior: templates, examples, evals, references, handoffs, route/state/verdict definitions, and final response/output formats.

When review resolves a .pge/tasks-<slug>/ source, set:

task_dir: .pge/tasks-<slug>/
artifact_path: .pge/tasks-<slug>/review.md

Write the final review output there before the final response. This artifact is the durable repair seam for pge-exec bounded repair reruns.

4. Identify the standards sources

Collect from:

CLAUDE.md as the primary resident rules, plus AGENTS.md for repo routing/invariants when present.
CONTEXT.md, CONTEXT-MAP.md, or similar domain/context maps when present.
.pge/config/*.md, especially repo-profile.md and docs-policy.md.
docs/adr/ and focused repo docs that declare conventions or architectural decisions.
CONTRIBUTING.md, STYLE.md, STANDARDS.md, STYLEGUIDE.md, or similarly named project standards when present.
Linter/formatter configs (note but don't re-check what tooling enforces)

Do not treat broad research notes or historical handoffs as standards unless they explicitly define a current convention.

Standards review is semantic, not just lexical. When the diff changes workflow contracts, skill contracts, agent rules, routing, authority, artifact paths, or output formats, check whether the changed contract remains consistent with the current standards sources even when old terms no longer appear.

5. Review Contract

Every finding must carry a severity label:

Required — must be fixed before merge / ship
Important — likely reviewer-blocking unless explicitly deferred with rationale
Advisory — worthwhile improvement, not required
FYI — informational only

Use these labels consistently across all three axes. Avoid unlabeled findings.

Every finding must also carry confidence: high | medium | low. Confidence is about evidence strength, not severity. Low-confidence observations may be reported as FYI or parked context, but must not drive BLOCK_SHIP or NEEDS_FIX unless additional evidence raises confidence.

Correctness findings must also carry a concrete failure scenario whenever they assert runtime impact. Use a three-state verification discipline before surfacing them:

CONFIRMED — the review can name the input/state and wrong output, crash, lost behavior, or broken call path.
PLAUSIBLE — the mechanism is real but depends on realistic runtime state, configuration, timing, or environment.
REFUTED — the candidate is factually wrong, provably impossible, or already guarded by the changed code.

Only CONFIRMED and PLAUSIBLE correctness findings should be reported. Do not mark realistic boundary, concurrency, nil/undefined, falsy-zero, retry/partial-failure, regex, validation, or removed-guard bugs as speculative merely because they need a rare but reachable state. Do not report REFUTED candidates.

For implementation diffs, run a high-signal bug lane before routing. This is the final independent audit, not a substitute for pge-exec Generator/Evaluator quality gates. When routine in-contract defects reach this lane repeatedly, report that as execution-stage friction rather than broadening review scope.

Independent bug discovery — inspect changed code for compile/parse failures, clear logic errors, broken edge cases, security issues in changed trust-boundary code, and subtle regressions introduced by the diff.
Candidate validation — for every discovered bug candidate, re-check the claim against the diff, enclosing code, relevant callers/callees, and selected contract. Drop candidates that are pre-existing, linter-only, style-only, speculative, subjective, already guarded, or not reproducible from the changed code and reachable context.
Repair shaping — every surviving bug finding must include the smallest direct repair path. If the fix is small and self-contained, include a committable suggestion or precise replacement. If the fix is larger, describe the bounded repair without pretending it is a one-line patch.

This lane is stricter than ordinary advisory review. False positives erode trust; do not surface a bug unless it survives validation.

5.5 Review Gate

The review must end with one route:

BLOCK_SHIP — do not proceed to pge-challenge, PR, merge, or deploy.
NEEDS_FIX — fix bounded issues, then rerun review.
READY_FOR_CHALLENGE — review passed; proceed to pge-challenge for adversarial proof.
READY_TO_SHIP — only when the user explicitly requested review-only shipping readiness and prove-it evidence is already strong.

BLOCK_SHIP blocks shipping and challenge progression; it is not a terminal repair route. Required findings with scope: in-contract still go back to pge-exec repair review findings for <task-slug>.

Route rules:

Any Required finding → BLOCK_SHIP.
Any unresolved Important finding → NEEDS_FIX, unless explicitly deferred with rationale and owner.
Verification story weak → BLOCK_SHIP for behavior changes, NEEDS_FIX for docs-only or low-risk changes.
Semantic Alignment axis skipped because no spec/intent source exists → NEEDS_FIX unless the diff is clearly unplanned maintenance and the review states that assumption.
Standards source missing or stale → NEEDS_FIX if the diff touches workflow contracts, agent rules, build config, or shared interfaces.
Only Advisory/FYI findings with strong or partial verification → READY_FOR_CHALLENGE.
READY_TO_SHIP requires no Required/Important findings, strong verification story, and either a passed pge-challenge result or equivalent adversarial evidence in the review input.

The default successful route is READY_FOR_CHALLENGE, not READY_TO_SHIP.

Final response next is the next explicit stage recommendation, not an automatic invocation. For the default successful review route READY_FOR_CHALLENGE, output next: pge-challenge <task-slug>. Output next: ship only when the route is READY_TO_SHIP and the review input already contains passed pge-challenge evidence or equivalent adversarial proof. Do not use ship, done, or similar completion language for an ordinary passed review; Review passing means the Review gate is complete, not that Challenge/Ship are complete.

5.6 Task Artifact + Exec Repair Contract

Every review finding that could drive follow-up work must be execution-facing, not just reviewer-facing.

When task_dir is available, the task artifact must include a provenance block with the minimum repair identity fields:

source_run_id
reviewed_head
reviewed_base_ref or a resolved equivalent base commit identity
reviewed_diff_fingerprint or an equivalent stable diff identity

A task artifact is consumable for bounded repair only when that provenance block is present and all minimum fields resolve to the exact reviewed run/diff. Missing, placeholder, or partial provenance makes the artifact non-consumable for repair; fail closed and require a fresh review artifact instead of best-effort matching.

Required per-finding fields:

source: standards | semantic_alignment | simplicity | verification
severity: Required | Important | Advisory | FYI
confidence: high | medium | low
scope: in-contract | contract-change
bounded_fix: the smallest concrete bounded repair needed, or none
repair_suggestion: direct repair guidance, committable suggestion when safely self-contained, or none
evidence: exact spec/diff/verification citation supporting the finding
next_repair_path: pge-exec repair review findings for <task-slug> when scope: in-contract; route upstream to pge-plan when scope: contract-change

Scope classification rules:

in-contract: the fix stays inside the current plan contract and can be rerun as bounded repair work in pge-exec
contract-change: fixing it would change the plan contract itself — goal, scope, acceptance, target areas, verification, or non-goals

Default repair path:

Review findings go back to pge-exec as bounded repair input.
Only contract-change findings route upstream to pge-plan.

6. Spawn sub-agents (conditional on depth)

Skip this step entirely for FAST depth. For STANDARD, spawn only the axes triggered by the depth gate. For DEEP, spawn all three axes.

Standards agent brief:

Read the selected standards sources. Read the diff. Report every place the diff violates a documented current standard. Cite the standard (file + rule). Distinguish hard violations from judgement calls. Skip anything tooling enforces. Do not cite optional or historical docs as binding unless the main review identified them as current. Do not only search for changed terms; check whether changed contracts remain semantically consistent with standards sources, including ownership, routing, authority, artifact paths, and workflow invariants. Label every finding: Required / Important / Advisory / FYI. Under 400 words.

Semantic Alignment agent brief:

Read the selected plan/spec source and any adjacent research intent contract if provided. Treat the selected source's verification, evidence_required, target areas, acceptance, route/state vocabulary, templates/examples/evals/references/handoffs, and final response/output formats as review scope when they are declared by the contract or touched by the diff. Read the diff. Report: (a) plan requirements missing or partial; (b) behaviour not asked for (scope creep); (c) places where the plan no longer appears to preserve the original user/research intent; (d) evidence gaps where the diff may be correct but does not prove the contract; (e) active calibration surfaces that preserve old contract behavior; (f) route/state/verdict vocabulary mismatches between definitions, templates, handoffs, final responses, examples, and evals. Quote the plan/spec/research line for each finding. If no source exists, return "Semantic Alignment axis skipped: no spec or intent source found." Label every finding: Required / Important / Advisory / FYI. Under 400 words.

Simplicity agent brief:

Read the diff. For NEW code only (not pre-existing), flag: deep nesting (3+), long functions (50+ lines), generic names, dead code, unnecessary abstractions (single call site), over-engineered patterns (factory-for-factory, strategy-with-one-strategy), speculative flexibility (config for one value, abstract base with one impl). For each finding: file:line, signal, concrete "do this instead". Skip if simpler version would be harder to understand. Label every finding: Required / Important / Advisory / FYI. Under 400 words.

Correctness finder discipline for implementation diffs:

For each non-test hunk, read the changed hunk and the enclosing function. Ask what input, state, timing, or platform makes this code wrong. Audit deleted/replaced lines by naming the invariant or behavior they previously enforced, then check whether the new code re-establishes it. Trace changed functions to callers and relevant callees for new preconditions, return-shape changes, exceptions, or timing/order changes. Every candidate must include verification_status: CONFIRMED | PLAUSIBLE | REFUTED and a concrete failure scenario for CONFIRMED/PLAUSIBLE. Drop REFUTED candidates.

Bug validation brief:

Validate each candidate from the correctness finder before it can become a blocking finding. Confirm the bug is introduced or exposed by this diff, reachable under a concrete input/state, not already handled elsewhere, not a lint-only/style issue, and not merely a general test coverage concern. Return validated: yes | no, confidence: high | medium | low, failure_scenario, evidence, and smallest_repair. Drop validated: no candidates from the final findings.

7. Verification Story

Before finalizing the review, inspect the verification evidence behind the diff:

What tests were run?
Do those tests actually cover the changed behavior, or just exercise code paths superficially?
Is there manual verification evidence if the change affects UI / runtime behavior?
If the author/agent claims "tests pass", is that sufficient to prove the spec was implemented?

Report this as a separate section:

## Verification Story
- Evidence reviewed: <tests / commands / screenshots / none>
- Coverage judgment: strong | partial | weak
- Gaps: <what is still unproven>
- Gate impact: blocks | needs_fix | proceed

If the verification story is weak, surface it as a review finding even if the code looks plausible.

7.4 Coherence Pass (triggered)

When the diff changes a semantic contract surface, expand review from diff-local inspection to producer / consumer / validator / evidence coherence checking. This pass is bounded to the changed surface, not the whole repo.

Trigger surfaces:

semantic contracts (skill contracts, handoff schemas, artifact layouts)
route / state / verdict vocabulary
public APIs or CLI interfaces
schemas, manifests, or config consumed by other components
shared helpers or behavior with downstream consumers

When triggered, for each changed surface:

Identify the producer (what writes or defines the value/contract).
Identify the consumer(s) (what reads, executes, or depends on it).
Identify the validator (what accepts, rejects, or gates it).
Check the post-change artifact or behavior as a whole for the triggered contract — not only the changed lines.
Verify that producer output, consumer expectation, and validator acceptance still agree after the change.

Evidence rules:

Grep can support coherence evidence but cannot be the sole proof. A grep hit confirms a string exists; it does not confirm that producer, consumer, and validator still agree semantically.
The pass must inspect the post-change state of the contract, not only the diff hunks.

Findings: Coherence failures become normal review findings using the existing severity model (Required / Important / Advisory / FYI) and repair routing (in-contract / contract-change). A coherence failure where producer and validator disagree on allowed values is typically Required. A coherence failure where documentation lags behind implementation is typically Important.

Skip condition: If the diff does not touch any trigger surface, this pass does not run.

7.5 Main Cross-Contract Sweep

Before aggregating sub-agent results, main review must do one independent contract-aware sweep. This is not a validator script and not a replacement for sub-agents; it catches cross-file drift that individual axes can miss.

Check:

Standards source semantic consistency: changed contract ownership, routing, authority, artifact paths, and output formats still align with CLAUDE.md, AGENTS.md, README.md, and selected standards.
Contract-declared review scope: every plan/spec verification, evidence_required, cross-issue grep scope, or evidence table was actually reviewed or explicitly marked out of scope with rationale.
Vocabulary consistency: documented route/state/verdict names are representable in definitions, templates, handoffs, final responses, examples, and evals.
Calibration drift: active templates, examples, evals, references, and handoffs do not preserve behavior that the diff claims to replace.

Any failure found here becomes a normal review finding under standards, semantic_alignment, or verification with severity based on impact.

8. Aggregate

Present the Review Depth block first, then the axis sections. For FAST reviews, only Verification Story is present. For STANDARD/DEEP, present the triggered axes under ## Standards, ## Semantic Alignment, ## Simplicity, and ## Verification Story. Do not merge or rerank the axes — keep them separate. Record skipped axes with reasons.

When task_dir is available, write the final output to artifact_path before the final response.

End with:

## Review Depth
- review_depth: FAST | STANDARD | DEEP
- reason: <why this depth was selected>
- triggered_axes: <standards / semantic_alignment / simplicity / verification / coherence>
- skipped_axes: <axis + reason>
- coherence_required: yes | no

## Review Artifact
- task_dir: .pge/tasks-<slug>/ | not_available
- artifact_path: .pge/tasks-<slug>/review.md | not_available
- review_result: BLOCK_SHIP | NEEDS_FIX | READY_FOR_CHALLENGE | READY_TO_SHIP
- default_repair_path: pge-exec repair review findings for <task-slug> | route upstream to `pge-plan`

## Review Provenance
- source_run_id: <reviewed run_id or not_available>
- reviewed_head: <exact reviewed HEAD commit sha>
- reviewed_base_ref: <exact base ref used for review, or resolved equivalent base commit>
- reviewed_diff_fingerprint: <stable identity for the reviewed diff>

## Exec Repair Contract
| Finding ID | Source | Severity | Confidence | Scope | Bounded Fix | Repair Suggestion | Evidence | Next Repair Path |
|---|---|---|---|---|---|---|---|---|
| <id> | <standards / semantic_alignment / simplicity / verification> | <Required / Important / Advisory / FYI> | <high / medium / low> | <in-contract / contract-change> | <smallest concrete bounded repair or none> | <direct repair guidance or committable suggestion when safe> | <file:line / diff / verification citation> | <pge-exec repair review findings for <task-slug> / route upstream to `pge-plan`> |

## Review Gate
- route: BLOCK_SHIP | NEEDS_FIX | READY_FOR_CHALLENGE | READY_TO_SHIP
- reason: <one sentence>
- required_before_next: <fixes/evidence or "none">
- next: pge-exec repair review findings for <task-slug> | pge-challenge <task-slug> | ship only when route is READY_TO_SHIP | route upstream to `pge-plan`

## Summary
- Standards: <N findings> (required: X, important: Y, advisory: Z)
- Semantic Alignment: <N findings> (required: X, important: Y, advisory: Z)
- Simplicity: <N findings> (required: X, important: Y, advisory: Z)
- Verification story: strong | partial | weak
- Worst issue: <one-line description or "none">