code-review-methodology - SKILL.md Agent Skill

name: code-review-methodology description: "Conduct two-stage code review: Stage 1 verifies spec compliance (criterion-to-code mapping), Stage 2 evaluates security, correctness, performance, and maintainability across 6 parallel facets with P1/P2/P3 synthesis and deduplication by file:line. Use when reviewing code changes or pull requests. This skill MUST be consulted because reviewing quality on broken logic is wasted effort, and unmet acceptance criteria must block merge." allowed-tools: Read, Bash, Grep, Glob context: fork agent: Explore

Code Review Methodology

Domain skill for structured, multi-faceted code review.

Iron Law

FIRST VERIFY IT WORKS, THEN VERIFY IT'S GOOD. Never review code quality on code that doesn't function correctly.

Spec compliance is Stage 1. Code quality is Stage 2. Reviewing style on broken logic is wasted effort.

Two-Stage Review

Stage 1 — Spec Compliance: Does the code do what the issue/acceptance criteria require? Map each criterion to implementation evidence. If Stage 1 fails, stop — no point reviewing quality on code that doesn't meet requirements.

Stage 2 — Code Quality (in priority order):

Security — vulnerabilities, auth bypass, injection, secrets
Correctness — logic errors, race conditions, edge cases
Performance — O(n^2) in hot paths, unnecessary allocations, N+1 queries
Maintainability — readability, naming, structure (lowest priority)

Do NOT flag maintainability issues if security or correctness issues exist. Fix the important things first.

6-Facet Review

Every review evaluates these facets (parallelizable):

Facet	Focus	Agent / Skill
Security	OWASP top 10, secrets, auth/authz, input validation	security-reviewer
Quality	Logic correctness, edge cases	code-reviewer
Conventions	Commit format, branch naming, PR structure, patterns	convention-checker
Tests	Coverage, quality commands pass, test quality	test-runner
Error handling	Unhandled errors, silent failures, missing edge cases in error paths	error-handler-inspector
Claim verification	Self-review claims cross-referenced against actual file state	holdout-validation (skill)

Requirements compliance is Stage 1 (Spec Compliance) of the Two-Stage Review section above, not a parallel facet — it runs first on the main thread before the 6 facets fan out.

For the Tests facet specifically, see test-review-checklist.md for a runnable checklist of coverage, quality, and integration-test signals reviewers can flag with citations.

Finding Synthesis

After all facets complete, synthesize:

Deduplicate by file:line — same location = same finding, keep highest priority
Prioritize P1 → P2 → P3
Group by file for readability
Count findings per priority level

Requirements Compliance

Map each acceptance criterion to evidence:

Status	Meaning
Met	Directly implemented and testable
Interpreted	Criterion was ambiguous, implementation reflects interpretation
Partially Met	Some aspects done, others pending
Not Addressed	Not implemented in this change

Finding Format

### P1 - Critical
| Finding | Suggested Fix |
|---------|---------------|
| **1 · security · `auth.rb:42`**<br>SQL injection via string interpolation. | Use parameterized query. |

### P2 - Should Fix
| Finding | Suggested Fix |
|---------|---------------|

### P3 - Consider
| Finding | Suggested Fix |
|---------|---------------|

Confidence Assessment

For each finding, assess confidence:

High: Verified by reading code + running test
Medium: Verified by reading code
Low: Pattern match only — needs investigation

Only P1 findings with High confidence should block merge.

Signal Quality Rules

Signal Type	Confidence	Include In Review?
Verified by running code/test	High	Always
LSP diagnostic (error/warning from language server)	High	Always — language server has full project context
LSP find-references (verified all callers handled)	High	Always for P1/P2 — semantic, not text-based
Verified by reading code path	Medium	Always for P1/P2
Pattern-match only (looks like a bug)	Low	Only if P1, flag as "needs investigation"
Style preference	N/A	Only as P3, never blocks merge

Noise filter: If a finding cannot be explained with a file:line citation and a concrete scenario where it causes harm, it is noise. Drop it.

Boy Scout Recognition

When reviewing, recognize improve: commits as legitimate Boy Scout cleanup:

APPROVE improve: commits that pass the proximity test (file already modified, self-evidently correct, <10 lines, no API change, no explanation needed)
Flag as P2 "scope creep" only if the cleanup fails the proximity test (untouched files, architecture changes, new tests required, subjective style)

Review Cycle Awareness

Check review history to understand cycle count:

Count CHANGES_REQUESTED reviews to determine cycle number
Focus on delta since last review — findings on unchanged code from prior cycles are noise
On 3rd+ cycle: only flag NEW P1 findings, note persistent P2s, suggest synchronous discussion for unresolved items
Note convergence signal if findings are shrinking each cycle — this is healthy progress

Structured Cycle Parsing

Parse FLOW_REVIEW_CYCLE and FLOW_RESOLUTION_CYCLE markers from prior PR comments to build cycle context:

REPO=$(gh repo view --json nameWithOwner --jq '.nameWithOwner')

# Parse prior review findings (from review bodies)
gh api repos/$REPO/pulls/$PR_NUM/reviews --jq '
  [.[] | select(.body | test("FLOW_REVIEW_CYCLE")) | {
    cycle: (.body | capture("FLOW_REVIEW_CYCLE:(?<n>[0-9]+)") | .n),
    findings: (.body | capture("FINDINGS:\\[(?<f>[^\\]]+)\\]") | .f)
  }]'

# Parse prior resolution outcomes (from issue comments posted via gh pr comment)
gh api repos/$REPO/issues/$PR_NUM/comments --jq '
  [.[] | select(.body | test("FLOW_RESOLUTION_CYCLE")) | {
    cycle: (.body | capture("FLOW_RESOLUTION_CYCLE:(?<n>[0-9]+)") | .n),
    resolved: (.body | capture("RESOLVED:\\[(?<r>[^\\]]*?)\\]") | .r),
    escalated: (.body | capture("ESCALATED:\\[(?<e>[^\\]]*?)\\]") | .e)
  }]'

Cross-reference for each prior finding:

Was it marked as resolved in a resolution comment?
Has the code at that location changed in git diff?
Build a Previous Feedback Status table:

### Previous Feedback Status
| Cycle | Finding | Priority | Claimed Status | Verified |
|-------|---------|----------|----------------|----------|

If a finding was claimed resolved but the code at that location is unchanged, flag it as "Not verified — code unchanged".

Review Stop Conditions

Stage 1 finds >3 unmet acceptance criteria — REQUEST_CHANGES immediately, skip Stage 2
PR modifies files unrelated to the issue — flag as out-of-context, ask for split (but improve: commits in already-modified files are NOT out-of-context)
Diff is >500 lines with no test changes — flag as P1 "untested large change"

Review Decision

Findings	Decision
P1 findings (any)	REQUEST_CHANGES
P2 findings (any)	REQUEST_CHANGES
P3 findings only	COMMENT (fix-expected — author must fix in-PR; P3 is not a free pass)
No findings	APPROVE

Note: P3 → COMMENT is NOT "approve with nits." The PR author is expected to fix every P3 in-PR. Finding triage is NEVER a valid escalation trigger (see skills/llm-operator-principles/SKILL.md and references/escalation-format.md) — reviewers should not approve PRs with unaddressed P3s, and authors should not file escalations to ask whether to fix them.

Adversarial Protocol (Agent Teams)

When agent teams are enabled, use adversarial synthesis from team-coordination skill (skills/team-coordination/SKILL.md). Reviewers work independently, share findings, challenge each other, and disputed findings escalate to human.

Rationalization Prevention

Excuse	Response
"It looks correct to me"	Looking is not verifying. Trace the data flow.
"This is just a style issue"	Then it's P3 at most. Don't flag it as P2.
"I don't have time for all 6 facets"	Then prioritize: Security > Correctness > the rest. Never skip security.
"The tests pass so the logic is fine"	Tests prove what's tested. Review proves what's not.
"This is too small to review thoroughly"	Small changes, same process. Small bugs cause big outages.