pre-push-llm-code-review - SKILL.md Agent Skill

name: Pre-Push LLM Code Review description: Pre-push code review checklist and protocol for catching systematic LLM coding failures — overcorrection, hallucinated logic, silent vulnerabilities, missing edge cases, and context-ignorant reviews. Grounded in arXiv research on LLM code review failure modes. ...

Pre-Push LLM Code Review

Systematic checklist for reviewing code before pushing. Catches failure modes specific to LLM-generated code.

Based on arXiv research: 2603.00539, 2604.16697, 2604.17014, 2604.19825, 2601.19072, 2603.18740, 2512.18020, 2511.07017.

Why This Exists

LLMs have systematic blind spots that differ from human mistakes:

They hallucinate execution traces and confidently validate buggy code (SolidCoder, 2604.19825)
They know vulnerabilities exist but still generate them — the "Format-Reliability Gap" (2604.16697)
They overcorrect correct code, flagging working implementations as broken (2603.00539)
They produce functionally correct but silently vulnerable code that static analyzers miss (2604.17014)
They generate review comments ungrounded in actual changes (HalluJudge, 2601.19072)
They are biased by surrounding context — miss vulnerabilities when code "looks right" (2603.18740)

Protocol

Run this as a structured pass over your diff before pushing. Each section is independent — complete all.

PASS 1: Execution Grounding

Goal: Don't imagine — verify. The #1 LLM failure is hallucinating correct behavior.

Every new function/method has a test or is called by one. No untested new logic.
Edge cases are tested, not assumed. LLMs skip boundary conditions (empty input, null, zero, max, off-by-one). Check: does the code handle the absence of data?
Types match at boundaries. LLMs frequently mismatch types at function boundaries (optional vs required, null vs undefined, number vs string). Trace data flow across module boundaries.
Async/await is correct. Missing await, unhandled promise rejection, race condition in concurrent operations. These pass visual inspection but fail at runtime.
Error paths are tested. LLMs generate happy paths reliably but skip error handling. Test the failure branch.

Red flag: If you're reading code and "it looks right" without running it — that's the Mental-Reality Gap. Execute it.

PASS 2: Security Surface

Goal: Catch the "Format-Reliability Gap" — code that works but is silently vulnerable.

Input validation on all external-facing surfaces. API endpoints, user inputs, file reads, environment variables. LLMs trust inputs they shouldn't.
No hardcoded secrets or keys. Check for accidental commits of API keys, tokens, connection strings.
SQL/query injection vectors. String interpolation in queries, unsanitized user input in commands.
Authentication/authorization checks. Does every protected route/mutation verify identity? LLMs frequently add routes without auth guards.
Secrets in logs and error messages. LLMs log full objects (including credentials) for "debugging." Strip PII before logging.
Dependency safety. No new dependencies without checking: is it maintained? Does it pull in transitive junk?

Red flag: "Functionally correct but vulnerable" is the LLM default state (2604.17014). If you only tested the happy path, you haven't tested security.

PASS 3: Contextual Correctness

Goal: Verify the change does what was actually requested — not what the LLM assumed.

Change matches the requirement/spec. LLMs overcorrect — they "fix" things that aren't broken and add features nobody asked for. Compare diff against the original request.
No scope creep. Every changed line traces to the stated requirement. If you can't draw a line from a changed line to the requirement, revert it.
No speculative code. "This might be useful later" = dead code now. Remove it.
Existing behavior preserved. Run the existing test suite. If anything breaks, the change is wrong regardless of how good the new code looks.
Import cleanup. LLMs add imports they don't use and forget to add imports they need. Check both directions.

Red flag: If the diff is bigger than expected, the LLM probably went beyond scope. Shrink it.

PASS 4: Structural Quality

Goal: LLMs generate structural anti-patterns that pass review but degrade over time.

No god functions. If a function exceeds ~50 lines, it's doing too much. LLMs love giant functions that "handle everything."
No duplicated logic. LLMs copy-paste similar patterns instead of extracting shared logic. Check for near-duplicate code blocks in the diff.
Naming matches intent. LLMs generate generic names (data, result, item, info). If you can't tell what a variable holds from its name, rename it.
No premature abstractions. Single-use helper functions, unnecessary wrapper classes, config objects with one key. YAGNI.
Comments explain why, not what. LLMs comment obvious code and leave complex logic uncommented. Delete noise comments, add missing why-comments.
Error messages are actionable. "Something went wrong" is useless. Errors should tell the developer/user what failed and how to fix it.

Red flag: "Clean" code that took 200 lines to do what should take 50 is a sign of LLM over-engineering.

PASS 5: Integration Points

Goal: Code works in isolation but breaks when connected.

API contracts match. If you changed a function signature, check every caller. LLMs update definitions but miss call sites.
Database schema migrations are safe. No destructive column drops without a migration path. No assuming data exists that doesn't.
Environment variable changes documented. New env vars need to be added to .env.example, deployment config, etc.
Backwards compatibility. If this is an API change, do existing consumers still work? LLMs break contracts silently.
File moves/renames don't break imports. If you moved a file, every import path everywhere needs updating.

Red flag: "It works on my machine" = untested integration. If you didn't verify the full flow, it's unverified.

Quick Mode (for small changes)

For diffs under ~50 lines, use this abbreviated checklist:

Does the change match the requirement? (no scope creep)
Is there a test covering the new behavior?
Are types correct at boundaries?
Any security surface exposed?
Do existing tests still pass?

If any answer is unclear, escalate to full protocol.

Anti-Patterns to Watch For

These are specific LLM-generated patterns that look correct but are wrong:

Pattern	Why It's Wrong	What To Do
`try { ... } catch(e) { console.log(e) }`	Swallows errors silently. User sees nothing, debugging impossible.	Handle or re-throw. At minimum, log to monitoring.
`if (data) { use(data.field) }`	Truthy check doesn't validate shape. `data = { field: undefined }` passes.	Validate specific fields or use schema validation.
Giant try/catch wrapping entire functions	Masks which operation failed. Catches unrelated errors.	Narrow catch to specific operations.
`any` type assertions / `as unknown as T`	Defeats type system. Hides real bugs.	Fix the types properly.
Optional chaining chains (`a?.b?.c?.d`)	Hides null bugs instead of fixing them. Fails silently.	Validate early, fail explicitly.
`JSON.parse(JSON.stringify(obj))` for deep clone	Loses functions, dates, undefined, circular refs.	Use structured clone or explicit mapping.
Regex for HTML/parsing	Fragile, doesn't handle edge cases, security risk.	Use a proper parser.
Comments that restate code	`// increment counter` above `counter++`	Delete. Add comments only for non-obvious logic.
`useEffect` for derived state	Re-renders on every change, race conditions.	Use `useMemo` or compute inline.
Hardcoded wait/sleep for async	Timing-dependent, flaky, slow.	Use proper async primitives (polling, events, etc.)

Execution

Before pushing:

# 1. Run existing tests
npm test  # or equivalent

# 2. Run linter/type checker
npm run lint && npm run typecheck  # or equivalent

# 3. Check diff size — if >500 lines, consider splitting
git diff --stat main

# 4. Review the actual diff with the checklist above
git diff main

# 5. Verify no secrets in diff
git diff main | grep -iE '(api_key|secret|token|password|credential)' || echo "clean"

If any checklist item fails, fix before pushing. No exceptions.

Research Sources

Paper	Key Finding	arXiv ID
SolidCoder	Mental-Reality Gap — models hallucinate execution traces	2604.19825
Surgical Repair	Format-Reliability Gap — models know vulns but still generate them	2604.16697
False Security Confidence	Functionally correct code is silently vulnerable	2604.17014
LLM Code Reviewers Overcorrect	More detailed prompts = worse judgment	2603.00539
HalluJudge	Review comments ungrounded in actual code	2601.19072
Contextual Bias in Security Review	Vulnerability detection biased by surrounding code	2603.18740
LLM Code Smells	60.5% of systems have LLM-specific anti-patterns	2512.18020
Code Review Benchmarks Survey	Missing semantic context = missing real issues	2602.13377
SGCR	Spec-grounded review improves reliability	2512.17540
RovoDev (Atlassian)	Enterprise code review at scale — context is everything	2601.01129
Triage	Code health metrics predict which model tier is needed	2604.07494