name: code-review description: >- Multi-agent code review for Python projects. Use when reviewing local uncommitted changes, pull request diffs, or when asked to do a code review. Dispatches specialist agents (bug-hunter, security-auditor, test-coverage, code-quality, contracts, historical context) and aggregates results with confidence scoring.
Code Review
Comprehensive multi-agent code review adapted for Python codebases. Six specialist agents examine code from different angles — bugs, security, test coverage, code quality, contracts, and historical context — then a confidence-scoring phase filters out false positives before reporting.
Essential Principles
Each agent examines a different dimension. Agents run one at a time via the Task tool (Copilot does not support parallel sub-agents). Keep agents independent so their findings are not influenced by each other's results.
A review that reports 50 low-confidence nitpicks is worse than one that reports 3 high-confidence critical issues. Every finding goes through a confidence-scoring phase; only issues above threshold survive.
Check for Python-specific pitfalls: mutable default arguments, late-binding
closures, except Exception swallowing, unsafe pickle/yaml.load,
missing async/await, Pydantic validation gaps, and Django/FastAPI
framework patterns. Skip checks that don't apply to Python.
This project uses ruff for linting/formatting (line-length 120), pytest for tests, uv for dependency management, and Conventional Commits. Don't flag issues the existing toolchain already catches.
Vague advice ("consider improving error handling") wastes reviewer time. Cite the exact location, explain the consequence, and show a code fix.
When to Use
- Reviewing local uncommitted changes before committing
- Reviewing a pull request for bugs, security, and quality
- Running a pre-merge quality gate
- Investigating a specific file or module for issues
- When asked to "review", "audit", or "check" code
Example Prompts
These prompts will trigger the code-review skill:
Review a PR:
Review PR #42Do a code review of this PRRun code-review on PR #184Audit the changes in PR #42 for bugs and security issues
Review local changes:
Review my local changesCode review the uncommitted changesCheck my changes before I commit
Review specific files:
Review src/coodie/cql_builder.py for bugsSecurity audit the driver module
When NOT to Use
- Running linters or formatters — use
uv run pre-commit run --all-filesdirectly - Writing new tests — use the
test-refactoringskill or write them manually - Performance benchmarking — use the
benchmarksskill - Planning new features — use the
writing-plansskill
Review Mode Selection
What are you reviewing?
│
├─ Local uncommitted changes (git diff)
│ └─ Follow: workflows/review-local-changes.md
│
├─ A pull request (PR number given)
│ └─ Follow: workflows/review-pr.md
│
└─ Specific files (paths given)
└─ Treat as local changes review scoped to those files
Agent Architecture
Six specialist agents run independently during Phase 2 of each workflow:
| Agent | Focus | Reference |
|---|---|---|
| Bug Hunter | Runtime errors, None/null issues, race conditions, resource leaks | bug-hunter.md |
| Security Auditor | Injection, auth bypass, secrets exposure, unsafe deserialization | security-auditor.md |
| Test Coverage | Missing tests, weak assertions, untested error paths | test-coverage-reviewer.md |
| Code Quality | PEP 8, complexity, naming, SOLID, DRY, project conventions | code-quality-reviewer.md |
| Contracts | Pydantic models, type hints, API schemas, breaking changes | contracts-reviewer.md |
| Historical Context | Git blame, past bugs, hotspot files, architectural drift | historical-context-reviewer.md |
Agent Applicability
Not every agent needs to run on every review. Select based on what changed:
| Condition | Agents to Run |
|---|---|
| Any code changes | bug-hunter, security-auditor, code-quality |
| Test files changed | + test-coverage |
| Types/models/API changed | + contracts |
| Complex or high-risk changes | + historical-context |
| All (default) | All six agents |
Confidence Scoring
After agents report findings, each issue is scored on two axes:
Confidence (0–100): How certain is this a real issue?
| Score | Meaning |
|---|---|
| 0 | False positive — doesn't survive scrutiny |
| 25 | Might be real, might be false positive |
| 50 | Real issue, but minor or unlikely in practice |
| 75 | Verified real issue, will be hit in practice |
| 100 | Certain — evidence directly confirms it |
Impact (0–100): How bad is it if left unfixed?
| Score | Meaning |
|---|---|
| 0–20 | Minor code smell, no functional impact |
| 21–40 | Hurts maintainability or readability |
| 41–60 | Causes errors in edge cases or degrades performance |
| 61–80 | Breaks core features or corrupts data |
| 81–100 | Runtime crash, data loss, or security breach |
Filtering Thresholds
Higher-impact issues require less confidence to pass:
| Impact | Min Confidence | Rationale |
|---|---|---|
| 81–100 (Critical) | 50 | Critical issues warrant investigation even with moderate confidence |
| 61–80 (High) | 65 | High-impact issues need good confidence |
| 41–60 (Medium) | 75 | Medium issues need high confidence |
| 21–40 (Low-Medium) | 85 | Low-medium issues need very high confidence |
| 0–20 (Low) | 95 | Minor issues only if nearly certain |
False Positive Examples
These should be filtered out:
- Pre-existing issues in unchanged code
- Issues that ruff, ty, or pytest would catch (CI handles those)
- Pedantic nitpicks a senior engineer wouldn't flag
- Intentional functionality changes related to the broader change
- Issues silenced by
# noqa,type: ignore, or similar comments - General suggestions without specific evidence ("consider adding more tests")
Quick Reference: Python-Specific Checks
| Category | What to Look For |
|---|---|
| Mutable defaults | def f(items=[]) — use None + conditional |
| Broad except | except Exception / bare except: swallowing errors |
| Async pitfalls | Missing await, blocking calls in async functions |
| Pydantic | Missing validators, wrong model_config, mutable defaults in models |
| Type hints | Missing return types on public functions, Any overuse |
| Imports | Circular imports, unused imports, inline imports |
| Resource leaks | Files/connections not using context managers |
| Security | pickle.load, yaml.load (not safe_load), eval(), f-string SQL |
| Testing | Assertions without messages, assert True, missing parametrize IDs |
Reference Index
| File | Content |
|---|---|
| bug-hunter.md | Bug detection agent instructions — root cause analysis, Python-specific patterns |
| security-auditor.md | Security audit agent — OWASP, Python-specific vulnerabilities |
| test-coverage-reviewer.md | Test coverage agent — pytest patterns, behavioral coverage |
| code-quality-reviewer.md | Code quality agent — PEP 8, SOLID, complexity, project conventions |
| contracts-reviewer.md | Contracts agent — Pydantic, typing, API design, breaking changes |
| historical-context-reviewer.md | Historical context agent — git archaeology, hotspots, past patterns |
| Workflow | Purpose |
|---|---|
| review-local-changes.md | 3-phase workflow for reviewing uncommitted local changes |
| review-pr.md | 3-phase workflow for reviewing a pull request |
Success Criteria
A well-executed code review:
- Ran all applicable specialist agents
- Every finding includes file path, line number, and concrete fix
- Confidence scoring filtered out false positives (threshold applied)
- No issues flagged that ruff/ty/pytest would already catch
- Python-specific patterns checked (mutable defaults, async, Pydantic)
- Project conventions (AGENTS.md, pyproject.toml) respected
- Critical/High issues clearly separated from suggestions
- Output follows the structured report template