agent-review-panel

star 19

Orchestrate a multi-agent adversarial review panel where several Claude Code subagents with different perspectives independently review a piece of work, debate with each other, reach (or fail to reach) consensus, then a supreme judge renders the final verdict. Use this skill whenever the user asks for a "review panel", "multi-agent review", "adversarial review", "have agents debate this", "review with multiple perspectives", "panel review", "get different opinions on this code/plan/doc", or invokes /agent-review-panel. Also trigger when a user says things like "I want thorough feedback from different angles", "stress-test this design", "red team this", "get a second (third, fourth) opinion", "fresh eyes on this", "multiple reviewers", "devil's advocate perspective", "every angle covered", "I want agents to argue pros and cons", "independently evaluate", "critical look from security and performance angles", "high-stakes — cover every angle", or "debate the pros and cons". This skill is specifically about launc

wan-huiyan By wan-huiyan schedule Updated 6/6/2026

name: agent-review-panel description: > Orchestrate a multi-agent adversarial review panel where several Claude Code subagents with different perspectives independently review a piece of work, debate with each other, reach (or fail to reach) consensus, then a supreme judge renders the final verdict. Use this skill whenever the user asks for a "review panel", "multi-agent review", "adversarial review", "have agents debate this", "review with multiple perspectives", "panel review", "get different opinions on this code/plan/doc", or invokes /agent-review-panel. Also trigger when a user says things like "I want thorough feedback from different angles", "stress-test this design", "red team this", "get a second (third, fourth) opinion", "fresh eyes on this", "multiple reviewers", "devil's advocate perspective", "every angle covered", "I want agents to argue pros and cons", "independently evaluate", "critical look from security and performance angles", "high-stakes — cover every angle", or "debate the pros and cons". This skill is specifically about launching multiple reviewer agents with distinct personas who discuss and debate — NOT for single-reviewer code review, quick sanity checks, bug fixes, deployment tasks, addressing existing PR comments, skill improvement, peer review, code explanation, or writing tests. Supports "deep research mode" when user says "deep review", "thorough review", "research review", or passes "deep" to /agent-review-panel — adds web research for domain best practices before launching reviewers. Supports "multi-run union mode" when user says "multi-run review", "run N times and merge", "run twice", "run 3 times", or "maximum coverage review" — repeats the panel with rotated persona sets and merges results with stability scoring. Supports "data flow trace tiers" (Standard/Thorough/Exhaustive) when user says "thorough review", "exhaustive review", "trace everything", or "catch all bugs" — dedicates a pre-review phase to tracing data through critical paths and flagging composition/seam bugs.

Agent Review Panel v3.5.0

A multi-agent adversarial review system based on nine research foundations: ChatEval (ICLR 2024), AutoGen, Du et al. (ICML 2024), MachineSoM (ACL 2024), DebateLLM, DMAD (ICLR 2025), "Talk Isn't Always Cheap" (ICML 2025), CONSENSAGENT (ACL 2025), Trust or Escalate (ICLR 2025 Oral).

When NOT to Use This Skill

Do NOT trigger for these requests — they need single-agent handling or other skills:

  • Single code review ("review this function for bugs")
  • Quick sanity checks ("just a quick look before I push")
  • Bug fixes ("fix the type errors", "fix the failing test")
  • Peer review without multi-perspective signal ("peer review this doc")
  • Code explanation ("what does this code do?")
  • Deployment tasks ("deploy to staging")
  • Addressing existing feedback ("address the PR comments")
  • Skill improvement ("make this skill better") → use schliff
  • Writing tests, READMEs, or documentation
  • Asking for a single opinion ("what do you think?", "is this any good?")

The key signal is multiple independent perspectives — if the user wants one opinion, don't launch a panel.

Input

This skill takes as input one or more of: file paths to review, inline code/text in the conversation, a git diff or PR reference, or a plan/design document. It expects the user to specify (or let it auto-detect) what to review.

Dependencies

This skill depends on the Agent tool to launch parallel subagent reviewers and requires bash for context gathering (grep, file reads). All agents MUST use model: "opus". This includes VoltAgent specialist agents launched via subagent_type — always pass model: "opus" explicitly alongside subagent_type to override the agent's default model. Omitting it causes the launched agent to fall through to its own frontmatter-declared model (which may be sonnet or haiku), introducing cross-run reasoning variance. Knowledge mining reads from memory paths if they exist; if not available, it degrades gracefully — no hard dependency.

HTML report CDN dependencies (Phase 15.3 output file only): The generated review_panel_report.html loads Tailwind CSS, Chart.js, and — new in v2.15 — Prism.js from CDN for syntax highlighting in the Code Evidence sections of expandable issue cards. If the CDNs are unreachable, the HTML degrades gracefully: layout and text remain readable, charts show a placeholder, code blocks render as unstyled monospace.

Optional enhancement: When VoltAgent specialist agents are installed, the panel can use them instead of generic persona-prompted agents for stronger domain-specific reviews. See "VoltAgent Integration" section below.

This skill is scoped to multi-perspective adversarial review. For skill improvement requests, use schliff instead. For post-review plan updates, use plan-review-integrator. Supported versions: Claude Code v1.0+.

Examples

Example 1: Code review panel Input: "Do a review panel on src/auth/middleware.ts — I want multiple perspectives before merging" Output: Classifies as pure code → selects Correctness Hawk + Architecture Critic + Security Auditor + Devil's Advocate → gathers context → 4 parallel reviewers → 2 debate rounds → completeness audit → claim verification → supreme judge → writes review_panel_report.md

Example 2: Mixed content with deep research Input: "Deep review of our migration plan — it includes SQL and Terraform" Output: Classifies as mixed → adds Code Quality Auditor + Data Quality Auditor (SQL signal) + Reliability/SRE (infra signal) → runs web research for best practices → full panel → report with epistemic labels

Process Overview

Phase 1:    Setup                     → Identify work, pick personas, define criteria
Phase 2:    Data Flow Trace           → Trace critical path(s), document schemas [code only] (v2.14)
Phase 3:    Independent Review        → All reviewers evaluate in parallel (no cross-talk)
Phase 4:    Private Reflection        → Each reviewer re-reads source, rates own confidence
Phase 5:    Debate (rounds 1–3)       → Reviewers engage with each other + find new issues
Phase 6:    Round Summarization       → Distill resolved/unresolved points between rounds
Phase 7:    Blind Final               → Each reviewer gives final score independently
Phase 8:    Completeness Audit        → Dedicated agent scans for what the panel missed
Phase 9:    Verify Commands           → Run up to 5 reviewer verification commands (advisory)
Phase 10:   Claim Verification        → Verify all line-number citations against source
Phase 11:   Severity Verification     → Read actual code for every P0/P1, downgrade if overstated + web-verify external domain claims (v2.16.3)
Phase 12:   Verification Tier Assign  → Confidence draft (12a) + judge-advised refinement (12b)
Phase 13:   Targeted Verification     → Persona-matched agents dispatched per dispute point
Phase 14:   Supreme Judge             → Opus arbitrates everything including verification round
Phase 14.5: Post-Judge Verification   → Re-verify judge-introduced P0/P1 against ground truth (v3.2.0)
Phase 15:   Output Generation         → (parent) Three output files (all sequential: 15.1 → 15.2 → 15.3)
  Phase 15.1: Primary Markdown Report → Structured markdown summary (review_panel_report.md)
  Phase 15.2: Process History         → Full director's-cut log (review_panel_process.md)
  Phase 15.3: HTML Report             → Interactive dashboard (review_panel_report.html)

[Multi-Run mode (--runs N > 1): repeat Phases 2–15 with rotated personas, then:]
Phase 16:   Merge                     → Deduplicate, score stability, produce merged report (v2.14)

Phase 1: Setup

Identify the Work

The user provides: file paths, inline content, git diff/PR, or a plan/design doc. Collect full content, then run Context Gathering (below).

Classify content type (matters for persona selection):

  • Pure code — only code files
  • Pure plan/design — architecture docs, proposals, RFCs
  • Mixed — plans with code snippets, SQL, or config
  • Documentation — READMEs, guides, API docs

Review Mode Detection (v2.8)

Auto-detect review mode from content type. No user toggle.

Content Type Review Mode Behavior
Pure code Precise Every finding MUST cite a specific file, line number, or code snippet. Findings without concrete evidence are demoted to [UNVERIFIED].
Pure plan/design Exhaustive Broader risk identification allowed. Findings may reference design sections or architectural patterns without line-number evidence.
Mixed Precise for code, Exhaustive for prose Reviewers label each finding with its mode. Code findings without line citations are demoted.
Documentation Exhaustive Same as plan/design.

The detected mode is injected into Phase 3 reviewer prompts and the judge prompt. Report header states the detected mode.

Detect Content Signals

Scan work for technology-specific signals (case-insensitive, 3+ keyword threshold). See references/signals-and-checklists.md for the full detection table and domain checklists. Signal detection only fires when auto-selecting personas.

Context Gathering

Run these steps before launching reviewers for file-path reviews. Skipping is the #1 cause of incorrect [CRITICAL] recommendations.

  1. Sibling Directory Scan — From reviewed files' parent, scan for docs/, README*, CLAUDE.md, config.py, package.json, etc. Read first 50 lines of each. If files are nested, scan both immediate parent and project root.

  2. Reference Tracing — Scan for imports, config references, cross-file references in comments, SQL table references, file path strings.

  3. Safety Mechanism Discovery — Grep reviewed code + imports for: _valid, _flag, _guard, _check, _mask, <= target_date, BETWEEN, fillna, COALESCE, try/except, DELETE FROM, MERGE, WRITE_TRUNCATE, upsert, idempoten, --dry-run, duplicate, assertion. Note what each guards against. Critical: When a finding claims "X is missing", verify the claim by grepping the actual code — existing safety mechanisms are the #1 thing panels miss (v2.6 benchmark: panel claimed "non-idempotent writes" but DELETE-then-INSERT with duplicate validation already existed).

3b. Temporal Scope Verification — When the work contains ANY temporal claims (e.g., "excludes Christmas", "masks winter period", "filters out weekends", "pre-period starts after X"), verify that the exclusion applies to ALL instances across the full date range, not just the first/most-obvious one. Common failure: "excludes Christmas" via a Jan 6 start date only excludes the first Christmas — a second Christmas 12 months later may still be in the training window. This class of bug evaded 3 rounds of adversarial review (12 reviewers) in a real engagement — the user caught it 6 days later. Inject into reviewer prompts: "For any temporal exclusion claim, count how many instances of the excluded event exist in the date range and verify ALL are excluded, not just one."

3c. Codebase State Check (v2.10) — When reviewing code that lives in a git repository, determine the exact codebase state being reviewed. This prevents the panel from flagging code as "missing" when it exists on main but not in the reviewed branch/worktree.

Why this matters: In a real engagement, a 4-reviewer panel + completeness auditor unanimously flagged a class as "non-existent" — but it existed on main (merged via a PR after the worktree branched). All reviewers checked the worktree files, none checked main. The finding was confidently wrong.

Steps:

# 1. Detect if we're in a worktree
git rev-parse --is-inside-work-tree 2>/dev/null && \
WORKTREE=$(git rev-parse --show-toplevel) && \
BRANCH=$(git rev-parse --abbrev-ref HEAD)

# 2. Find the default branch (main or master)
DEFAULT_BRANCH=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^refs/remotes/origin/@@') || DEFAULT_BRANCH="main"

# 3. Find the branch point and count divergence
MERGE_BASE=$(git merge-base HEAD origin/$DEFAULT_BRANCH 2>/dev/null)
COMMITS_BEHIND=$(git rev-list --count HEAD..origin/$DEFAULT_BRANCH 2>/dev/null || echo "unknown")

# 4. List PRs/commits merged to main since branch point
git log --oneline $MERGE_BASE..origin/$DEFAULT_BRANCH 2>/dev/null | head -20

If commits_behind > 0: Include a [STALE_BRANCH] warning in the context brief listing what was merged to main since the branch point. Inject into ALL reviewer prompts: "The code under review is {N} commits behind {default_branch}. These changes were merged since: {list}. Before claiming code or features are 'missing', check whether they exist on {default_branch} via git show {default_branch}:{filepath}."

If in a git worktree specifically (detected via git worktree list): add extra emphasis — worktrees are commonly used for isolated development and are especially prone to divergence from main.

Record in Context Brief: Add a "Codebase State" section with: branch name, commits behind main, key PRs merged since branch point, worktree status.

  1. Knowledge Mining (tiered loading) — Mine local knowledge using a 3-tier approach to minimize token waste while maximizing relevant context:

    L0 — Index scan (~100 tokens each). Read only index lines and frontmatter description fields. Filter for relevance to the work under review.

    • MEMORY.md — read index lines only (each ~150 chars). Match keywords from the work's content type, domain, and technology signals.
    • ~/.claude/skills/*/SKILL.md — read only the YAML description: field (glob + grep for ^description:). Match against detected content signals.
    • CLAUDE.md — always load (small, high-authority).

    L1 — Summary scan (~500 tokens each). For L0-matched items only, read frontmatter + first paragraph to confirm relevance.

    • Memory files (feedback_*.md, project_*.md) — read first 20 lines. feedback_*.md files matching the review domain get automatic L2 promotion (past corrections are HIGHEST PRIORITY).
    • Skill files — read the description: + ## When NOT to Use section.
    • lessons.md — scan for lines matching review domain keywords.

    L2 — Full content (no limit). Only for confirmed-relevant items from L1.

    • Read the complete file for items that passed L1 relevance check.
    • Typical yield: 3-8 files at L2 out of 50+ candidates at L0.

    Deduplication: if the same insight appears in multiple sources, include only the most specific version (project > global > skill).

  2. Web Research (deep research mode only) — Triggers when user requests "deep review" or 5+ keywords from a signal group with no built-in checklist. Cap at 2 web searches. Tag findings with [WEB]. If the built-in domain checklist already covers the signal group, web research is skipped unless explicitly requested.

  3. Context Brief — Compile into structured brief with sections: Codebase State, System Documentation Found, Referenced Files, Safety Mechanisms, Knowledge Mining Results, Web Research Findings, Domain Checklist, Context Gaps.

  4. User Confirmation — If significant context gaps exist or deep research is available but not requested, ask before proceeding.

Select Personas with Agreement Intensity

If user specifies personas, use those. Otherwise select 4 from content-type sets:

For code/implementation:

  1. Correctness Hawk (30%) — Bugs, logic errors, edge cases
  2. Architecture Critic (50%) — Design patterns, coupling, extensibility
  3. Security Auditor (30%) — Vulnerabilities, injection, auth gaps
  4. Devil's Advocate (20%) — Challenges everything, proposes alternatives

For plans/designs (pure — no code):

  1. Feasibility Analyst (60%) — Technical feasibility, timeline realism
  2. Stakeholder Advocate (50%) — Business perspective, ROI
  3. Risk Assessor (30%) — Failure modes, dependencies
  4. Devil's Advocate (20%)

For mixed content (plans WITH code/SQL/config) — CRITICAL:

  1. Feasibility Analyst (60%)
  2. Code Quality Auditor (40%) — Line-by-line scrutiny of every snippet
  3. Risk Assessor (30%)
  4. Devil's Advocate (20%)

For documentation:

  1. Clarity Editor (60%)
  2. Technical Accuracy Reviewer (30%)
  3. Completeness Checker (40%)
  4. Devil's Advocate (20%)

After base selection, auto-add signal-detected personas (up to 6 total). Replace Devil's Advocate first if at cap (keep ≥1 DA if panel ≥4).

CRITICAL: If work contains ANY code/SQL/config snippets, always include Code Quality Auditor — the #1 cause of missed details in v1.

Reasoning Strategy Assignment (DMAD, ICLR 2025)

Persona Type Strategy Injection
Correctness Hawk / Code Quality Auditor Systematic enumeration "Enumerate every code path, constant, edge case."
Architecture Critic / Feasibility Analyst Backward reasoning "Start from desired outcome, trace backward."
Security Auditor / Risk Assessor Adversarial simulation "Imagine you are an attacker. How would you break this?"
Devil's Advocate Analogical reasoning "Compare to known failure patterns from similar projects."
Stakeholder Advocate / Clarity Editor First-principles "Question every assumption from scratch."
Auto-added specialists Checklist verification "Use your domain checklist. Verify each item."

Default Evaluation Criteria

Correctness, Completeness, Quality, Edge Cases (override if user specifies).

VoltAgent Integration (v2.9)

VoltAgent specialist agents (130+ across 10 families) have built-in domain expertise via their system prompts, making them stronger reviewers than generic persona-prompted agents. When available, the panel should upgrade personas to VoltAgent agents. Full catalog: github.com/VoltAgent/awesome-claude-code-subagents (a point-in-time snapshot of every agent the mapping tables below may reference is vendored at references/voltagent-catalog.json; run scripts/refresh-voltagent-catalog.sh to regenerate it and scripts/voltagent-catalog-check.sh to detect drift).

Step 1: Check availability. During Phase 1 setup, check whether VoltAgent agents are available by scanning the system-reminder agent list for any voltagent-* prefixed agents. Note which families are installed.

Step 2: Map personas to specialists. Use this mapping table. (When launching any persona via subagent_type, ALWAYS pass model: "opus". v2.14.)

Core Persona Mapping (review panel built-in personas)

Persona Primary VoltAgent Alt VoltAgent Fallback
Correctness Hawk voltagent-qa-sec:code-reviewer voltagent-qa-sec:debugger Generic + prompt
Architecture Critic voltagent-qa-sec:architect-reviewer voltagent-infra:cloud-architect Generic + prompt
Security Auditor voltagent-qa-sec:security-auditor voltagent-qa-sec:penetration-tester Generic + prompt
Code Quality Auditor voltagent-qa-sec:code-reviewer Generic + prompt
Feasibility Analyst voltagent-data-ai:data-scientist voltagent-biz:business-analyst Generic + prompt
Risk Assessor voltagent-qa-sec:chaos-engineer voltagent-domains:risk-manager Generic + prompt
Performance Specialist voltagent-qa-sec:performance-engineer voltagent-infra:sre-engineer Generic + prompt
Stakeholder Advocate voltagent-biz:product-manager voltagent-biz:business-analyst Generic + prompt
Devil's Advocate Generic + prompt (intentionally generic)
Data Quality Auditor voltagent-data-ai:data-analyst voltagent-data-ai:data-engineer Generic + prompt
Reliability/SRE voltagent-infra:sre-engineer voltagent-infra:devops-incident-responder Generic + prompt
DevOps/Infra voltagent-infra:devops-engineer voltagent-infra:platform-engineer Generic + prompt
Database Specialist voltagent-data-ai:database-optimizer voltagent-data-ai:postgres-pro Generic + prompt
Clarity Editor voltagent-dev-exp:documentation-engineer voltagent-biz:technical-writer Generic + prompt
Technical Accuracy voltagent-qa-sec:code-reviewer Generic + prompt
Completeness Checker voltagent-qa-sec:qa-expert Generic + prompt

Signal-Detected Specialist Mapping (auto-added by content signals)

When content signals trigger auto-addition of specialist reviewers, use these VoltAgent agents instead of generic personas:

Content Signal Auto-Add Persona VoltAgent subagent_type
SQL / database queries Data Quality Auditor voltagent-data-ai:database-optimizer
Terraform / IaC Infrastructure Reviewer voltagent-infra:terraform-engineer
Terragrunt Infrastructure Reviewer voltagent-infra:terragrunt-expert
Docker / containers Container Reviewer voltagent-infra:docker-expert
Kubernetes / k8s K8s Reviewer voltagent-infra:kubernetes-specialist
CI/CD / pipelines Pipeline Reviewer voltagent-infra:deployment-engineer
ML / model training ML Reviewer voltagent-data-ai:ml-engineer
LLM / prompts LLM Reviewer voltagent-data-ai:llm-architect
NLP / text processing NLP Reviewer voltagent-data-ai:nlp-engineer
React / frontend Frontend Reviewer voltagent-lang:react-specialist
TypeScript TS Reviewer voltagent-lang:typescript-pro
Python Python Reviewer voltagent-lang:python-pro
Go / Golang Go Reviewer voltagent-lang:golang-pro
Rust Rust Reviewer voltagent-lang:rust-engineer
Java / Spring Java Reviewer voltagent-lang:java-architect
.NET / C# .NET Reviewer voltagent-lang:csharp-developer
Ruby / Rails Rails Reviewer voltagent-lang:rails-expert
PHP / Laravel PHP Reviewer voltagent-lang:laravel-specialist
Swift / iOS iOS Reviewer voltagent-lang:swift-expert
Flutter / Dart Flutter Reviewer voltagent-lang:flutter-expert
GraphQL GraphQL Reviewer voltagent-core-dev:graphql-architect
WebSocket / real-time Real-time Reviewer voltagent-core-dev:websocket-engineer
Microservices Architecture Reviewer voltagent-core-dev:microservices-architect
API design API Reviewer voltagent-core-dev:api-designer
Network / DNS / routing Network Reviewer voltagent-infra:network-engineer
Azure Azure Reviewer voltagent-infra:azure-infra-engineer
Active Directory AD Security Reviewer voltagent-qa-sec:ad-security-reviewer
PowerShell PowerShell Reviewer voltagent-qa-sec:powershell-security-hardening
Compliance / GDPR / SOC2 Compliance Reviewer voltagent-qa-sec:compliance-auditor
Accessibility / a11y Accessibility Reviewer voltagent-qa-sec:accessibility-tester
Error handling / logging Error Reviewer voltagent-qa-sec:error-detective
Test automation Test Reviewer voltagent-qa-sec:test-automator
Blockchain / Web3 Blockchain Reviewer voltagent-domains:blockchain-developer
Payment / fintech Fintech Reviewer voltagent-domains:fintech-engineer
IoT / embedded Embedded Reviewer voltagent-domains:embedded-systems
SEO SEO Reviewer voltagent-domains:seo-specialist
Quant / financial models Quant Reviewer voltagent-domains:quant-analyst
Vue / Nuxt Vue Reviewer voltagent-lang:vue-expert
Angular Angular Reviewer voltagent-lang:angular-architect
Next.js Next.js Reviewer voltagent-lang:nextjs-developer
React Native / Expo Mobile Reviewer voltagent-lang:expo-react-native-expert
Electron / desktop apps Desktop Reviewer voltagent-core-dev:electron-pro
Django Django Reviewer voltagent-lang:django-developer
FastAPI FastAPI Reviewer voltagent-lang:fastapi-developer
Spring Boot Spring Boot Reviewer voltagent-lang:spring-boot-engineer
Symfony Symfony Reviewer voltagent-lang:symfony-specialist
C / C++ C++ Reviewer voltagent-lang:cpp-pro
Kotlin Kotlin Reviewer voltagent-lang:kotlin-specialist
Elixir / Phoenix Elixir Reviewer voltagent-lang:elixir-expert
MLOps / model deployment MLOps Reviewer voltagent-data-ai:mlops-engineer
Reinforcement learning RL Reviewer voltagent-data-ai:reinforcement-learning-engineer
Prompt optimization / evals Prompt Reviewer voltagent-data-ai:prompt-engineer
AI systems / agentic apps AI Systems Reviewer voltagent-data-ai:ai-engineer
Database admin / replication / HA DBA Reviewer voltagent-infra:database-administrator
Incident response / outages Incident Reviewer voltagent-infra:incident-responder
Windows Server / IIS Windows Reviewer voltagent-infra:windows-infra-admin
Cloud security / secrets mgmt Cloud Security Reviewer voltagent-infra:security-engineer
CLI tools / TUIs CLI Reviewer voltagent-dev-exp:cli-developer
MCP servers / tools MCP Reviewer voltagent-dev-exp:mcp-developer
Refactoring / legacy modernization Refactoring Reviewer voltagent-dev-exp:refactoring-specialist
Build systems / bundlers Build Reviewer voltagent-dev-exp:build-engineer
Dependencies / supply chain Dependency Reviewer voltagent-dev-exp:dependency-manager
Git workflows / branching Git Workflow Reviewer voltagent-dev-exp:git-workflow-manager
Game dev / engines Game Reviewer voltagent-domains:game-developer
API docs / OpenAPI API Docs Reviewer voltagent-domains:api-documenter
Legal / licensing / contracts Legal Reviewer voltagent-biz:legal-advisor
UX research / usability UX Research Reviewer voltagent-biz:ux-researcher

Multi-Agent Orchestration Mapping (for pre/post-panel phases)

All launches below MUST pass model: "opus" explicitly (v2.14).

Review Phase VoltAgent subagent_type Use When
Data Flow Trace (Phase 2) voltagent-data-ai:data-engineer, model: "opus" Trace data paths, document schemas at boundaries (v2.14)
Completeness Audit (Phase 8) voltagent-meta:knowledge-synthesizer, model: "opus" Synthesize what the panel missed
Claim Verification (Phase 10) voltagent-qa-sec:code-reviewer, model: "opus" Verify line-number citations
Severity Verification (Phase 11) voltagent-qa-sec:debugger, model: "opus" Read actual code for P0/P1 findings
Tier Refinement Advisor (Phase 12b) Generic, model: "opus" (must be domain-neutral to refine tiers)
Verification Agents (Phase 13) Persona-matched — see Phase 13 table, model: "opus" Each agent matched to claim type
Supreme Judge (Phase 14) Generic, model: "opus" (judge must be domain-neutral)
Judge-Output Verifier (Phase 14.5) Generic, model: "opus" Re-verifies judge-introduced P0/P1 against ground truth via grep/Read/git (v3.2.0)
HTML Report Agent (Phase 15.3) voltagent-lang:javascript-pro, model: "opus" Generate interactive HTML dashboard with expandable issue cards (v2.15). Reads from disk: Phase 15.1 report + Phase 15.2 process history + rendering spec from prompt-templates.md (v2.16.4). Loads Tailwind, Chart.js, and Prism.js via CDN.
Merge Agent (Phase 16) voltagent-meta:knowledge-synthesizer, model: "opus" Deduplicate + score stability in multi-run mode (v2.14)

Step 3: Suggest installation when beneficial. If a selected persona would benefit from a VoltAgent agent but the agent family is not available, suggest installation to the user:

"This review would benefit from VoltAgent specialist agents for deeper domain-specific analysis. You can install the relevant families with:

Quick install (CLI): claude plugin install voltagent-qa-sec — security, code review, testing claude plugin install voltagent-data-ai — data science, ML, databases claude plugin install voltagent-infra — DevOps, cloud, Terraform claude plugin install voltagent-lang — language specialists (TS, Python, Go, Rust) claude plugin install voltagent-biz — product, business analysis claude plugin install voltagent-domains — fintech, blockchain, IoT

Or browse via marketplace: /plugin marketplace add VoltAgent/awesome-claude-code-subagents then /plugin install <name>@voltagent-subagents

Continue without them? They're optional — the review will still work with generic persona-prompted agents."

Only suggest installation once per session. List only the families relevant to the detected content signals, not all 10. If the user declines or the agents aren't available, proceed with the generic fallback silently.

Step 4: Launch with subagent_type AND model: "opus". When launching Phase 3 agents:

  • subagent_type: "voltagent-qa-sec:code-reviewer", model: "opus" (when available)
  • Omit subagent_type, pass model: "opus" explicitly (generic agent fallback)

CRITICAL (v2.14): ALWAYS pass model: "opus" even when using subagent_type. VoltAgent agents may declare their own default model (sonnet, haiku) in their frontmatter. Without an explicit override, the panel silently runs on mixed models, producing different reasoning depths across runs. The VoltAgent agent's value lives in its system prompt and tool access, NOT its default model. Forcing opus preserves the domain expertise while guaranteeing consistent reasoning depth. This fix resolves an invisible source of cross-run variance documented in the v2.10→v2.14 consistency analysis.

The persona prompt is STILL included even when using VoltAgent agents — it provides the review-panel-specific context (agreement intensity, reasoning strategy, evaluation criteria) that the VoltAgent agent doesn't have natively.


Live-State Claim Discipline (v3.3.0)

Resolves #40. A panel reading source code can verify what a script or manifest WILL do if executed — it cannot verify what production infrastructure IS doing right now. Conflating the two produced a false-positive P0 "IAM/IAP divergence" finding that survived all 5 reviewers, 3 debate rounds, and the Supreme Judge: the agents read echo "gcloud ... --role=..." lines in two deploy scripts (operator-facing documentation printed to the terminal at deploy-completion time) as if they were the live IAM bindings of the deployed service. A single gcloud run services describe would have falsified it in 30 seconds.

This discipline applies to any finding that asserts a fact about live state — deployed IAM/IAP/auth config, a running cron schedule, a BigQuery table's partition key, a production env var, a load balancer's routing. It is NOT limited to security. It is injected into Phase 3 reviewer prompts, the Phase 5 debate prompt, Phase 11 severity verification, and the Phase 14 judge prompt.

Rule 1 — Declarative vs. imperative vs. documentation

Reviewers must distinguish three things a source file can contain:

Category Example What it proves
Declarative config gcloud run deploy ... --no-allow-unauthenticated, a Terraform resource, a YAML manifest The deploy WILL create this config if run — still not proof it WAS run
Imperative documentation echo " gcloud beta run services add-iam-policy-binding ...", a comment, a README snippet, a printed "next steps" blurb A human is being TOLD to run this later. The script itself does not.
Live state output of gcloud ... describe / ... get-iam-policy, bq show, aws ... describe-*, kubectl get, crontab -l What production actually is right now

Lines inside echo "...", print(...), comments, heredocs echoed to a terminal, or string literals in "usage" / "next steps" blocks are documentation, not configuration. They are never evidence for a live-state claim. Configuration claims must come from declarative deploy flags / manifests; live-state claims must come from live describe-class output.

Rule 2 — Live-state claims need live evidence

Every finding that asserts a live-infrastructure or runtime-state fact carries one of two epistemic tags:

  • [LIVE-VERIFIED] — backed by output from a live-state command (gcloud ... describe / ... get-iam-policy, bq show, aws ... describe-*, kubectl get, crontab -l, etc.) that the panel actually ran or that the user supplied.
  • [STATIC-INFERENCE] — inferred from source code, config files, or deploy scripts only. The panel did not (or could not) observe live state.

A [STATIC-INFERENCE] live-state claim is capped at P1, no matter how many reviewers cite it. P0 ("block the demo") requires [LIVE-VERIFIED] — or the finding must be reworded as "the deploy script would configure X" (a [PLAN_RISK], not an [EXISTING_DEFECT]). When the panel lacks the tools to obtain live evidence, the finding must say so explicitly rather than inferring.

Rule 3 — Consensus does not compound on a shared artifact

When 2+ reviewers reach the same conclusion by reading the same source lines, that is consensus on an interpretation, not independent verification of a fact. It must not be promoted to [VERIFIED] or used to justify P0. Phase 6 (Sycophancy Detection) flags this pattern; the judge tags it [STATIC-INFERENCE-CONSENSUS] and requires independent live verification before any P0 promotion. Cross-citation chains (Security F3 → Architecture F4 → DA CF2) over the same artifact lines are a single source, not three.

Rule 4 — Pre-promotion falsification check

Before any finding is promoted to P0 — in debate (Phase 5) or by the judge (Phase 14) — answer two questions:

  1. What single observation would prove this finding wrong?
  2. Is that observation cheap to obtain?

If a P0 can be falsified by one read-only command (a describe, a show, a grep) and no agent ran it, the finding is at most P1 until verified. Record the falsification test alongside the finding.


Phase 2: Data Flow Trace (v2.14)

A dedicated agent traces data through the critical path(s) of the work BEFORE reviewers begin, producing a structured Data Flow Map. This phase specifically targets composition defects — bugs where two individually- correct functions produce incorrect results together. These bugs are structurally invisible to reviewers who read each function in isolation.

Research foundations: Meta semi-formal certificate prompting (2026, 78%→93% accuracy), LLMDFA (NeurIPS 2024, 87% precision), RepoAudit (ICML 2025, 78% precision with demand-driven exploration), BugLens (ASE 2025, 7x false positive reduction), ZeroFalse (2025, F1 0.955).

Skip Conditions

  • Pure plans/design (no code)
  • Pure documentation (no code)
  • Code with no detectable data transforms (pure API routing, static config, declarative-only files)

When Phase 2 is skipped, note the reason in the Context Brief and the report header. Proceed directly to Phase 3.

Tier System

Three tiers, user-selectable via --trace {tier} or natural language:

Tier Trigger Phrases Paths Traced Overhead Token Budget
Standard (default) no modifier, "review" Single most important path ~5 min ~8k
Thorough "thorough review", "thorough trace", --trace thorough Top 3 paths + transform completeness ~15 min ~20k
Exhaustive "exhaustive review", "trace everything", "catch all bugs", --trace exhaustive ALL paths from every entry point No limit No limit

Tier detection priority:

  1. Explicit --trace {tier} flag
  2. Natural language keywords in user's original prompt
  3. Default: Standard

"Deep review" (which triggers web research) combines with Standard trace unless the user also specifies a trace tier.

Critical Path Identification (orchestrator, not subagent)

Before launching the Data Flow Tracer, the orchestrator identifies entry points and ranks them by data complexity:

  1. Find entry points. Scan for structural markers:

    • Web frameworks: @app.route, @router.get/post, @api_view, Django CBVs
    • CLI: @click.command, @app.command (Typer), if __name__ == "__main__":, argparse
    • Background: @app.task (Celery), AWS lambda_handler, Kafka/SQS consumers
    • Scripts: main(), top-level script execution
  2. Rank by data complexity. Count on each path:

    • Number of function calls
    • Number of data transforms (map/filter/reduce/apply/merge/join/groupby/pivot)
    • Number of I/O boundaries (DB, HTTP, file, queue)
    • Presence of transform/back-transform pairs
  3. Select paths per tier:

    • Standard: top-ranked path only
    • Thorough: top 3 paths
    • Exhaustive: all paths

The Data Flow Tracer Agent

Single agent (model: "opus"). VoltAgent mapping: voltagent-data-ai:data-engineer primary, voltagent-qa-sec:code-reviewer fallback. Always pass model: "opus" even when using subagent_type.

Uses the semi-formal certificate approach from Meta's 2026 agentic code reasoning research. At each function boundary on the critical path, the agent produces a certificate:

FUNCTION: {name} ({file}:{line})
INPUT_SCHEMA:
  - parameter types (declared or inferred)
  - known constraints at call site
  - which parameters are externally controlled
TRANSFORM:
  - what the function does
  - key assignments and branches
  - external calls (I/O, DB, modules)
OUTPUT_SCHEMA:
  - return type
  - tainted/derived fields
  - guaranteed invariants
COMPOSITION_CHECK: (vs next function)
  - Does OUTPUT_SCHEMA satisfy next INPUT_SCHEMA?
  - Fields required but not guaranteed?
  - Tainted fields reaching sensitive parameters?
INVARIANT_STATUS:
  - preserved or violated invariants
  - violations flagged as P0 candidates

See references/prompt-templates.md for the full Phase 2 Data Flow Tracer prompt.

Mandatory Invariant Checks (at every boundary)

  1. Schema preservation — output schema matches next function's expected input
  2. Transform/back-transform completeness — list forward transforms (log, encode, serialize) and back-transforms (exp, decode, deserialize). Any field in forward but not back is a P0 candidate. See the Transform/Back- Transform Completeness checklist in references/signals-and-checklists.md.
  3. Row count stability — joins/merges/reindex/groupby should not silently add or remove rows
  4. Null semanticsfillna(0) does not destroy meaningful missingness
  5. Temporal consistency — date filters applied to all date columns; ALL instances of an excluded event (e.g., BOTH Christmases) handled

Output and Integration with Phase 3

The Data Flow Tracer produces a Data Flow Map containing:

  • List of paths traced
  • Per-function certificates
  • Invariant violations table (P0 candidates)
  • Transform completeness table
  • Clean paths (where all invariants hold)

Integration with Phase 3: The Data Flow Map is injected into every reviewer's Phase 3 prompt as dedicated context. Invariant violations are flagged as P0 candidates; reviewers must either validate them (agree they're real P0s) or explicitly challenge them with reasoning. Reviewers are NOT required to agree with the tracer — this is an additional input, not a mandate.

When no violations are found, reviewers receive a short "clean trace" confirmation instead.


Phase 3: Independent Review (Round 0)

Launch ALL reviewer agents in parallel using Agent tool with model: "opus". When VoltAgent integration is active, use subagent_type from the mapping table. Each gets the structured prompt from references/prompt-templates.md (Phase 3 template) with their persona, agreement intensity, reasoning strategy, context brief, and the full work content inside injection boundaries. The prompt also carries the Live-State Claim Discipline (Rules 1–2): reviewers must tag every live-infrastructure/runtime-state claim [LIVE-VERIFIED] or [STATIC-INFERENCE] and must not treat echo/comment/usage-blurb lines as configuration.

Collect all N independent reviews.

Output (v3.1.0+): Each reviewer subagent writes its full review to state/reviewer_<name>_phase_3.md and returns only the path + a 100-word summary. The orchestrator does NOT hold verbatim reviews in its window.


Phase 4: Private Reflection

Launch all reviewers in parallel, each receiving ONLY their own review. They re-read source, rate confidence per finding (High/Medium/Low), note new issues, identify most/least defensible findings. See references/prompt-templates.md.

Output (v3.1.0+): Each reviewer's reflection is written to state/reviewer_<name>_phase_4.md. Subagent returns only path + 100-word summary.


Phase 5: Debate (Rounds 1-3, adaptive)

Launch all reviewers in parallel each round. Each receives their own review

  • reflection, all others' feedback, and unresolved points from previous round.

Output (v3.1.0+): Each reviewer's per-round debate response is written to state/reviewer_<name>_phase_5_round<R>.md (R = 1, 2, or 3). Round 1 is mandatory; rounds 2 and 3 follow the existing convergence-based skip rules. Subagent returns only path + 100-word summary.

Pre-promotion falsification check (v3.3.0). Before any finding is promoted to — or kept at — P0 in any debate round, the reviewer must state the single observation that would falsify it and whether that observation is cheap to obtain (see Live-State Claim Discipline Rule 4). A P0 that one read-only command could falsify, with no agent having run it, is capped at P1 until verified.

Phase 6: Round Summarization

After each round, summarize (no agent needed):

  • Resolved this round — who agreed, what convinced them
  • Still in dispute — with inlined source excerpts (max 10 lines per dispute, first 5 + last 5 if longer; max 3 disputes). If a reviewer's claim cannot be traced to a specific source location, tag [source not cited by reviewer].
  • New discoveries — from which reviewer

Sycophancy Detection (CONSENSAGENT)

Count position changes toward majority. If >50% lack new evidence → inject sycophancy alert into next round prompt for all reviewers.

Shared-artifact consensus (v3.3.0). Also flag when 2+ reviewers agree on a claim by reading the same source lines without independent verification — including cross-citation chains where each reviewer cites the previous one's finding rather than the source. This is consensus on an interpretation, not on a fact: it does not compound to [VERIFIED] and must not justify P0. Tag such points [STATIC-INFERENCE-CONSENSUS] (Live-State Claim Discipline Rule 3) and route them to verification before any P0 promotion.

Convergence Check

  • All disputes minor/stylistic → stop
  • Substantive disagreements remain → continue
  • New discoveries still emerging → continue
  • Maximum 3 rounds regardless

Phase 7: Blind Final Assessment

Launch all reviewers one final time in parallel. Each gives final score, top 3 points, recommendation, one-line verdict. Others do NOT see these.

Output (v3.1.0+): Each reviewer's blind final is written to state/reviewer_<name>_phase_7.md. Subagent returns only path + 100-word summary of new findings.


Phase 8: Completeness Audit

Single agent (model: "opus") hunts for what the entire panel missed. Does NOT evaluate quality — only finds overlooked details, edge cases, constants, code. See references/prompt-templates.md for full prompt.

Mandatory audit checks (in addition to general completeness):

  • Temporal scope verification: For every claim that excludes, filters, or masks a time period, count all instances in the full date range and verify each is handled. Example: "excludes Christmas" with 2 years of data must exclude BOTH Christmases. This is the #1 class of bug that reviewers miss because they focus on the method, not the temporal arithmetic.

Output (v3.1.0+): Subagent writes full output to state/phase_8_audit.md and returns only path + 100-word summary.


Phase 9: Verification Command Execution (v2.8)

Run up to 5 reviewer verification_command entries for P0/P1 findings (P0 first). Validate read-only (grep/cat/head/tail/wc only), execute via Bash, annotate: [CMD_CONFIRMED], [CMD_CONTRADICTED] (demote 1 level), [CMD_INCONCLUSIVE], [CMD_FAILED]. Advisory, not gating — demotes but does not delete. Skip this phase if no verification commands were provided.


Phase 10: Claim Verification

Single agent (model: "opus") checks all reviewer citations against source. Classifies each as [VERIFIED], [INACCURATE], [MISATTRIBUTED], [HALLUCINATED], or [UNVERIFIABLE]. Results feed into judge prompt.

Output (v3.1.0+): Subagent writes full output to state/phase_10_claim_verification.md and returns only path + 100-word summary.


Phase 11: Severity Verification (v2.7)

Single agent (model: "opus") that reads the actual codebase to verify every P0 and P1 finding before the judge sees them. This phase exists because panels systematically overstate severity when they lack runtime context (v2.6 benchmark: 2/3 P0 findings were overstated after code investigation).

For each P0/P1 finding, the agent must:

  1. Classify as [EXISTING_DEFECT] or [PLAN_RISK]
    • [EXISTING_DEFECT]: The bug exists in the current running code right now
    • [PLAN_RISK]: The risk would only materialise if the plan is implemented as written
    • P0 severity requires [EXISTING_DEFECT]. A [PLAN_RISK] is at most P1.

1b. Live-state claim classification (v3.3.0) — If the finding asserts a fact about live infrastructure or runtime state (deployed IAM/IAP/auth config, a running cron schedule, a production env var, a BigQuery partition key, a load balancer's routing), apply the Live-State Claim Discipline:

  • Tag [LIVE-VERIFIED] only if backed by live describe-class command output the panel ran or the user supplied; otherwise tag [STATIC-INFERENCE].
  • A [STATIC-INFERENCE] live-state claim is capped at P1 regardless of reviewer count. Do NOT let it stay P0 on consensus alone.
  • Reject echo/comment/usage-blurb lines as evidence — those are documentation, not configuration. A claim resting only on such lines is [STATIC-INFERENCE] at best and is frequently [INACCURATE].
  1. Verify the claim against actual code

    • If the finding says "X is missing", grep for X in the actual codebase
    • If the finding says "X pattern is wrong", read the referenced code and check
    • If the finding cites a specific file/line, read that file and verify
    • If no reviewer cited a specific line number, flag as [UNCITED]
  2. Check for existing safety mechanisms

    • Grep for DELETE, MERGE, upsert, idempotent, dry-run, duplicate, assertion patterns near the referenced code
    • A finding about "missing safety" is invalid if the safety exists but the reviewer didn't look for it
  3. Output a severity verification table:

| Finding | Panel Severity | Verified? | Actual Severity | Reason |
|---------|---------------|-----------|-----------------|--------|
| ...     | P0            | No        | Not a bug       | Grep found no bf/af COALESCE pattern |
| ...     | P0            | Partial   | P1              | DELETE-then-INSERT already exists |
  1. External domain claim detection and web verification (v2.16.3)

    Why this exists: Consensus P0 findings that depend on external domain knowledge bypass the Phase 12/13 dispute-verification pipeline entirely (because there is no dispute to trigger it). But all reviewers can be wrong the same way — shared model bias or shared domain knowledge gaps. In a real engagement (PUMA GA4 audit, 2026-04-09), all 4 reviewers unanimously flagged "50 months = GA4 360" as P0 without verifying whether 50 months is even a valid GA4 setting. The claim happened to be correct, but the panel had no mechanism to verify it. If the source data had been wrong, the panel would have confidently presented an incorrect P0.

    For each P0/P1 finding, classify whether it depends on external knowledge:

    • External domain claim: The finding's validity depends on facts outside the reviewed codebase — product feature limits, API behavior, regulatory jurisdiction, pricing tiers, platform capabilities, protocol specifications, third-party documentation. Examples: "50 months retention means GA4 360", "GDPR applies to Mexico", "this API rate-limits at 100 req/s."
    • Internal claim: The finding is fully verifiable from the reviewed code, config, or documentation. No external knowledge needed.

    For each finding classified as external domain claim:

    • Run a web search to verify the specific factual premise (cap: 2 searches per claim, 5 claims max per review)
    • Tag result: [WEB-VERIFIED] (confirmed by authoritative source), [WEB-CONTRADICTED] (external source disagrees — demote severity by 1 level), [WEB-INCONCLUSIVE] (no authoritative source found — flag for judge)
    • Include the source URL and key quote in the verification table
    • Regulatory/jurisdiction claims (e.g., "GDPR applies to X country") are ALWAYS classified as external domain claims

    Extended severity verification table:

    | Finding | Severity | Domain Type | Web Result | Source | Adjusted Severity |
    |---------|----------|-------------|------------|--------|-------------------|
    | ...     | P0       | External    | [WEB-VERIFIED] | support.google.com/... | P0 (confirmed) |
    | ...     | P1       | External    | [WEB-CONTRADICTED] | gdpr.eu/... | P2 (demoted) |
    | ...     | P0       | Internal    | N/A        | N/A    | P0 (code-verified) |
    

    Skip condition: If all P0/P1 findings are internal claims (fully verifiable from the reviewed content), skip web verification.

Results feed into the Supreme Judge prompt. The judge MUST reference the verification table when ruling on disagreements.

Output (v3.1.0+): Subagent writes full output to state/phase_11_severity_verification.md and returns only path + 100-word summary.


Phase 12: Verification Tier Assignment (v2.11)

After Phases 8–11, collect all unresolved dispute points from Phase 6 summaries plus any high-uncertainty action items bearing [SINGLE-SOURCE], [DISPUTED], or [UNVERIFIED] labels. Each point is assigned a depth tier that controls the verification agent's budget and capabilities in Phase 13.

Skip condition: If there are zero unresolved disputes and zero unverified action items, skip Phases 12 and 13 entirely.

Tier Definitions

Tier Budget Capabilities When to Use Example
Light ~2k tokens grep/read only, no web search Factual claim checkable in a single file or constant lookup "Reviewer A claims the threshold constant is 0.05 but the report says 0.5 — check the code."
Standard ~8k tokens Multi-file reads, import tracing, static analysis Claim requires following logic across files or comparing multiple outputs "Two reviewers disagree on whether the rate-limiter handles concurrent requests — trace the implementation across its dependencies."
Deep ~32k tokens Web search, multi-round reasoning Requires external knowledge, novel domain, or fundamental disagreement unresolvable from code alone "Security reviewer claims the PRNG is cryptographically weak for this use case — requires researching current best practices for the specific algorithm."

Assignment Pipeline (default: both steps; quick mode: step 1 only)

Tier assignment runs as a two-step pipeline. Step 1 is always fast; step 2 (the judge refinement) is the default but can be skipped by requesting "quick tier assignment" or "confidence-based tiers only".

Step 1 — Confidence-Based Draft (always runs; no agent needed):

The orchestrator derives initial tier assignments from existing Phase 4 confidence ratings and debate round signals:

  • Deep: Any reviewer rated the claim Low confidence in Phase 4, OR the point remained unresolved across 2+ debate rounds, OR the claim requires external or runtime knowledge (e.g., production behavior, third-party API semantics, literature validation)
  • Standard: Any reviewer rated Medium/mixed confidence, OR unresolved for exactly 1 debate round, OR claim requires cross-file logic tracing
  • Light: All reviewers rated the claim High confidence AND it is a simple checkable fact (file exists, value matches, line present)

Produces a draft tier table:

| Point # | Summary | Draft Tier | Signal (confidence ratings + rounds unresolved) |
|---------|---------|------------|------------------------------------------------|

Step 2 — Judge-Advised Refinement (default: on):

A single Opus agent (Phase 12b) receives the confidence-based draft table and all supporting context (context brief, Phase 6 summaries, Phase 7 blind finals, completeness audit, claim and severity verification results). Its job is to review and refine the draft — upgrade, downgrade, or confirm each tier with reasoning. It also assigns the verification persona per point.

The advisor works from the draft rather than from scratch: the confidence ratings give it the "ground-level" signal from reviewers who lived through the debate, and the advisor's role is oversight and correction, not cold assessment from zero.

Final tier table:

| Point # | Summary | Draft Tier | Final Tier | Override Reason | Suggested Persona |
|---------|---------|------------|------------|-----------------|-------------------|

Phase 13: Targeted Verification Agents (v2.11)

Dispatch one verification agent per collected dispute/action item. All Light and Standard agents launch in parallel; Deep agents can also parallelize unless they share a scarce resource (e.g., web search rate limits).

Persona Matching

Classify each claim's type and select the matching verification persona. VoltAgent agents are preferred when available; fall back to generic + focused prompt.

Claim Type Verification Persona VoltAgent (preferred)
Statistical / numerical Data Scientist voltagent-data-ai:data-scientist
Code correctness / logic Code Reviewer voltagent-qa-sec:code-reviewer
Architecture / design Architect Reviewer voltagent-qa-sec:architect-reviewer
Security vulnerability Security Auditor voltagent-qa-sec:security-auditor
Performance / scalability Performance Engineer voltagent-qa-sec:performance-engineer
Database / SQL Database Expert voltagent-data-ai:database-optimizer
Infrastructure / ops SRE voltagent-infra:sre-engineer
Framing / narrative Domain expert Generic + domain context
Business logic / feasibility Business Analyst Generic + business context
Default / unclear Verification Agent Generic + focused prompt

Capability Limits by Tier

  • Light (~2k tokens): May only grep/read/head/tail. Single focused query. Return one of [VR_CONFIRMED], [VR_REFUTED], [VR_INCONCLUSIVE] with one piece of quoted evidence. Do not expand scope beyond the specific claim.
  • Standard (~8k tokens): May read multiple files, trace imports, run static analysis commands. Return verdict with supporting evidence from multiple sources. Explore adjacent code only if directly relevant to the dispute.
  • Deep (~32k tokens): Full agent capabilities including web search and multiple reasoning rounds. Return a comprehensive verdict; cite external sources when they resolve the dispute. Scope limited to the specific dispute — do not produce a second full review.

Verdict Labels

  • [VR_CONFIRMED] — Evidence confirms the original claim
  • [VR_REFUTED] — Evidence contradicts the claim
  • [VR_PARTIAL] — Claim is partially supported; the agent qualifies what holds
  • [VR_INCONCLUSIVE] — Insufficient evidence to verify either way
  • [VR_NEW_FINDING] — Verification revealed an additional issue beyond the dispute

Verification Round Summary

After all agents complete, compile into a summary table:

| Point | Tier | Persona | Verdict | Key Evidence |
|-------|------|---------|---------|--------------|

This table is passed to Phase 14 as input item 8.


Phase 13.5: Pre-Judge Verification Gate (v3.1.0)

Before launching the Supreme Judge (Phase 14), the orchestrator MUST verify that all mandatory phase outputs exist on disk. This gate is the load-bearing guardrail against silent compression of Phases 4 / 5 / 7.

Gate logic (orchestrator-executed, no subagent dispatch):

For each reviewer in the panel, verify these files exist under state/ (or state/run_<N>/ in multi-run mode):

Required file Phase Mandatory
reviewer_<name>_phase_3.md Independent review Always
reviewer_<name>_phase_4.md Private reflection Always
reviewer_<name>_phase_5_round1.md Debate round 1 Always (rounds 2/3 per existing skip rules)
reviewer_<name>_phase_7.md Blind final Always

Plus panel-level files:

  • phase_8_audit.md
  • phase_10_claim_verification.md
  • phase_11_severity_verification.md

For each required file, run three checks:

  1. Existence check — file is present on disk.
  2. Minimum-bytes check — file size ≥ 500 bytes. Below this is empirically a stub (subagent crashed mid-write or returned a placeholder).
  3. Required-headers check — parse the file and confirm it contains the required schema sections for that phase (e.g., a Phase 3 review must contain a Score, a Findings section, and severity tags). The exact required sections per phase are defined in references/prompt-templates.md.

On gate failure for any file:

  1. Log loudly: GATE FAIL: <file> missing | stub | malformed
  2. Re-dispatch the subagent for the missing/malformed phase output.
  3. Re-run the gate after re-dispatch.
  4. Single retry only. If the second attempt also fails, do NOT block the run. Mark the phase as unrecoverable, write the COMPRESSED RUN header in Phase 15.1 (see Phase 15.1 spec), and proceed to Phase 14 with the partial input. The deliverable is produced with explicit warning rather than failing entirely — partial review with loud warning beats no review.

On full gate pass: proceed to Phase 14. The COMPRESSED RUN header is NOT emitted (its absence is the green light).

Debate-presence assertion (v3.5.0) — distinct from per-file compression. The per-file gate above re-dispatches an individual missing file. But the dominant real-world failure (2026-06-06 audit: 50/51 runs had no debate) is the wholesale skip — the orchestrator never ran Phase 5 at all and jumps from independent reviews straight to the judge. So, separately from the per-file check, count the reviewer_*_phase_5_round1.md files across the whole panel (or state/run_<N>/ in multi-run mode):

  • If the count is ZERO when mode = full panel (the entire debate phase is absent, not just one reviewer's file), this is the NO-DEBATE condition. Do NOT proceed to Phase 14 silently. Instead:
    1. Preferred — run Phase 5 now. Debate was skipped; execute it (round 1 is non-skippable per the protocol) and re-run this assertion.
    2. If debate is genuinely unavailable for this execution shape (e.g., the run is executing as a parallel Workflow / ultracode fan-out with no sequential cross-talk primitive — see Debate inside a Workflow below), stamp the [NO-DEBATE] banner (Phase 15.1 / 15.3) and lower the verdict confidence (cap at Medium). The judge still rules, but the report announces that no adversarial cross-examination happened.
  • If the count is ≥ 1, debate ran; no NO-DEBATE banner. (Individual missing round-1 files for some reviewers remain a per-file COMPRESSED case, handled above.)

Detection is not solely anchored here. Phase 13.5 does not fire on every execution shape (an inline/workflow-shaped run can skip this gate entirely — that is exactly how the audit's silent skips slipped through). The load-bearing NO-DEBATE check therefore also runs at the Phase 15.1 report-write chokepoint (every completed run passes through it). See Phase 15.1.

Why bytes + headers, not just existence: A subagent can write a stub and crash, leaving an empty/partial file. Existence alone passes the gate on a stub. Bytes + required-headers makes the check load-bearing. This mirrors how the Phase 15 verification gate (v2.16.4) validates HTML output structurally, not just by file presence.


Phase 14: Supreme Judge

Single agent (model: "opus"). The launch prompt is ~200 tokens of metadata: the paths to the state files produced by Phases 3, 4, 5, 7, 8, 10, 11, and 13. The judge reads state files on demand using the Read tool — it does NOT receive verbatim phase outputs pre-stuffed into its launch prompt. This mirrors the Phase 15.3 HTML-agent pattern (v2.16.4) and caps the judge's window load even when the panel has produced hundreds of kilobytes of material.

The judge's ruling is materialized to state/phase_14_judge_ruling.md so Phase 15.1 can later consume it from disk (rather than from chat).

Steps (in order): 0. Review verification results (claims, severity, commands, and verification round) 0.5a-b. Verify audit findings, anti-rhetoric assessment 0.5c. Severity dampening — minimum evidence-justified severity. In Precise mode, findings without code citations cannot exceed P2. Live-State Claim Discipline (v3.3.0): a live-infrastructure/runtime-state claim tagged [STATIC-INFERENCE] cannot exceed P1, and a P0 that one cheap read-only command could falsify (with no agent having run it) is capped at P1 until verified. 0.5d. Coverage check — flag unexamined risk categories, scan source for gaps 1-3. Debate quality, disagreement rulings, consensus correctness. A [STATIC-INFERENCE-CONSENSUS] point — multiple reviewers agreeing off the same artifact lines — counts as one source, not independent verification. 4-5. Absent-safeguard check, independent gap scan, score assessment 6-7. Epistemic label classification (including [LIVE-VERIFIED] / [STATIC-INFERENCE] / [STATIC-INFERENCE-CONSENSUS] for live-state claims), final verdict 8-9. Action items, meta-observation 10. Write ruling to {state_dir}/phase_14_judge_ruling.md (v3.1.0+).

See references/prompt-templates.md for the full judge prompt.


Phase 14.5: Post-Judge Verification Gate (v3.2.0)

The Supreme Judge in Phase 14 can introduce new P0/P1 findings as a side effect of its Step-0 Verification Review — findings the panel never raised. Phase 11 (Severity Verification) only re-verifies panel-raised P0/P1, so judge-introduced findings bypass every prior verification phase. A 2026-04-27 README review run produced a hallucinated "12 unresolved git conflict markers" P0 (the file was clean — wc -l and grep -c both confirmed) and that single fabricated finding drove a 3/10 REJECT-AND- REWRITE verdict (issue #41).

Phase 14.5 closes this gap by re-verifying every judge-introduced P0/P1 against ground truth before Phase 15.1 generates the report.

A single Opus agent runs after Phase 14 and before Phase 15.1. Inputs are the paths to {state_dir}/phase_14_judge_ruling.md, {state_dir}/phase_11_severity_verification.md, and {state_dir}/phase_8_audit.md. The agent has grep / Read / Bash tools.

Steps:

  1. Classify each P0/P1 finding in the judge ruling as [PANEL-RAISED] (skip — covered by Phase 11) or [JUDGE-INTRODUCED] (verify here).
  2. For every [JUDGE-INTRODUCED] finding, run a ground-truth check appropriate to the claim type (location, state, existence, external domain) using grep/Read/git/Bash. Quote actual command output.
  3. Issue a verdict per finding: [JUDGE-CONFIRMED] (passes through), [JUDGE-HALLUCINATED] (demote to P3 or remove if actively contradicted), or [JUDGE-PARTIAL] (demote one level, edit to retain only the replicated sub-claim).
  4. If any P0 was demoted/removed, recompute the verdict score against the panel mean and document the override.
  5. Write the full verification table to {state_dir}/phase_14_5_judge_verification.md. Phase 15.1 reads it.

Phase 15.1 banner. When the gate produces any [JUDGE-HALLUCINATED] entry, Phase 15.1 MUST emit this block immediately after the Executive Summary (and after the Compressed Run banner if present):

> ⚠️ **Judge Verification:** N judge-introduced finding(s) flagged as
> [JUDGE-HALLUCINATED] in Phase 14.5. Verdict score replaced with panel
> mean (X/10 → Y/10). Affected action items below carry the
> [JUDGE-HALLUCINATED] suffix.

Affected action items keep the [JUDGE-HALLUCINATED] epistemic-label suffix in both the markdown report and the HTML dashboard expandable issue card metadata.

Empty case. If the gate produces zero [JUDGE-INTRODUCED] findings (every P0/P1 was already panel-raised), it writes a stub file with the single line "No judge-introduced findings to verify" so Phase 15.1's disk-read still succeeds.

See references/prompt-templates.md for the full Judge-Output Verification Agent prompt.


Phase 15: Output Generation

Three output files are written at the end of every review. They are produced in strict sequence: Phase 15.1 first, then Phase 15.2, then Phase 15.3. Phase 15.3 runs AFTER Phase 15.2 (not in parallel) so that the Phase 15.3 agent can read the already-written Phase 15.1 and 15.2 files from disk, avoiding the need for the orchestrator to inject all structured data and process history into the agent prompt from its own context window.


Phase 15.1: Primary Markdown Report

Write structured summary to review_panel_report.md (or user-specified name). This is the main deliverable — concise, structured, action-oriented.

Compressed-run warning (v3.1.0+): If the Phase 13.5 verification gate detected any unrecoverable missing phase output, Phase 15.1 MUST emit this block as the FIRST content of the report (before any other section, including Executive Summary):

> ⚠️ **COMPRESSED RUN — Phases skipped: <comma-separated list, e.g., "4 (security), 5 (security, devils-advocate)">**
>
> This run did not complete the full panel protocol. The Supreme Judge ruled
> on partial input. Findings below should be treated as **lower confidence**
> than a full-run report. Re-run the panel for a complete review.

Additionally, in compressed runs, every action item MUST have [COMPRESSED] appended to its epistemic label (e.g., [CONSENSUS][COMPRESSED], [VERIFIED][COMPRESSED]).

For full runs, the warning block is absent. Its absence is the green-light signal that the panel completed the full protocol.

No-debate warning (v3.5.0) — the load-bearing debate-skip chokepoint. Phase 15.1 is the terminal step every completed run passes through, so the NO-DEBATE detection is anchored HERE (not only in the Phase 13.5 gate, which an inline/workflow-shaped run can skip — that is precisely how the audit's silent skips slipped past). Before writing the report, the orchestrator MUST independently check whether any reviewer_*_phase_5_round1.md state files exist (under state/, or any state/run_<N>/ in multi-run mode). If none exist — the adversarial debate (Phase 5) did not run for this panel — Phase 15.1 MUST emit this block as report content (immediately AFTER the COMPRESSED RUN block if one is present, otherwise FIRST, before the Executive Summary):

> ⚠️ **[NO-DEBATE] — adversarial debate (Phase 5) did not run.**
>
> Reviewers evaluated independently but never cross-examined each other's
> findings. The Supreme Judge reconciled disagreements alone, without a debate
> record. Treat consensus and disagreement rulings as **lower confidence** —
> no reviewer had the chance to revise a verdict in light of a peer's. For a
> high-stakes or adversarial-tradeoff decision, re-run the **full** panel with
> debate (invoke as a skill, not a workflow), or use the debate-in-Workflow
> recipe.

In a no-debate run:

  • Every action item MUST have [NO-DEBATE] appended to its epistemic label (e.g., [CONSENSUS][NO-DEBATE], [SINGLE-SOURCE][NO-DEBATE]).
  • The **Confidence:** header field MUST be capped at Medium (if the judge ruled High, lower it one level to Medium and note why); a no-debate run can never report High confidence.
  • The "Debate Rounds + Summaries" collapsible in Detailed Reviews renders the placeholder "No debate rounds — Phase 5 did not run for this panel."

Banner stacking & the COMPRESSED overlap. COMPRESSED (per-file loss) and NO-DEBATE (wholesale Phase-5 absence) are distinct signals and stack: when both apply, render NO-DEBATE first (it is the more specific, higher-severity signal for the verdict), then COMPRESSED. NO-DEBATE is the named signal for zero Phase-5 output, so a COMPRESSED block need not also enumerate "5" in its phases-skipped list when the NO-DEBATE banner is present. For full runs where debate ran, the NO-DEBATE block is absent; its absence is the green-light signal that adversarial debate occurred.

# Review Panel Report
**Work reviewed:** {title/path}  |  **Date:** {today}
**Panel:** {N} reviewers + Auditor + Judge
**Verdict:** {recommendation}  |  **Confidence:** {High|Medium|Low}
**Auto-detected signals:** {list or "None — base set used"}
**Review mode:** {Precise|Exhaustive|Mixed} (auto-detected from content type)
**Data flow trace:** {Standard|Thorough|Exhaustive} tier | {N} paths traced | {M} invariant violations (v2.14)
{If skipped: "**Data flow trace:** Skipped ({reason — pure docs / no transforms / plan-only})"}
**Codebase state:** {branch name} | {N commits behind {default_branch}} | {worktree: yes/no}
{If multi-run: "**Runs:** {N} (personas rotated per schedule)"}
{If multi-run: "**Run stability:** {X}% of findings appeared in 2+ runs | {Y} single-run findings"}
{If stale: "⚠️ STALE BRANCH — {N} commits merged to {default_branch} since branch point. Findings about missing code should be verified against {default_branch}."}

## Executive Summary
{Judge's verdict, 3-5 sentences. Score X/10.}
{If score spread < 2: Correlation Notice about shared model biases}
{If Low confidence: "⚠️ HUMAN REVIEW RECOMMENDED"}

## Scope & Limitations
{What was reviewed. What CANNOT be evaluated: runtime behavior, production
data, security via dynamic analysis. Structural limitation: shared base model.}
Epistemic labels: [VERIFIED] [CONSENSUS] [SINGLE-SOURCE] [UNVERIFIED] [DISPUTED] [WEB-VERIFIED] [WEB-CONTRADICTED] [WEB-INCONCLUSIVE] [JUDGE-HALLUCINATED] [LIVE-VERIFIED] [STATIC-INFERENCE] [STATIC-INFERENCE-CONSENSUS]
Defect type labels: [EXISTING_DEFECT] (bug in current code) [PLAN_RISK] (risk if plan is implemented as written)

## Score Summary
| Reviewer | Persona | Intensity | Initial | Final | Recommendation |

## Consensus Points
{Bullet list of points all/most reviewers agreed on, confirmed by judge}

## Disagreement Points (with judge rulings)
{Each disagreement: Side A, Side B, Verification Round result if run, Judge's ruling with reasoning}

## Completeness Audit Findings
{New issues found by auditor, verified by judge}

## Coverage Gaps (if any)
{Risk categories no reviewer examined, with judge's independent assessment}

{If multi-run: "## Run Comparison"}
{If multi-run: Table showing which findings appeared in which runs, with stability labels}

## Action Items (with severity AND epistemic labels{, and stability labels if multi-run})

## Detailed Reviews (collapsible sections)
- Data Flow Map (Phase 2, v2.14) — if tracer ran
- Round 0: Independent Reviews
- Private Reflections
- Debate Rounds + Summaries
- Final Blind Assessments
- Completeness Audit
- Verification Command Execution Results
- Claim Verification Report
- Severity Verification Table
- Verification Tier Assignment (4.8)
- Targeted Verification Results (4.9)
- Supreme Judge Full Analysis

Phase 15.2: Full Agent Process History

Write review_panel_process.md — the "director's cut". This is a complete, chronological, verbatim log of every agent's output with nothing summarized away. The orchestrator assembles this from accumulated outputs; no new agent needed.

Persona profiles are embedded at the point each agent first enters the flow: before each agent's output, a structured "Persona Profile" block documents that agent's role, expertise, reasoning strategy, agreement intensity (for panelists), matched-claim-type (for Phase 13 agents), and which phases they participated in. This makes the process history fully self-explanatory to a reader who wasn't present.

Structure (in order, verbatim for each):

Persona Profiles Registry (at top)
  - All panelist profiles listed before any review output
  - Phase 12b tier advisor profile
  - Phase 13 verification agent profiles (added as they are assigned)
  - Supreme judge profile

Phase 1: Setup
  - Context Brief (full)
  - Persona selection rationale
  - Review mode detection

Phase 3: Independent Reviews
  - [Persona Profile — Persona A] full profile block
  - [Persona A] Full review text
  - [Persona Profile — Persona B] full profile block
  - [Persona B] Full review text
  - ... (all N)

Phase 4: Private Reflections
  - [Persona A] Full reflection + per-finding confidence ratings
  - [Persona B] Full reflection
  - ... (all N)

Phase 5: Debate Rounds
  - Round 1: All reviewer responses (verbatim)
  - Phase 6 Summary: Resolved / Still in dispute / New discoveries
  - Round 2: All reviewer responses (if run)
  - Phase 6 Summary: ...
  - Round 3: ... (if run)

Phase 7: Blind Final Assessments
  - [Persona A] Final score, top 3 points, recommendation, verdict
  - [Persona B] ...
  - ... (all N, unsealed)

Phase 8: Completeness Audit
  - Full auditor output

Phase 9: Verification Command Execution
  - Each command run, raw output, annotation

Phase 10: Claim Verification
  - Full verification table + flagged claims

Phase 11: Severity Verification
  - Full severity verification table + reasoning per finding

Phase 12: Verification Tier Assignment
  - Phase 12a: Confidence-based draft table (with signals)
  - [Persona Profile — Tier Refinement Advisor] profile block
  - Phase 12b: Tier refinement advisor full output (overrides + reasoning)

Phase 13: Targeted Verification Agents
  - [Persona Profile — Verification Agent: Point #1] full profile block
    (role, matched-claim-type, why matched, tier, VoltAgent subagent or generic)
  - [Point #1 — Tier — Persona] Full investigation trail, what was searched,
    what was found, full reasoning, verdict
  - [Persona Profile — Verification Agent: Point #2] ...
  - [Point #2 ...] (all N verification agents, verbatim)

Phase 14: Supreme Judge Deliberation
  - [Persona Profile — Supreme Judge] profile block
  - Full judge output (all steps, unabridged)

See references/prompt-templates.md for the Phase 15.2 assembly spec.


Phase 15.3: Interactive HTML Report

Launch a single Opus agent to write review_panel_report.html — a polished, self-contained single-file interactive dashboard with expandable issue cards (v2.15).

CRITICAL — Data passing strategy (v2.16.4 context-pressure fix): Do NOT inject the structured data or process history into the agent prompt from the orchestrator's context. Instead, the agent prompt MUST instruct the agent to read from disk:

  1. Read review_panel_report.md (already written by Phase 15.1) for all structured summary data (verdict, scores, action items, consensus, etc.)
  2. Read review_panel_process.md (already written by Phase 15.2) for verbatim reviewer narratives, debate transcripts, judge rulings, and verification agent trails — extracting per-finding content for the 10-section accordion
  3. Read references/prompt-templates.md starting from the line ## Phase 15.3: HTML Report Generation Prompt for the full rendering spec (HTML structure, CSS, JS, expandable card schema, filter logic, Prism.js setup, print styles)

Path resolution (CRITICAL): The orchestrator MUST resolve all paths to absolute paths before including them in the Phase 15.3 agent prompt. The subagent has no knowledge of the skill installation directory or the user's output directory. Substitute:

  • {output_dir} → the actual resolved output directory (where Phase 15.1 wrote review_panel_report.md)
  • {skill_dir} → the absolute path to the skill's references/ directory
  • If the user specified a custom output name (e.g., --output my_review.md), use the actual filenames, not the defaults

The orchestrator's Phase 15.3 launch prompt should be SHORT (~10 lines):

You are the Phase 15.3 HTML Report Agent. Generate `{output_dir}/{html_filename}`
by reading these files:
1. {output_dir}/{report_filename} — structured review data
2. {output_dir}/{process_filename} — verbatim narratives and transcripts
3. {skill_dir}/references/prompt-templates.md (search for "Phase 15.3: HTML
   Report Generation Prompt") — the authoritative rendering spec
Follow the rendering spec exactly. Write the complete HTML file.

This keeps the orchestrator's launch prompt under 200 tokens instead of 700+ lines, eliminating the context-pressure failure mode.

Features:

  • Dashboard overview: verdict, score, panel composition at a glance

  • Stats row: issue counts by severity (P0–P3), tier (Light/Standard/Deep), verdict (VR_CONFIRMED/VR_REFUTED/VR_PARTIAL/VR_INCONCLUSIVE/VR_NEW_FINDING)

  • Charts: confidence distribution, tier breakdown (donut), verdict breakdown (horizontal bar), pipeline flow (issues entering/surviving each verification phase)

  • Panel Gallery: collapsible section with avatar cards for every agent — panelists (role, agreement intensity, reasoning strategy, phase badges), Phase 13 verification specialists (matched claim type, why matched, tier, "verified N items" count), and support agents (auditor, judge, tier advisor). Clicking a panelist card filters the issue list to items they raised.

  • Expandable issue cards (v2.15): each card is a native <details> element. The collapsed state shows the one-line summary; the expanded state reveals a 10-section accordion (each section is its own nested <details>):

    1. 📖 Narrative — full reviewer reasoning (verbatim, not summarized)
    2. 📄 Code Evidence — file:line snippets with Prism.js syntax highlighting
    3. 👥 Raised by — per-reviewer severity + reasoning grid
    4. 🔍 Verification Trail — full VR agent output (if verified)
    5. 💬 Debate — round-by-round transcript (if disputed)
    6. ⚖️ Judge Ruling — full reasoning + severity-change explanation
    7. 🛠️ Fix Recommendation — proposed change + before/after code + regression test + blast radius + effort
    8. 🔗 Cross-references — related findings with relationship labels
    9. 🏷️ Epistemic Tags — hover tooltips explaining each label
    10. 📊 Prior Runs — meta-review comparison (if multi-run)

    Empty sections render "No {section} data" placeholders — all 10 sections always present for consistent card structure.

  • Deep-link support: report.html#issue-A1 auto-opens that card and scrolls

  • Keyboard navigation: ↑/↓ between cards, Enter expands, Home/End jump to first/last, / focuses search

  • Expand all / Collapse all controls at the top of the Issues tab

  • Print-friendly: @media print forces all details open, inverts theme, hides charts

  • Filter bar: filter by severity, tier, verdict, epistemic label simultaneously

  • Sort controls: by severity, confidence, tier

  • Inline CSS/JS; Tailwind CSS, Chart.js, and Prism.js (v2.15, new) loaded via CDN

Chart.js wrapper-div mandate (v3.2.0). Every Chart.js <canvas> MUST be wrapped in a <div style="position: relative; height: 220px; width: 100%;"> (or equivalent class) with explicit pixel height. The bare <canvas height> attribute is a no-op when responsive: true and the dashboard always uses maintainAspectRatio: false — without a height-bounded parent, the canvas grows on every layout pass, producing infinite vertical growth on open, scroll, resize, or interaction. See issue #42 for the 2026-04-27 reproduction. The Phase 15.3 prompt enforces this; the test suite asserts every <canvas> has a position-relative height-bounded parent.

See references/prompt-templates.md for the Phase 15.3 agent prompt with the full 10-section schema and rendering spec.

Compressed-run banner (v3.1.0+): If the source Phase 15.1 markdown report begins with the ⚠️ COMPRESSED RUN blockquote, Phase 15.3 MUST render a prominent red banner at the top of the HTML body containing the same warning text. Suggested CSS:

<div role="alert" style="background:#FEE2E2; color:#991B1B; padding:1rem 1.25rem; margin:1rem 0; border:2px solid #DC2626; border-radius:6px;">
  <strong>⚠️ COMPRESSED RUN — Phases skipped: <list></strong>
  <p>This run did not complete the full panel protocol. ... Re-run the panel for a complete review.</p>
</div>

The banner appears above the report header summary card.

No-debate banner (v3.5.0): If the source Phase 15.1 markdown report contains the ⚠️ [NO-DEBATE] blockquote, Phase 15.3 MUST render a prominent amber banner at the top of the HTML body with the same warning text. Use a distinct amber/orange palette so it reads as separate from the red COMPRESSED banner; when both are present, render NO-DEBATE first, then COMPRESSED. Suggested CSS:

<div role="alert" style="background:#FEF3C7; color:#92400E; padding:1rem 1.25rem; margin:1rem 0; border:2px solid #D97706; border-radius:6px;">
  <strong>⚠️ [NO-DEBATE] — adversarial debate (Phase 5) did not run.</strong>
  <p>Reviewers evaluated independently but never cross-examined each other. The judge reconciled disagreements without a debate record — treat rulings as lower confidence. Re-run the full panel with debate for high-stakes decisions.</p>
</div>

The banner appears above the report header summary card (and above the COMPRESSED banner if both are present).


Phase 15 Verification Gate (MANDATORY — v2.16.4)

Before reporting completion, verify ALL THREE output files exist by checking that each file was successfully written (e.g., ls -la review_panel_report.md review_panel_process.md review_panel_report.html).

If all three files exist: proceed to the completion message below.

If review_panel_report.html is missing (Phase 15.3 failed):

  1. Log: "Phase 15.3 HTML report generation failed. Retrying..."
  2. Retry Phase 15.3 ONCE with the same disk-reading prompt (the agent reads from disk, so no orchestrator context re-assembly is needed)
  3. After retry, verify again
  4. If the file now exists: proceed to completion message
  5. If still missing after retry: report the two files that DO exist, and tell the user: "The HTML report could not be generated automatically. To generate it manually, say: generate the HTML review report"

Completion message (only after verification passes): Tell user:

  • Paths to all output files that were successfully written
  • Verdict + score (from primary report)
  • Counts: consensus points, disagreements, action items, verification verdicts
  • Top P0 action item (if any)
  • Note: HTML report requires internet connection for Tailwind CSS, Chart.js, and Prism.js CDNs
  • HTML footer should read "Agent Review Panel v3.5.0" (MUST match the full semver from plugin.json — update this line whenever the version is bumped)

Manual HTML Report Recovery (v2.16.4)

If the user asks to "generate the HTML report" or "generate the HTML review report" after a review has completed (whether Phase 15.3 failed or the user wants to regenerate), launch the Phase 15.3 agent with the same disk-reading prompt described above. Resolve all paths to absolute paths. The agent MUST:

  1. Read the Phase 15.1 output file (e.g., review_panel_report.md) for structured data — use the actual filename from the completed review
  2. Read the Phase 15.2 output file (e.g., review_panel_process.md) for verbatim content
  3. Read the skill's references/prompt-templates.md (absolute path) starting from "Phase 15.3: HTML Report Generation Prompt" for the rendering spec

Do NOT write a generic styled HTML page from the orchestrator's memory of the review. The spec in references/prompt-templates.md is authoritative — it specifies Tailwind CSS, Chart.js, Prism.js, the 10-section expandable accordion, Panel Gallery, filter logic, keyboard navigation, deep-linking, and print styles. Any HTML report that does not follow this spec is non-compliant.


Review-Mode Spectrum & Debate-in-Workflow (v3.5.0)

This skill is the full adversarial panel — its distinguishing feature is Phase 5–7 debate (reviewers cross-examine each other before a judge rules). Debate is expensive (sequential cross-talk) and only pays off when reviewers would genuinely change each other's verdicts. A 2026-06-06 audit of 51 real runs found debate ran in only 1 — most reviews were (correctly or not) routed to debate-less fast modes. Pick the mode deliberately:

Want Use Debate?
Fast eyes on a tiny PR code-review / single-agent-multi-persona-review no
Independent parallel lenses, small PR, autonomous multi-PR run parallel-panel-streamlined-no-debate no (by design)
Adversarial tradeoff, high-stakes gating, debate would change the verdict (security vs perf, "is this P0 real", merge go/no-go) this skill (full panel) — invoke as a skill, NOT a workflow yes (Phase 5–7)

Why "invoke as a skill, not a workflow" matters. Debate lives in this skill's Agent-tool orchestration. The Workflow / ultracode engine is a parallel fan-out engine (parallel() / pipeline() — agents never see each other) whose canonical recipe is literally "find → verify → judge". Running "review this in ultracode" therefore produces a structurally debate-less run, and the panel's NO-DEBATE banner will fire. If you want the panel's depth under a Workflow, you must author debate as an explicit phase.

Debate-in-Workflow recipe (ultracode-mode)

Debate IS achievable inside a Workflow — it just isn't the default shape. The trick: a debate "round" is just re-spawning each reviewer agent with its peers' prior-round findings injected (it reads peer state files). Sequential cross-talk becomes a pipeline where stage N reads stage N−1's siblings:

// 1. Round 1 — independent reviews, in parallel (no cross-talk yet).
const round1 = await parallel(PERSONAS.map(p => () =>
  agent(`Review the work as ${p.name}. Write findings to state/reviewer_${p.key}_phase_5_round1.md`,
        {phase: 'Review', schema: FINDINGS_SCHEMA})));

// 2. Debate round 2 — each reviewer re-runs WITH every peer's round-1 findings
//    as input, and rebuts/revises. This is the cross-examination.
const round2 = await parallel(PERSONAS.map((p, i) => () =>
  agent(`You are ${p.name}. Your round-1 findings: ${JSON.stringify(round1[i])}.
         Your peers' round-1 findings: ${JSON.stringify(round1.filter((_,j)=>j!==i))}.
         Where do you concede, push back, or find a NEW issue their angle exposes?
         Write to state/reviewer_${p.key}_phase_5_round2.md`,
        {phase: 'Debate', schema: REBUTTAL_SCHEMA})));

// 3. Judge reconciles WITH the debate record (not alone).
const ruling = await agent(`Adjudicate. Read the round-1 and round-2 state files;
  rule on each disagreement citing how the debate moved (or didn't).`,
  {phase: 'Judge', schema: RULING_SCHEMA});

Authoring an explicit Debate phase (the audit's one debating run used phases [Review, Debate, Audit+Verify, Judge]) is what makes the round-1 state files exist — which in turn satisfies the NO-DEBATE check. A Workflow that skips the Debate phase will (correctly) get the NO-DEBATE banner.


Multi-Run Union Protocol (v2.14)

A single panel run catches ~60–70% of discoverable issues. Independent runs with rotated persona compositions have only ~30% finding overlap — meaning each run catches issues the others miss. For high-stakes reviews, the Multi-Run Union Protocol runs the panel N times and merges results.

Invocation

  • Flag: --runs N (explicit count)
  • Natural language: "run 3 times and merge", "multi-run review", "run twice with different reviewers", "maximum coverage review"
  • Default: N=1 (no merge, single-run mode)
  • "Multi-run" without N: defaults to 2

Persona Rotation Schedule

Deterministic given the run number. Run 1 uses the base set; subsequent runs use complementary sets to maximize coverage diversity.

Run # Persona Set Purpose
1 Standard content-type base set + signal specialists Canonical review
2 Complementary: Code Quality Auditor, Performance Specialist, Methodology Analyst, DA + DIFFERENT signal specialists than Run 1 Catch what Run 1 missed
3 Adversarial-heavy: 3 Devil's Advocates (different reasoning strategies) + 1 Correctness Hawk Stress-test consensus
4+ Cycle through 1–3 with shuffled signal specialists Diminishing returns

Run 3 Devil's Advocates use different reasoning strategies:

  1. Analogical reasoning ("compare to known failure patterns from similar projects")
  2. Adversarial simulation ("imagine you are an attacker / malicious user")
  3. Failure mode enumeration ("list every way this could fail in production")

Key Rules for Multi-Run Mode

  1. Content classification runs ONCE (in Run 1). The classification is FIXED for all subsequent runs. This eliminates the primary source of cross-run non-determinism documented in the consistency analysis.
  2. Phase 2 (Data Flow Trace) runs ONCE (in Run 1). The Data Flow Map is cached and shared with all subsequent runs. The trace is deterministic for a given codebase; re-running would not produce different paths.
  3. Each run independently executes Phases 3–15 with its own persona set.
  4. Per-run reports are written to review_panel_report_run{N}.md.
  5. After all runs complete, Phase 16 (Merge) runs once to produce the final merged review_panel_report.md.
  6. Runs MAY execute in parallel if the orchestrator supports it (launching multiple run orchestrations as parallel background agents). Sequential execution is also acceptable.

Phase 16: Merge (v2.14, multi-run only)

Single agent (model: "opus"). VoltAgent mapping: voltagent-meta:knowledge-synthesizer (always pass model: "opus").

The Merge Agent receives all N per-run reports and executes:

  1. Collect all findings from all runs, preserving severity, location, bug class, epistemic label, and source run number.

  2. Deduplicate by semantic similarity. Two findings are duplicates if AND ONLY IF:

    • Same location (same file AND same function, OR lines within 10 of each other)
    • AND same bug class
    • Different bug classes at same location → keep both
    • Same bug class at different locations → keep both
    • When in doubt, prefer keeping duplicates over false merging
  3. Score stability. For each merged finding, count how many runs produced it:

    • [N/N RUNS] — found in every run, highest confidence
    • [K/N RUNS] (1 < K < N) — found in multiple runs, medium-high confidence
    • [1/N RUNS] — single-run finding, NOT demoted. Single-run findings often represent unique persona insights that only one configuration surfaced. The consistency analysis proved single-run P0s are often the most valuable findings.
  4. Resolve severity disagreements. When runs disagree on severity for a merged finding, use the HIGHEST severity from any run (conservative: false negatives are invisible while false positives are visible and dismissible). Note the range: "P0 (Run 1) / P1 (Run 2)".

  5. Resolve judge divergence. If per-run judges gave scores more than 2 points apart, flag [JUDGE_DIVERGENCE], explain what drove the difference (different persona focus? different threat model?), and provide an independent merged assessment.

  6. Produce the merged report at review_panel_report.md. Per-run reports remain at review_panel_report_run{N}.md for audit trail.

Merged Report Additions

The single-run Phase 15.1 report format is extended with:

New header fields:

**Runs:** {N} (personas rotated per schedule)
**Run stability:** {X}% of findings appeared in 2+ runs
**Unique to single run:** {Y} findings

New required section:

## Run Comparison
| Finding | Run 1 | Run 2 | Run 3 | Merged Severity | Stability |

New label type in Scope & Limitations:

Stability labels: [N/N RUNS] (high confidence) [K/N RUNS] (medium) [1/N RUNS] (single-angle)

Action items gain a stability label:

1. **[P0] [VERIFIED] [2/2 RUNS]** Add mutex lock around token refresh
2. **[P1] [CONSENSUS] [1/2 RUNS]** Sanitize error messages *(Run 2 only: Security Auditor)*

See references/prompt-templates.md for the full Phase 16 Merge Agent prompt.


Implementation Notes

State files (v3.1.0+)

Subagent outputs for Phases 3, 4, 5, 7, 8, 10, 11, and 14 are written to disk under a state/ subdirectory of the review output directory, then the subagent returns only the file path plus a 100-word summary. The orchestrator reads files on demand rather than holding verbatim subagent outputs in its context window.

Reviewer state files use the naming convention state/reviewer_<name>_phase_<N>.md (where <name> is the persona slug and <N> is the phase number); orchestrator-level state files include state/phase_8_audit.md, state/phase_10_claim_verification.md, state/phase_11_severity_verification.md, state/phase_14_judge_ruling.md, and state/phase_14_5_judge_verification.md (v3.2.0).

Single-run layout:

docs/reviews/<date>-<topic>/
├── state/
│   ├── reviewer_<name>_phase_3.md         # independent review
│   ├── reviewer_<name>_phase_4.md         # private reflection
│   ├── reviewer_<name>_phase_5_round1.md  # debate response
│   ├── reviewer_<name>_phase_7.md         # blind final assessment
│   ├── phase_8_audit.md
│   ├── phase_10_claim_verification.md
│   ├── phase_11_severity_verification.md
│   ├── phase_14_judge_ruling.md
│   └── phase_14_5_judge_verification.md      # v3.2.0 — post-judge gate
├── review_panel_report.md                  # Phase 15.1
├── review_panel_process.md                 # Phase 15.2
└── review_panel_report.html                # Phase 15.3

Multi-run layout (Phase 16):

docs/reviews/<date>-<topic>/
├── state/
│   ├── run_1/reviewer_<name>_phase_3.md
│   ├── run_1/reviewer_<name>_phase_4.md
│   ├── ...
│   ├── run_2/reviewer_<name>_phase_3.md
│   └── ...

Each run's state lives under state/run_<N>/ (e.g. state/run_1/reviewer_<name>_phase_3.md, state/run_2/reviewer_<name>_phase_3.md). The merge step (Phase 16) reads state files from each run independently when computing union findings.

This pattern mirrors overnight-insight-discovery, successor-handoff, and cloud-run-results-bq-postsync — every long-running multi-agent skill in the local catalog routes intermediate outputs through disk to keep the orchestrator window small.

  • Parallel execution: Phases 3, 4, 5, 7 use single message with multiple Agent tool calls. Phases 2, 8, 9, 10, 11, 12, 13, 14 are sequential (Phase 9 is orchestrator-driven via Bash, not a subagent). Phase 12a is orchestrator logic (no agent). Phase 12b is a single Opus agent. Phase 13 agents launch in parallel (single message with one Agent call per dispute point). Phases 15.1, 15.2, and 15.3 run in strict sequence (15.1 → 15.2 → 15.3). Phase 15.3 runs AFTER 15.2 so its agent can read the already-written files from disk instead of requiring the orchestrator to inject all data in-context.
  • Context management: Full content in Phases 2, 3, 8, 14. Phase 6 summaries with source excerpts in debate rounds for long works (>500 lines).
  • Error handling: Retry failed agents once. Proceed with minimum 2 reviewers. Note gaps in report. Phase 15.3 has an explicit verification gate (v2.16.4): if the HTML file is missing after the agent returns, retry once before degrading to 2-file output with a manual recovery instruction for the user.
  • Idempotent: Safe to re-run on the same content — each invocation produces an independent panel with no side effects from previous runs.
  • Auto-persona algorithm: Classify → base set → signal scan → add up to 6 → replace DA first. See references/signals-and-checklists.md for signal table.
  • Multi-run execution (v2.14): When --runs N > 1, Phase 1 runs once (shared classification + signal detection + context brief), Phase 2 runs once in Run 1 (cached Data Flow Map), then Phases 3–15 repeat N times with rotated personas, then Phase 16 merges. Runs MAY execute in parallel (independent orchestrations) or sequentially.
  • Force opus (v2.14): ALWAYS pass model: "opus" when launching agents, even with subagent_type. VoltAgent agents may have sonnet/haiku defaults in their frontmatter; without explicit override, reviewer reasoning depth varies across runs. This was an invisible source of cross-run variance in v2.9–v2.13.

Edge Cases

  • No content provided: Ask user what to review. Do not launch a panel with empty input.
  • Very large files (>500 lines): Use Phase 6 summaries with excerpts instead of full content in debate rounds. Cap at 20k lines total.
  • Binary/image files: Skip. Note in report: "Binary files excluded from review."
  • Single tiny file (<20 lines): Reduce to 2 reviewers (minimum). Full panel is overkill.
  • No P0/P1 findings: Skip Phases 9 and 11. Proceed directly to claim verification.
  • No unresolved disputes or unverified action items: Skip Phases 12 and 13. Proceed directly to Phase 14.
  • All reviewers agree (score spread < 2): Flag correlated-bias warning in report. Do NOT skip debate — unanimous agreement is the most dangerous failure mode.
  • Phase 2 skipped (v2.14): For pure docs/plans, or code with no data transforms (pure API routing, static config), skip Phase 2 entirely. Note reason in Context Brief and report header: "Data flow trace: Skipped ({reason})". Proceed directly to Phase 3.
  • Single-run mode (v2.14): --runs 1 (default) skips Phase 16 (Merge). Report is written directly to review_panel_report.md by Phase 15.1. No stability labels. No Run Comparison section.
  • Multi-run with N > 3 (v2.14): Persona rotation cycles through Runs 1/2/3 schedule with shuffled signal specialists. N > 4 has diminishing returns — warn the user that marginal finding discovery drops sharply after Run 3.
  • Multi-run judge divergence (v2.14): If per-run judge scores span > 2 points, Phase 16 flags [JUDGE_DIVERGENCE] and provides an independent merged assessment rather than averaging.
  • Exhaustive trace on very large codebases (v2.14): No token budget limit. If the file is > 20k lines, Phase 2 may take > 30 min. Warn the user and offer Thorough tier as alternative.
  • HTML report soft size cap (v2.15): Target 150–250KB, soft cap 500KB. If the combined structured data (all 10 expandable sections across all findings) exceeds 500KB, the Phase 15.3 agent SHOULD offer a "slim" mode that drops verbatim fullEvidence and debateTranscript content (replacing with summaries). Slim mode is indicated in the report header and footer.
  • Prism.js CDN unreachable (v2.15): If the Prism.js CDN fails to load, code evidence blocks render as unstyled <pre><code> elements (still readable, just without syntax colors). Wrap Prism calls in try/catch to prevent a CDN failure from breaking the page. This is consistent with the existing graceful-degradation approach for Tailwind and Chart.js CDN failures.
  • Empty expandable sections (v2.15): When a finding lacks data for any of the 10 accordion sections (e.g., no debate, no prior runs), render a "No {section} data" placeholder instead of omitting the section. Every expanded card must show all 10 sections in the same order for consistent structure. This prevents the v2.13 nice-shtern compliance gap where agents silently omitted the expand button when evidence fields were empty.

For full prompt templates, see references/prompt-templates.md. For version history, see references/changelog.md.

Install via CLI
npx skills add https://github.com/wan-huiyan/agent-review-panel --skill agent-review-panel
Repository Details
star Stars 19
call_split Forks 2
navigation Branch main
article Path SKILL.md
More from Creator