review-swarm - SKILL.md Agent Skill

name: review-swarm description: Run clean-room multi-agent loops across Claude/Gemini/Codex/OpenCode with strict review-contract checks, fallback policy, and convergence gates.

Review Swarm (multi-backend)

This skill provides a reusable clean-room swarm harness for independent reviewers/analysts.

Core capabilities:

Run N agents with run_multi_task.py.
Mix backends: OpenCode, Claude CLI, Codex CLI, Gemini CLI.
Enforce strict review output contract (optional).
Apply fallback policy when a target backend fails/returns invalid output.
Record deterministic artifacts (trace.jsonl, meta.json, outputs).
Gate on convergence (optional Jaccard similarity).

Canonical entrypoint

Use scripts/bin/run_multi_task.py for all new workflows.

Primary public skill name: review-swarm. Use review-swarm consistently in documentation and automation references.

Requirements

Install runner skills for any backends you plan to use:

opencode-cli-runner (for OpenCode backend)
claude-cli-runner (for claude/... models)
codex-cli-runner (for codex/... models)
gemini-cli-runner (for gemini/... models)

CLIs should be available in PATH according to the chosen backends.

Host-aware execution (your own family runs native; quality first)

run_multi_task.py shells out to CLIs — that is for CROSS-family reviewers. Host capabilities VARY; gate on what your host exposes:

Single-family review → keep it in-host, not via that family's CLI. If you only need YOUR family's reviewer (no cross-family swarm aggregation), run it in-host: use a native child-agent/sub-agent primitive if your host has one (Claude Code's Agent/Task tool; OpenCode subagents), else run it inline in your own loop — don't claude exec a model you are already running as (latency, separate auth/session, context loss). Plain Claude Desktop / the Gemini CLI may have no sub-agent primitive → inline.
Cross-family swarm → all reviewers go through run_multi_task.py (honest caveat). Its convergence / contract aggregation is computed over the runner's OWN output files, so a natively-run same-family reviewer would not be in the swarm. Getting one unified multi-backend verdict therefore means your own family also goes through its CLI here — that hop is the price of in-process aggregation. Use the swarm when you need cross-MODEL review; do single-family reviews natively.
Reasoning effort scales with review difficulty — quality first, not token thrift. High-stakes, cross-package, or security-sensitive reviews warrant maximum thinking (extended thinking / high–xhigh reasoning effort / a stronger model); trivial diffs do not. Never accept a missed defect to save tokens.
For a long/expensive swarm, prefer a steerable background task chip (e.g. Claude Code spawn-task) the user can inspect and adjust mid-run, when the host supports one; otherwise run inline and checkpoint. Capability varies by host — degrade gracefully.

Quick start (multi-agent)

python3 scripts/bin/run_multi_task.py \
  --out-dir /tmp/multi_review \
  --system /path/to/system.md \
  --prompt /path/to/task.md \
  --agents 3

Quick start (dual review: Claude + Gemini)

CLI-only / cross-host example. Inside a Claude host, run the Claude reviewer as a native sub-agent and use the runner only for the non-Claude lane (--models gemini/default) — see Host-aware execution above.

python3 scripts/bin/run_multi_task.py \
  --out-dir /tmp/dual_review \
  --system /path/to/reviewer_system_claude.md \
  --prompt /path/to/packet.md \
  --models claude/default,gemini/default \
  --backend-tool-mode claude=review \
  --backend-tool-mode gemini=review \
  --backend-prompt gemini=/path/to/gemini_prompt.txt \
  --backend-output claude=claude_output.md \
  --backend-output gemini=gemini_output.md \
  --check-review-contract

Backend overrides

run_multi_task.py supports per-backend overrides:

--backend-prompt backend=/path/to/prompt
--backend-prompt @/path/to/overrides.json (batch mode)
--backend-system backend=/path/to/system or backend=none
--backend-output backend=relative_or_absolute_path
--backend-tool-mode backend=mode
--timeout-secs N

Notes:

These flags are repeatable.
--backend-prompt @json supports:
- shorthand prompt map: {"gemini": "/path/to/gemini_prompt.txt"}
- batch object: {"prompt": {...}, "system": {...}, "output": {...}}
Relative --backend-output paths are resolved under --out-dir.
claude=none for --backend-system is rejected (Claude runner requires a system prompt file).
For a single run, --backend-output does not allow one path for repeated same-backend agents (to avoid output clobbering).
--timeout-secs is a per-backend hard timeout. Default: 900 seconds. Use 0 to disable.
--backend-tool-mode is explicit and backend-specific:
- claude=none|review
- gemini=none|review
- opencode=none|workspace

Reviewer Tool Modes

Default behavior is explicit:

Claude, Gemini, and OpenCode now receive an explicit tool mode from review-swarm.
The default mode is none for all three backends.
Tool access must be opted into per backend with --backend-tool-mode.

Reviewer-safe modes:

claude=review: maps to a read-only built-in tool profile (Read,Glob,Grep).
gemini=review: maps to Gemini CLI --approval-mode plan plus local CLI execution (--no-proxy-first), sandboxing, and --extensions none, which is Gemini's read-only review path.
When Gemini is in review mode and --gemini-cli-home was not explicitly set, review-swarm now synthesizes an isolated GEMINI_CLI_HOME under the run output directory and writes a minimal user settings file there (mcpServers={}, mcp.allowed=[]) to avoid inheriting reviewer-external user MCP state by default.
Gemini review is a headless review path, not the same interaction mode as the Gemini TUI /mcp session. If this path emits MCP discovery noise or does not yield a usable source-grounded verdict on a large packet, prefer a same-model rerun with an embedded-source packet and gemini=none rather than assuming TUI MCP health guarantees headless review stability.

OpenCode caveat:

opencode=workspace explicitly grants workspace visibility by passing --workspace-dir.
For formal workspace reviews, prefer OpenCode's official headless-server flow (opencode serve + opencode run --attach ...) rather than relying only on repeated direct run --dir ... cold starts.
Current opencode run CLI does not expose a built-in read-only tool allowlist comparable to Claude/Gemini, so workspace is explicit workspace access, not a hard no-mutation guarantee.
For opencode=workspace, prefer workspace-relative file paths in prompts/packets. Large prompts that enumerate absolute workspace paths or globs can push the model into external_directory permission requests even when the repo itself is mounted as the workspace.
Treat OpenCode workspace and OpenCode embedded-source as two different review roles:
- workspace: packet-challenge / discovery reviewer. Best when blast radius or hidden front-door / consumer drift is still uncertain.
- none + embedded-source packet: verdict-normalization / formal gate reviewer. Best when scope is already narrowed and you need a stable closeout artifact.
Do not treat an OpenCode workspace pass as "failed" just because the output includes exploratory text or lacks a clean final JSON block. If it still contains source-grounded, current-worktree findings, keep that review signal and only rerun same-model to normalize the gate artifact.
For formal reviewer use, prefer Claude/Gemini for source-grounded read-only review guarantees; treat OpenCode workspace mode as discovery-strong but gate-fragile, and reserve embedded-source OpenCode passes for final formal-verdict stabilization once packet scope is adequate.
When packet scope touches public/package/CLI/workflow/default-entry surfaces, also follow the Front-door Surface Audit requirement in AGENTS.md; runner setup does not replace packet widening.

Execution adversary (mandatory for correctness-critical / method-precondition reviews)

A read-only review is a static read; it cannot confirm a runtime property. When a review must establish that a method's load-bearing precondition actually holds — an operator identity (commutation with a projector/symmetrizer, Hermiticity, self-adjointness, idempotency, unitarity, variational/Galerkin-subspace invariance), a numerical invariant, or a true-operator eigen-residual — at least one reviewer must take an "execution adversary" role: load the artifact and execute the disconfirming test at the production scale/configuration, not statically read the code. Give that reviewer real execution access (a host-native sub-agent with run/Bash, or a sandbox that can execute), and record in meta.json whether each reviewer executed vs. only read the precondition checks. A swarm in which no reviewer executed the precondition is a static-only swarm and must be labeled as such — it does not count as a precondition pass. (A static read can certify code shape; only execution at the production scale can certify that a discretized / implemented property actually holds — a property can read as correct and still fail numerically above the minimal size.)

Source-fidelity reviewer (mandatory for transcription / source-extraction artifacts)

A source-extraction / transcription note — a deep-read / knowledge-base note that transcribes equations, numeric values, source locators, and term-by-term mappings onto a consuming artifact from a primary source — is a valid gate target, not a gate-exempt "reading task." Its primary observable is fidelity to the source, so the review is a different shape from a code/design review: at least one cross-model-family reviewer must do a LITERAL, line-by-line comparison of the note against the primary source with "do not trust the note." Loose semantic agreement is insufficient — transcription drift (a flipped sign, a dropped magnitude factor, a transposed digit, a stale locator, or a stale mapping to the consuming artifact) reads as plausible and is caught only by literal comparison. Reviewer model-family diversity materially strengthens this gate: a same-family looser read tends to pass exactly the defects it is meant to catch.

Give that reviewer the persisted primary source (the exact bytes that were transcribed), not the note alone, plus the transcription/extraction failure checklist (research-integrity → Extraction / transcription fidelity, items (a)–(g)). Record in meta.json whether a literal cross-family source comparison was performed; a swarm that only read the note, or stayed within one model family, is not a fidelity pass and must be labeled as such.

Artifact-integration reviewer (for rendered research artifacts)

When a workflow turns source-read notes into a rendered artifact — for example an interactive literature graph, slide deck, dashboard, or browsable note bundle — include at least one reviewer whose task is to inspect the current rendered artifact and its source files, not merely the synthesis prose. This reviewer checks integration failures that source-fidelity review alone cannot see: broken relative links, missing images, unrendered math, non-clickable connected references, stale note paths, layout collisions, and a renderer that displays placeholders or filenames instead of the intended evidence.

Write reusable workflows in terms of reviewer roles and capabilities, not specific model names. A concrete run may choose particular models, but the skill or project contract should say "independent cross-model artifact reviewer" or "source-fidelity reviewer" unless a user explicitly pins a model for that run. After any artifact fix, rerun the reviewers on the fixed artifact before calling convergence.

Reference-reproduction reviewer (mandatory for "matches / reproduces a published value" claims)

A claim that a result reproduces / matches / agrees with a published reference value is a quantitative claim a static read cannot certify — reading the prose only confirms the prose. When a packet asserts such a match, at least one reviewer must take a "reference-reproduction" role and cover two distinct dimensions that a correctness / methodology / honesty review routinely passes over:

D1 — recompute and compare. Compute the claimed observable on a comparable state / regime / configuration and compare to the published number numerically — do not accept a qualitative "same order of magnitude / same sign / right scale" assertion, and do not accept the citation as if citing the source proved the match. Compare term by term where the claim is term-level (a net total can agree while individual contributions are suppressed or sign-flipped). An order-of-magnitude same-direction discrepancy, or a sign reversal, is a BLOCKING finding, not a pass. Give this reviewer real execution access (a host-native sub-agent with run/Bash, or an executing sandbox) when the comparison requires computation.
D2 — the independent cross-check did not silently lapse. Confirm that any cross-validation evaluates the same model by a different route. A structurally different-model engine, or a check valid only in a degenerate / limit regime, must be labeled as a different-model / limit-regime comparison, not presented as validation; and when no apples-to-apples independent check is feasible, the absence is recorded as an explicit stated limitation rather than an established cross-check being allowed to silently disappear.

Record in meta.json whether a reviewer computed-and-compared vs. only read the match assertion; a swarm in which no reviewer recomputed the claimed observable on the comparable state is a static-only swarm for that claim and must be labeled as such — it does not count as a reference-match pass. Cross-model-family diversity strengthens this gate. Pair it with numerical-reliability-gate G8 (the compute-and-compare gate, returning reference_mismatch on an order-of-magnitude or sign gap) and the research-integrity Reference-reproduction fidelity dimensions.

Model selection

--agents N: rotate through available OpenCode config models.
--models a,b,c: explicit model specs.
--model default: one OpenCode agent, CLI default model.
Mixed backends supported: claude/..., codex/..., gemini/..., OpenCode provider/model.

Default-model policy (hard rule)

When model is omitted or set to default, do not inject historical model names. Always delegate to each backend CLI's configured default model.

This rule applies to all backends:

OpenCode
Claude CLI
Codex CLI
Gemini CLI

Fallback policy

Fallback can be enabled for target backends (default target: gemini):

--fallback-mode off (default)
--fallback-mode ask (exit code 4, asks for rerun decision)
--fallback-mode auto (tries --fallback-order, default codex,claude)

Example:

python3 scripts/bin/run_multi_task.py \
  --out-dir /tmp/dual_review \
  --system /path/to/system.md \
  --prompt /path/to/prompt.md \
  --models claude/default,gemini/default \
  --check-review-contract \
  --fallback-mode auto \
  --fallback-order codex,claude

Prompt-size guardrail (optional)

--max-prompt-bytes N or --max-prompt-chars N
--max-prompt-overflow fail|truncate

When enabled, guardrails apply to global inputs and backend override inputs.

Convergence check

python3 scripts/bin/run_multi_task.py \
  --out-dir /tmp/multi_review \
  --system /path/to/system.md \
  --prompt /path/to/task.md \
  --models claude/opus,gemini/default \
  --check-convergence \
  --convergence-threshold 0.8

Re-review after every fix (gate-loop discipline)

Convergence is a property of the reviewers' agreement on the current artifact, never a self-pronouncement after applying a fix. The gate loop is review → fix → re-run the independent reviewers on the fixed artifact → repeat, and it converges only when the reviewers themselves return clean. Re-review after every correction round, including ones that look trivial or single-line: a fix can introduce a new defect — a corrected transcription line that silently drops a magnitude factor, or a refactor that re-breaks an invariant — that exists only after the fix and is caught only by the next independent round. Skipping the confirmation round because the change "obviously" closed the finding is the failure mode this rule exists to stop. The leader integrates and decides, but does not declare convergence in place of the reviewers.

Contract checking (informational)

--check-review-contract validates output format compliance and records results in meta.json. Contract failures are informational only — they never trigger fallback. Content matters more than format.

If you want models to output a specific format, include format instructions in your system/user prompt.

Standalone checker:

python3 scripts/bin/check_review_output_contract.py /tmp/dual_review/claude_output.md

Contract auto-detects output format:

Markdown: VERDICT: READY/NOT_READY first line + required headers (## Blockers, etc.)
JSON: Valid JSON object with blocking_issues (array), verdict (PASS/FAIL), summary

JSON outputs wrapped in markdown code fences (```json ... ```) are automatically unwrapped.

Outputs

{out-dir}/agent_*_*.txt (or backend output override paths)
{out-dir}/trace.jsonl
{out-dir}/meta.json

Runner parity notes

System prompt delivery

All backends now receive the system prompt by default. However, the delivery mechanism differs:

Runner	Delivery	True system role?
claude-cli-runner	`--system-prompt` native arg	Yes
codex-cli-runner	Merged into stdin (`=== System Instructions ===` + `=== Task ===`)	No — prepended to user message
gemini-cli-runner	Concatenated into stdin (`system + \n\n + prompt`)	No — prepended to stdin
opencode-cli-runner	Concatenated into stdin (same as gemini)	No — prepended to stdin

Only Claude CLI uses a true system role with elevated priority. The other three runners prepend the system prompt as a user-message prefix. This is a CLI limitation, not a bug.

File access

Runner	File access	Notes
Codex	`--sandbox read-only`	Can browse the codebase
Gemini	Default headless Gemini CLI mode	Review-safe tool access is opt-in via `--backend-tool-mode gemini=review`
Claude	`--tools` parameter	Review-safe tool access is opt-in via `--backend-tool-mode claude=review`
OpenCode	Workspace exposure is explicit	`--backend-tool-mode opencode=workspace` exposes the workspace, but not with a hard read-only allowlist

Implications for review weight

Codex reviews may reference specific files/lines thanks to sandbox access — treat as higher-confidence for implementation details.
Gemini reviews now default to standard headless mode unless review-safe tools are explicitly enabled.
Claude reviews now default to no built-in tools unless review-safe tools are explicitly enabled.
OpenCode reviews default to isolated, prompt-driven runs unless workspace access is explicitly enabled.
System prompt parity ensures all backends share the same review criteria (BLOCKING/HIGH/LOW taxonomy, output format).

Skill name note

Use review-swarm as the canonical external name. Use review-swarm consistently during migration and in new integrations.