oracle-consensus - SKILL.md Agent Skill

name: oracle-consensus
model: opus
description: Run 2x oracle sessions (FOR + AGAINST stances) to validate design decisions, plans, or bead readiness. Default = two Fable subagents; escalate to PAL 2x GPT-Pro (always paired with Fable) for extremely important or complex validations. Use after design rounds, before implementation, or to challenge architecture decisions.
argument-hint: " [--rounds N] [--models M1,M2]"

Oracle Consensus

Oracle policy (2026-06): Fable (claude-fable-5) is the primary oracle — the default FOR/AGAINST debate runs as two fresh Fable subagents (Agent tool with model: "fable"; Fable is not reachable through PAL). The PAL 2x GPT-Pro consensus is the escalation tier, for extremely important or complex validations only, and must always run alongside a Fable consultation on the same question. See /consult-oracles and the Oracle Consultation Policy in ~/CLAUDE.md.

Run a structured FOR/AGAINST debate between two high-capability oracle sessions on a design decision, plan, or bead set. Produces a scored verdict with specific actionable corrections.

When to Use

After a design swarm to validate decisions
Before committing to an architecture direction
After bead creation to validate readiness
When two approaches seem equally valid
Any time "validate this with oracles" is requested

Prerequisites

For the escalation tier only: verify PAL MCP is running before launching GPT oracles. If mcp__pal__listmodels fails or returns empty, alert the user — agents silently fall back to self-analysis without PAL, producing unreliable results. The default Fable tier needs no PAL.
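The prerequisite check can be sketched as a small guard. This is a minimal sketch, not a PAL API: `list_models` is a hypothetical callable standing in for however your harness invokes mcp__pal__listmodels, and `pal_is_available` is an illustrative name.

```python
from typing import Callable, Sequence


def pal_is_available(list_models: Callable[[], Sequence[str]]) -> bool:
    """Return True only if the model listing succeeds and is non-empty.

    `list_models` is a hypothetical wrapper around mcp__pal__listmodels.
    """
    try:
        models = list_models()
    except Exception:
        # A failed call is treated the same as PAL being unreachable.
        return False
    # An empty listing also counts as unavailable.
    return bool(models)
```

If the guard returns False, stop and alert the user instead of launching the GPT oracles.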

Workflow

Step 1: Frame the Evaluation

Write a clear, specific evaluation prompt. This is what both models will see.

Good framing:

Evaluate this sensor plate settings redesign for a WPF desktop app:
- Plan: [summary of key decisions]
- Key constraint: Must support 1-8 sensor plates, composite plates, and auto-detection
- Files affected: [list]
- Risk areas: [list]
Score 1-10 on: correctness, completeness, feasibility, UX quality, architecture quality

Bad framing:

Is this plan good?

The evaluation prompt must be self-contained — models do not share context between stances.
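One way to keep the framing self-contained is to build it from explicit fields and refuse to produce a prompt when a core field is empty. The sketch below mirrors the "good framing" example; the `Evaluation` class and `build_evaluation_prompt` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field

# Scoring categories taken from the good-framing example.
CRITERIA = ("correctness", "completeness", "feasibility",
            "UX quality", "architecture quality")


@dataclass
class Evaluation:
    subject: str
    plan_summary: str
    key_constraint: str
    files_affected: list[str] = field(default_factory=list)
    risk_areas: list[str] = field(default_factory=list)


def build_evaluation_prompt(ev: Evaluation) -> str:
    """Render a prompt that can be read without any shared context."""
    required = ("subject", "plan_summary", "key_constraint")
    missing = [name for name in required if not getattr(ev, name).strip()]
    if missing:
        raise ValueError(f"prompt is not self-contained, missing: {missing}")
    lines = [
        f"Evaluate this {ev.subject}:",
        f"- Plan: {ev.plan_summary}",
        f"- Key constraint: {ev.key_constraint}",
        f"- Files affected: {', '.join(ev.files_affected) or 'none listed'}",
        f"- Risk areas: {', '.join(ev.risk_areas) or 'none listed'}",
        f"Score 1-10 on: {', '.join(CRITERIA)}",
    ]
    return "\n".join(lines)
```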

Step 2: Configure Stances

Default tier (Fable): spawn two fresh Fable subagents in a single message with opposing stances. Each prompt is self-contained (evaluation prompt + file paths); the agents share no context.

Agent({ subagent_type: "general-purpose", model: "fable",
  prompt: "<evaluation prompt>\n\nStance: Advocate for this design. Identify its strengths, explain why the decisions are sound, and argue that it should be approved. Be specific — cite exact decisions and explain their merit. Score honestly; 'for' does not mean blindly positive." })

Agent({ subagent_type: "general-purpose", model: "fable",
  prompt: "<evaluation prompt>\n\nStance: Challenge this design. Find weaknesses, missing considerations, contradictions, and risks. Propose specific corrections for each issue found. Score honestly; 'against' does not mean blindly negative." })

Then skip to Step 4 and synthesize the two responses yourself.

Escalation tier (PAL 2x GPT-Pro — extremely important/complex only, always paired with a parallel Fable consult on the same evaluation prompt):

{
  "models": [
    {"model": "gpt-5.5-pro", "stance": "for", "stance_prompt": "Advocate for this design. Identify its strengths, explain why the decisions are sound, and argue that it should be approved. Be specific — cite exact decisions and explain their merit. Score honestly; 'for' does not mean blindly positive."},
    {"model": "gpt-5.5-pro", "stance": "against", "stance_prompt": "Challenge this design. Find weaknesses, missing considerations, contradictions, and risks. Propose specific corrections for each issue found. Score honestly; 'against' does not mean blindly negative."}
  ]
}

Alternate Configurations (escalation tier)

Architecture validation (3 models):

[
  {"model": "gpt-5.5-pro", "stance": "for"},
  {"model": "gpt-5.5-pro", "stance": "against"},
  {"model": "gemini-3.1-pro-preview", "stance": "neutral", "stance_prompt": "Provide an independent technical assessment. Focus on feasibility, risk, and alternatives the other evaluators may miss."}
]

Bead readiness (2 models, specific stance prompts):

[
  {"model": "gpt-5.5-pro", "stance": "for", "stance_prompt": "Argue these beads are implementation-ready. Each bead should have: clear ACs in Given/When/Then, correct file paths, correct dependencies, no spec contradictions, and be self-contained."},
  {"model": "gpt-5.5-pro", "stance": "against", "stance_prompt": "Find beads that are NOT ready. Look for: vague ACs, wrong file paths, missing dependencies, contradictions between beads, beads too large for atomic implementation, cross-cutting requirements not embedded."}
]

Step 3: Run Consensus (escalation tier only)

Use the PAL MCP consensus tool, with the paired Fable subagent launched in the same message. The tool manages the multi-step flow internally:

Step 1 (your analysis): Write the evaluation prompt and your own independent assessment
Steps 2-N (model consultations): Each model responds with its stance
Final step (synthesis): You synthesize all responses into a verdict

mcp__pal__consensus(
  step="Evaluate the following design for [topic]:\n\n[evaluation prompt]\n\n[relevant context]",
  step_number=1,
  total_steps=4,  // 1 (your analysis) + N models + 1 (synthesis)
  next_step_required=true,
  findings="[your independent analysis before seeing model responses]",
  models=[...],
  relevant_files=["/absolute/path/to/plan.md", ...]
)
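The `total_steps` arithmetic in the call above depends on how many models are configured. A minimal sketch of that calculation follows; `total_steps` as a function and `ALLOWED_STANCES` are illustrative names, and the stance set is assumed from the three stances that appear in this document rather than from the PAL tool's schema.

```python
# Stances used in this document; assumed, not taken from the PAL schema.
ALLOWED_STANCES = {"for", "against", "neutral"}


def total_steps(models: list[dict]) -> int:
    """1 step for your own analysis + one per model + 1 for synthesis."""
    if not models:
        raise ValueError("at least one model is required")
    for entry in models:
        if entry.get("stance") not in ALLOWED_STANCES:
            raise ValueError(f"unknown stance: {entry.get('stance')!r}")
    return 1 + len(models) + 1
```

With the default two-model FOR/AGAINST pair this gives 4, matching the call above; the three-model architecture configuration gives 5.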

Step 4: Extract Corrections

From the synthesized consensus, extract:

Score (1-10) with breakdown by category
Unanimous findings — both FOR and AGAINST agree (highest confidence)
Contested findings — disagreement between stances (needs human judgment)
Corrections — specific, actionable changes to make

Format corrections as:

## Oracle Consensus Results

**Overall Score:** 8/10
**Unanimous:** 3 findings | **Contested:** 1 finding | **Corrections:** 5

### Unanimous Findings
1. [Finding both stances agreed on]

### Corrections (ordered by impact)
1. **CRITICAL:** [correction] — Reason: [why]
2. **HIGH:** [correction] — Reason: [why]
3. **MEDIUM:** [correction] — Reason: [why]

### Contested
1. FOR says [X], AGAINST says [Y] — **Recommendation:** [your judgment]
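The split between unanimous and contested findings can be sketched as a comparison of the two stances' positions. This is a simplification under a stated assumption: each stance's response has already been reduced by hand to a mapping from a short topic key to a position string. The function name and the `one_sided` bucket are hypothetical additions for illustration.

```python
def classify_findings(for_view: dict[str, str],
                      against_view: dict[str, str]) -> dict[str, list[str]]:
    """Group topics by whether FOR and AGAINST took the same position."""
    unanimous, contested = [], []
    for topic in sorted(for_view.keys() & against_view.keys()):
        if for_view[topic] == against_view[topic]:
            unanimous.append(topic)   # both stances agree: highest confidence
        else:
            contested.append(topic)   # disagreement: needs human judgment
    # Topics raised by only one stance are neither confirmed nor disputed.
    one_sided = sorted(for_view.keys() ^ against_view.keys())
    return {"unanimous": unanimous, "contested": contested,
            "one_sided": one_sided}
```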

Step 5: Apply Corrections

For each correction:

Verify it against the original plan/code/beads
Apply if valid; reject with rationale if not
Track applied vs rejected corrections
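Tracking applied versus rejected corrections can be as simple as a status field plus a tally. The sketch below is one possible shape, not a prescribed format; `Correction`, `mark_applied`, `mark_rejected`, and `tally` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Correction:
    severity: str                  # "CRITICAL", "HIGH", or "MEDIUM"
    text: str
    status: str = "pending"        # "pending", "applied", or "rejected"
    rationale: Optional[str] = None


def mark_applied(correction: Correction) -> None:
    correction.status = "applied"


def mark_rejected(correction: Correction, rationale: str) -> None:
    # A rejection without a reason is not allowed.
    if not rationale.strip():
        raise ValueError("a rejected correction needs a rationale")
    correction.status = "rejected"
    correction.rationale = rationale


def tally(corrections: list[Correction]) -> dict[str, int]:
    counts = {"pending": 0, "applied": 0, "rejected": 0}
    for correction in corrections:
        counts[correction.status] += 1
    return counts
```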

Step 6: Optional Re-validation

If corrections were extensive (5+ CRITICAL/HIGH), run a second oracle round on the corrected version. Use a shorter evaluation prompt focused on whether corrections were properly applied.

Convergence criterion: Stop when oracle score is 8+ AND zero CRITICAL corrections remain.
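The two rules in this step can be written as predicates. This is a minimal sketch with hypothetical function names; severities are assumed to be the plain strings "CRITICAL", "HIGH", and "MEDIUM" used in the corrections format.

```python
def needs_revalidation(severities: list[str]) -> bool:
    """True when 5 or more corrections are CRITICAL or HIGH."""
    serious = sum(1 for s in severities if s in ("CRITICAL", "HIGH"))
    return serious >= 5


def has_converged(score: int, remaining_severities: list[str]) -> bool:
    """True when the score is 8+ and no CRITICAL correction remains."""
    return score >= 8 and "CRITICAL" not in remaining_severities
```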

Multiple Concurrent Oracles

For large targets (e.g., 80 beads across 10 epics), run parallel oracle sessions:

Oracle 1: Evaluate epics 1-3 (data model + safety)
Oracle 2: Evaluate epics 4-6 (UI + UX)
Oracle 3: Evaluate epics 7-10 (integration + testing)
Oracle 4: Evaluate cross-cutting concerns (deps, ordering, completeness)

Each oracle session is independent. Compile the corrections only after every session has completed.
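The fan-out and compile pattern could look like the sketch below. It assumes a hypothetical `run_oracle` callable that takes a scope description and returns that session's corrections as strings; how a session is actually launched is left to your harness.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_parallel_oracles(scopes: dict[str, str],
                         run_oracle: Callable[[str], list[str]]) -> list[str]:
    """Run one independent session per scope, then merge the corrections."""
    if not scopes:
        return []
    with ThreadPoolExecutor(max_workers=len(scopes)) as pool:
        futures = [pool.submit(run_oracle, scope) for scope in scopes.values()]
        # result() blocks, so merging starts only after every session is done.
        results = [future.result() for future in futures]
    compiled: list[str] = []
    for corrections in results:
        for correction in corrections:
            if correction not in compiled:  # drop exact duplicates
                compiled.append(correction)
    return compiled
```

Results are merged in submission order, so the compiled list is stable regardless of which session finishes first.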

Scoring Guide

| Score | Meaning | Action |
|-------|---------|--------|
| 9-10 | Excellent, minor polish only | Ship it |
| 7-8 | Good, specific corrections needed | Apply corrections, no re-validation needed |
| 5-6 | Significant issues | Apply corrections + re-validate |
| 1-4 | Fundamental problems | Redesign required |
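The scoring table maps directly onto a lookup. A minimal sketch, with `action_for_score` as a hypothetical helper name and integer scores assumed:

```python
def action_for_score(score: int) -> str:
    """Return the action from the scoring guide for a 1-10 score."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score >= 9:
        return "Ship it"
    if score >= 7:
        return "Apply corrections, no re-validation needed"
    if score >= 5:
        return "Apply corrections + re-validate"
    return "Redesign required"
```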

Example Invocations

/oracle-consensus foundation/product/features/sensor-plate-implementation-plan.md
/oracle-consensus "Should we split StationConfigurationVM into 5 VMs or 4?"
/oracle-consensus --rounds 2 .beads/  # validate all open beads, re-validate if needed