name: decision-stress-test
description: |
Stress-tests a technical decision before commitment, using 5-12 parallel perspectives (Outsider, Adversary, Simplifier, Maintainer, Linus, Cost, Performance, Operator, Karpathy, Security, Product, User) with cross-debate rounds, plus 5 adversarial techniques (Steel-Man, Assumption Audit, Pre-Mortem, Inversion, Second-Order), calibrated by stakes, with anti-hallucination verification on every finding.
Use it when: you are about to commit to an architecture/library choice, you want to challenge a proposed solution, you suspect agreement bias, or the user says "challenge this decision" ("cuestiona esta decisión"), "devil's advocate" ("abogado del diablo"), "pre-mortem", or "I'm sure about this" ("estoy seguro de esto").
Keywords - stress-test, challenge this decision, devils-advocate, contrarian analysis, steel-man, pre-mortem, before-deciding, before-committing, antes-de-decidir, cuestiona esta decisión, stress-testea
disable-model-invocation: false
argument-hint: ""
effort: xhigh
when_to_use: |
"cuestiona esta decisión", "stress-test esto", "abogado del diablo", "pre-mortem", "estoy seguro de esto", "challenge this decision", "devil's advocate"
Decision Stress-Test
Multi-perspective adversarial analysis of a technical decision before commitment. Spawns 5-12 perspectives in parallel (depending on stakes), runs an adaptive cross-debate gated by a Step-back Judge (max 3 cycles), and synthesizes with 5 adversarial techniques. Anti-hallucination verification is a gate within the process.
Underlying Principle
"If everyone is thinking alike, then somebody isn't thinking." — Tenth Man Rule
Sycophancy is the silent killer of engineering decisions. Hallucination is its loud cousin. This skill builds structural disagreement into the analysis: at least one perspective (Adversary) is forced to dissent; one (Outsider) cannot be contaminated by context; one (Linus) refuses unexplained complexity by construction. The aim is not contrarianism for its own sake — it is to make sure the decision survives a real challenge before commitment.
When to Use
| Trigger | Example |
|---|---|
| About to commit to architecture / library / framework | "Going to use Redis for sessions" |
| Want to challenge a proposed solution | "Claude proposed X — does it hold up?" |
| Suspect agreement bias | "Every time I ask, Claude agrees. Stress-test this." |
| Pre-PR / pre-merge for non-trivial design | "Before I open the PR, stress-test this design" |
When to Skip
| Anti-trigger | Why |
|---|---|
| Mechanical / trivial change | Renames, formatting, fixing a typo — no decision to stress-test |
| Decision already made and committed | Stress-test post-commitment is rationalization, not analysis |
| Pure debugging / fix | No alternatives to weigh — find the bug, fix it |
| Question is exploratory ("what is X?") | Use docs / Context7, not stress-test |
| Stakes are zero (e.g., personal preference in a one-off script) | Stress-test cost > decision cost |
| Decision is reversible / low-stakes / want quick scan | Use decide (3 perspectives, ~500-800 tokens), which is roughly 3-4× cheaper |
Initial Triage
If the input is ambiguous, the invoker (the Lead) makes 1-4 AskUserQuestion calls before Phase 1. Subagents cannot ask the user directly; they include questions in their output, which the invoker consolidates.
| Question | When to ask |
|---|---|
| "What is the exact decision (pick one of N)?" | Input names a topic but not a choice |
| "What alternatives are on the table?" | Only one option mentioned |
| "What's the reversibility / time horizon?" | Stakes tier unclear |
| "What constraints (budget, deadline, team size, compliance)?" | Constraints missing |
If the conversation already has the context (the user just discussed it), skip the triage to avoid redundancy.
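The triage table can be read as a lookup from missing context to the question that fills it. A minimal sketch, assuming the invoker tracks which pieces of context are already known; the keys `choice`, `alternatives`, `reversibility`, and `constraints` are hypothetical labels, not part of the skill:

```python
# Hypothetical context keys mapped to the triage questions from the table above.
TRIAGE = [
    ("choice", "What is the exact decision (pick one of N)?"),
    ("alternatives", "What alternatives are on the table?"),
    ("reversibility", "What's the reversibility / time horizon?"),
    ("constraints", "What constraints (budget, deadline, team size, compliance)?"),
]


def triage_questions(known):
    """Return only the questions whose context is still missing (0 to 4 items)."""
    return [question for key, question in TRIAGE if key not in known]


# The user already named the choice and the alternatives.
pending = triage_questions({"choice", "alternatives"})
print(pending)
```

An empty result corresponds to the skip case: the conversation already carries the context, so Phase 1 can start directly.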
Stakes Calibration
| Stakes | Perspectives | Step-back judge | Techniques |
|---|---|---|---|
| Low | 5 (Outsider + Adversary + Maintainer + Linus + Simplifier) | OFF (no Phase 2) | Steel-Man + Assumption Audit |
| Medium | 8 (+ Performance + Operator + Cost Optimizer) | ON (max 3 cycles, adaptive) | + Pre-Mortem |
| High | 11 (+ Security + Product + Karpathy) | ON (max 3 cycles, adaptive) | All 5 |
| High with UX | 12 (+ User) | ON (max 3 cycles, adaptive) | All 5 |
Default-on-uncertainty: Medium. Detailed criteria, edge cases, and 6 worked classifications in ${CLAUDE_SKILL_DIR}/references/03-stakes-calibration.md.
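The calibration table above can be expressed as a small lookup. The following is an illustrative sketch only: `Plan` and `plan_for` are hypothetical names, and treating a missing tier as Medium is one reading of the default-on-uncertainty rule.

```python
from __future__ import annotations

from dataclasses import dataclass

BASE = ["Outsider", "Adversary", "Maintainer", "Linus", "Simplifier"]
MEDIUM_EXTRA = ["Performance", "Operator", "Cost Optimizer"]
HIGH_EXTRA = ["Security", "Product", "Karpathy"]
TECHNIQUES = ["Steel-Man", "Assumption Audit", "Pre-Mortem", "Inversion", "Second-Order"]


@dataclass(frozen=True)
class Plan:
    perspectives: list[str]
    step_back_judge: bool
    max_cycles: int
    techniques: list[str]


def plan_for(stakes: str | None, has_ux: bool = False) -> Plan:
    """Map a stakes tier to perspectives, judge setting, and techniques."""
    tier = (stakes or "medium").lower()  # default-on-uncertainty: Medium
    if tier == "low":
        return Plan(list(BASE), False, 0, TECHNIQUES[:2])
    if tier == "medium":
        return Plan(BASE + MEDIUM_EXTRA, True, 3, TECHNIQUES[:3])
    if tier == "high":
        user = ["User"] if has_ux else []
        return Plan(BASE + MEDIUM_EXTRA + HIGH_EXTRA + user, True, 3, list(TECHNIQUES))
    raise ValueError(f"unknown stakes tier: {stakes!r}")


print(len(plan_for("high", has_ux=True).perspectives))
```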
Framing Check
Before spawning Phase 1, ask explicitly:
- Are we solving the right problem? (Sometimes the framed decision is the wrong question.)
- Is there an Option C the proposal didn't consider?
- Is the implicit "do nothing" path already evaluated?
This is a 30-second pause that catches a non-trivial fraction of bad framings. Output the answers in the report.
Phase 1: Perspectives in Parallel
Spawn N perspectives (per stakes tier) as general-purpose subagents in parallel — single message, N tool calls. Each receives a tightly scoped prompt:
| Perspective | Lens | Tools | Prompt source |
|---|---|---|---|
| Outsider | First principles, no project context | Input only — NO Read/Grep/Web | ${CLAUDE_SKILL_DIR}/prompts/outsider-agent.md |
| Adversary | Devil's advocate; 5 techniques; Tenth Man | Read, Grep, Context7, WebSearch | inline (see 01-perspectives.md) |
| Performance | Throughput, latency, memory | Read, Grep, Context7, WebSearch | inline |
| Security | OWASP, attack surface, supply chain | Read, Grep, Context7, WebSearch | inline |
| Maintainer | Tech debt, ergonomics, testability | Read, Grep, Context7 | inline |
| Simplifier | "What can we remove?" YAGNI lens | Read, Grep | inline (see 01-perspectives.md) |
| Operator | Deploy, observability, runtime | Read, Grep, Context7, WebSearch | inline |
| Cost Optimizer | $$$, TCO, opportunity cost, 80/20 alternatives | Read, Context7, WebSearch | inline |
| Product | Value, opportunity cost, roadmap | Read (docs), WebSearch | inline |
| Linus Torvalds | Pragmatic-brutal with technical analysis | Read, Grep | ${CLAUDE_SKILL_DIR}/prompts/linus-agent.md |
| Karpathy | AI-friendliness, modern stack pragmatism, build-observe-iterate | Read, Grep, Context7, WebSearch | ${CLAUDE_SKILL_DIR}/prompts/karpathy-agent.md |
| User (optional) | Public-surface DX/UX | Read, Grep | inline |
Each perspective emits the standard output format (Position, Confidence, Pros, Contras with severity, Context I needed, Questions). Full prompt templates and worked examples in ${CLAUDE_SKILL_DIR}/references/01-perspectives.md.
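As a rough illustration of the parallel fan-out and the standard output shape, the sketch below uses `asyncio.gather`. It is not the skill's actual spawning mechanism: `run_perspective` is a hypothetical stand-in for a subagent call, and the `TOOLS` mapping covers only three rows of the table.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class PerspectiveOutput:
    perspective: str
    position: str
    confidence: str
    pros: list = field(default_factory=list)
    contras: list = field(default_factory=list)  # (severity, text) pairs
    context_needed: list = field(default_factory=list)
    questions: list = field(default_factory=list)


# Subset of the tools column; the Outsider receives the input only.
TOOLS = {
    "Outsider": [],
    "Adversary": ["Read", "Grep", "Context7", "WebSearch"],
    "Simplifier": ["Read", "Grep"],
}


async def run_perspective(name, decision, tools):
    """Hypothetical stand-in for one subagent; returns a placeholder output."""
    await asyncio.sleep(0)  # yield control, as a real call would while waiting
    return PerspectiveOutput(
        perspective=name,
        position="UNDECIDED",
        confidence="LOW",
        questions=[f"What constraints apply to: {decision}?"],
    )


async def phase1(decision, names):
    """Launch every perspective at once and collect outputs in input order."""
    tasks = [run_perspective(n, decision, TOOLS.get(n, [])) for n in names]
    return await asyncio.gather(*tasks)


outputs = asyncio.run(phase1("Use Redis for sessions", ["Outsider", "Adversary", "Simplifier"]))
print([o.perspective for o in outputs])
```

`asyncio.gather` returns results in the order the awaitables were passed, which keeps the report's per-perspective sections stable.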
Phase 2: Cross-Debate (with Step-back Judge)
For Medium and High stakes, the orchestrator spawns a Step-back Judge alongside the perspectives. The judge is a meta-teammate that does NOT take positions on the decision; it evaluates debate quality after each round and returns one of three verdicts: CONVERGED, PARTIAL, or FULL. This replaces the previous fixed-rounds logic with adaptive deliberation.
| Verdict | Meaning | Orchestrator action |
|---|---|---|
| CONVERGED | Debate quality is sufficient; positions evidence-based; no major blind spots | Exit Phase 2 → Phase 3 |
| PARTIAL | Specific perspectives missed a specific point | Send targeted re-debate instructions only to named perspectives |
| FULL | Group missed framing / blind spot / Option C | Broadcast re-debate instructions to all perspectives |
Circuit breaker: max 3 cycles. After cycle 3 without CONVERGED → mark MAX_CYCLES_REACHED (Phase 4 downgrades confidence).
Step-back judge constraint: NEVER votes on the decision (Phase 5 excludes it from the tally). Only evaluates debate quality.
Step-back persona prompt at ${CLAUDE_SKILL_DIR}/prompts/step-back-judge.md. Full protocol, cycle thrashing prevention, premature-CONVERGED override in ${CLAUDE_SKILL_DIR}/references/06-cross-debate.md.
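The verdict handling and circuit breaker described above might look like the loop below. This is a sketch under assumptions: `judge` is a hypothetical callable returning a verdict plus the perspectives it names, and the actual re-debate round is reduced to a log entry.

```python
from __future__ import annotations

from typing import Callable

MAX_CYCLES = 3


def run_cross_debate(
    perspectives: list[str],
    judge: Callable[[int], tuple[str, list[str]]],
) -> tuple[str, list[tuple[int, str, list[str]]]]:
    """Run up to MAX_CYCLES judged cycles; return the final status and a log."""
    log = []
    for cycle in range(1, MAX_CYCLES + 1):
        verdict, named = judge(cycle)
        if verdict == "CONVERGED":
            return "CONVERGED", log
        if verdict == "PARTIAL":
            targets = [p for p in perspectives if p in named]  # targeted re-debate
        elif verdict == "FULL":
            targets = list(perspectives)  # broadcast re-debate
        else:
            raise ValueError(f"unknown verdict: {verdict!r}")
        log.append((cycle, verdict, targets))
    return "MAX_CYCLES_REACHED", log  # Phase 4 downgrades confidence


panel = ["Outsider", "Adversary", "Maintainer"]
scripted = {1: ("PARTIAL", ["Adversary"]), 2: ("CONVERGED", [])}
status, log = run_cross_debate(panel, lambda cycle: scripted[cycle])
print(status, log)
```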
Phase 3: Synthesis with 5 Techniques
Apply the 5 adversarial techniques across the post-debate perspective outputs:
| Technique | Synthesis function |
|---|---|
| Steel-Man First | Strongest case FOR the decision built from surviving pros |
| Assumption Audit | All assumptions, scored by Likelihood × Impact, flagging shared ones as suspect |
| Pre-Mortem | Failure modes with triangulation count and confidence |
| Inversion | "What guarantees the worst outcome?" mapped to proposal |
| Second-Order Effects | What does this enable / prevent at 6-12 months (3rd order if High stakes) |
Detail and examples in ${CLAUDE_SKILL_DIR}/references/02-techniques.md.
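For the Assumption Audit row, a minimal scoring sketch follows. The 1-5 scales, the reading of Likelihood as "likelihood the assumption is wrong", and the `shared_threshold` of 2 are all assumptions made for illustration; the referenced techniques file may define them differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Assumption:
    text: str
    likelihood_wrong: int  # 1-5, hypothetical scale
    impact_if_wrong: int   # 1-5, hypothetical scale
    held_by: set[str]      # perspectives relying on this assumption

    @property
    def risk(self) -> int:
        return self.likelihood_wrong * self.impact_if_wrong


def audit(assumptions: list[Assumption], shared_threshold: int = 2):
    """Rank by Likelihood x Impact; flag assumptions shared across perspectives."""
    ranked = sorted(assumptions, key=lambda a: a.risk, reverse=True)
    return [(a.text, a.risk, len(a.held_by) >= shared_threshold) for a in ranked]


rows = audit([
    Assumption("Traffic stays under current peak", 2, 5, {"Performance"}),
    Assumption("Team already knows the library", 4, 3, {"Maintainer", "Operator"}),
])
print(rows)
```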
Triangulation
When ≥2 perspectives independently surface the same finding, that finding is automatically tagged HIGH confidence and annotated [triangulated by N perspectives]. This is the central anti-sycophancy mechanism: agreement across diverse lenses is signal; agreement within one lens is noise. Caveat: triangulation can amplify shared bias — see Assumption Audit's groupthink flag and Anti-Pattern #8.
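A sketch of the tagging step, assuming findings have already been normalized to a shared key (deciding that two perspectives found "the same" thing is the hard part and is not shown here); `triangulate` is a hypothetical helper name:

```python
from collections import defaultdict


def triangulate(findings):
    """findings: iterable of (perspective, finding_key) pairs.

    Returns an annotation for every key surfaced by at least 2 distinct perspectives.
    """
    by_key = defaultdict(set)
    for perspective, key in findings:
        by_key[key].add(perspective)
    return {
        key: f"[triangulated by {len(names)} perspectives]"
        for key, names in by_key.items()
        if len(names) >= 2
    }


tags = triangulate([
    ("Operator", "no-failover-plan"),
    ("Security", "no-failover-plan"),
    ("Operator", "no-failover-plan"),  # repeated within one lens: counted once
    ("Cost Optimizer", "license-cost"),
])
print(tags)
```

Counting distinct perspectives, not raw mentions, is what keeps agreement within one lens from being promoted.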
Verification Gate
Before any finding enters the output, the producing perspective must apply the standard Anti-Hallucination decision tree:
| Check | Required for |
|---|---|
| Glob | Claiming a file exists |
| Grep / Read | Claiming a function or pattern exists |
| Context7 / WebSearch | Claiming version-specific behaviour, recent CVEs, benchmark numbers |
| LSP hover | Claiming a type signature |
Forbidden actions:
- Inventing function signatures, file paths, CVE IDs, benchmark numbers
- Citing "industry best practice" without source
- Asserting something exists when verification was not run
If verification cannot be performed, the finding is tagged UNKNOWN and the user is told what verification is needed. Detail in ${CLAUDE_SKILL_DIR}/references/02-techniques.md (High-Risk Areas table).
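The gate can be sketched as a lookup from claim kind to the checks that may back it. The claim-kind keys and the `gate` function are hypothetical, and the sketch assumes `checks_confirmed` lists only checks that ran and actually confirmed the claim:

```python
# Claim kinds (hypothetical keys) mapped to the checks from the table above.
REQUIRED_CHECKS = {
    "file_exists": {"Glob"},
    "function_or_pattern_exists": {"Grep", "Read"},
    "version_behaviour_cve_benchmark": {"Context7", "WebSearch"},
    "type_signature": {"LSP hover"},
}


def gate(claim_kind, checks_confirmed):
    """Return VERIFIED, or an UNKNOWN tag naming the verification still needed."""
    accepted = REQUIRED_CHECKS[claim_kind]
    if accepted & set(checks_confirmed):
        return "VERIFIED"
    needed = " or ".join(sorted(accepted))
    return f"UNKNOWN: needs {needed} verification"


print(gate("type_signature", ["Grep"]))
```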
Confidence Per Finding
| Tag | Criterion |
|---|---|
| HIGH | Verified by tool output OR triangulated by ≥2 perspectives |
| MEDIUM | Partial / related data; informed inference |
| LOW | Inference only, not verified, not triangulated |
| UNKNOWN | Not verified; user must answer or run verification |
Triangulation can promote a tag from MEDIUM → HIGH automatically. The promoted finding is annotated with the triangulation count. If the cross-debate was marked FAILED, triangulation promotion is suppressed (since the agreement may be groupthink, not signal).
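The promotion rule can be sketched as follows. It implements only the MEDIUM to HIGH promotion stated above and leaves other tags untouched, which is an interpretive assumption; `promote` is a hypothetical name.

```python
def promote(base_tag, triangulated_by, debate_failed=False):
    """Apply triangulation promotion to a finding's confidence tag.

    Returns (tag, annotation); annotation is None when no promotion happened.
    """
    if base_tag == "MEDIUM" and triangulated_by >= 2 and not debate_failed:
        return "HIGH", f"[triangulated by {triangulated_by} perspectives]"
    return base_tag, None


print(promote("MEDIUM", 3))
print(promote("MEDIUM", 3, debate_failed=True))
```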
Phase 4: Validation (Blocking Quality Gate)
After Phase 3 synthesis and before Phase 5 (final recommendation vote), the orchestrator runs a structured validation gate. Per Commandment IV ("intention isn't enough"), this is a blocking gate: any failed sub-check must be reported in the output. Silent suppression of a failed sub-check is anti-pattern #12.
| Sub-gate | Question | Pass criterion |
|---|---|---|
| 4.1 Findings Validation | Does every HIGH-tagged finding cite verifiable evidence? Does every UNKNOWN explicitly state what verification is missing? | All HIGH findings have a tool-output citation or [triangulated by N]; all UNKNOWN findings have a "needs X verification" annotation |
| 4.2 Debate Validation | Did Phase 2 produce real movement, or was it theatre? | ≥1 Position change OR ≥2 new pros/contras OR debate marked FAILED with reason |
| 4.3 Verdict Validation | Is the verdict actionable? Are monitoring signals measurable? Are invalidating conditions concrete? | Verdict has ≥1 concrete next step; ≥2 monitoring signals are measurable (numeric threshold or boolean event); ≥1 invalidating condition is named |
| 4.4 Self-Meta Check | Form audit (the 5-question check below) | All 5 pass OR failures are explicitly listed |
| 4.5 Pass/Fail Gate | Any sub-check failed? | If yes, output emits a Validation Report block stating which sub-checks failed and why. The verdict confidence is automatically downgraded one tier (HIGH→MEDIUM, MEDIUM→LOW) |
Why this exists: the previous phases produce content; this phase guarantees the content meets a quality bar before reaching the user. Without it, the skill can produce confident-sounding output that hides shaky findings, fake debates, or unmeasurable monitoring signals.
Failure-to-report is itself a failure: if Phase 4 finds an issue and the orchestrator does not surface it in the output, that is a violation of Commandment I (radical honesty) and Commandment IV (blocking gates). Honest failure beats silent pass.
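A minimal sketch of sub-gate 4.5, assuming each earlier sub-check is reduced to a pass flag plus a reason. `validation_gate` is a hypothetical name, and leaving a LOW tag unchanged follows from the table listing only HIGH→MEDIUM and MEDIUM→LOW:

```python
from __future__ import annotations

DOWNGRADE = {"HIGH": "MEDIUM", "MEDIUM": "LOW"}


def validation_gate(
    sub_checks: dict[str, tuple[bool, str]],
    confidence: str,
) -> tuple[str, list[str]]:
    """Return (verdict confidence, Validation Report lines for failed sub-checks)."""
    report = [
        f"{name}: FAILED ({reason})"
        for name, (passed, reason) in sub_checks.items()
        if not passed
    ]
    if not report:
        return confidence, []
    # Any failure is surfaced in the report and costs one confidence tier.
    return DOWNGRADE.get(confidence, confidence), report


checks = {
    "4.1 Findings": (True, ""),
    "4.2 Debate": (False, "no movement and not marked FAILED"),
    "4.3 Verdict": (True, ""),
    "4.4 Self-Meta": (True, ""),
}
print(validation_gate(checks, "HIGH"))
```

The report list is returned together with the downgraded tag so that a caller cannot apply the downgrade while dropping the reasons.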
Phase 5: Final Recommendation with Per-Perspective Vote
After Phase 4 passes (or honestly reports failures), the orchestrator drafts a final recommendation and broadcasts it back to the same perspectives that participated in Phase 1/2. Each perspective casts a vote with a 1-2 sentence reason. This preserves dissent, surfaces real consensus level, and prevents the orchestrator from "deciding alone" after all the upstream work.
Note: The Step-back Judge does NOT vote in Phase 5. It is a meta-evaluator, not a panelist. The vote tally only includes the perspectives that produced positions on the decision (5/8/11/12 according to the stakes tier), EXCLUDING the step-back judge.
| Sub-step | What happens |
|---|---|
| 5.1 Draft recommendation | Orchestrator drafts the final recommendation grounded in Phase 3 synthesis and Phase 4 validation outcome |
| 5.2 Broadcast | Orchestrator sends the draft to the N perspectives that participated (no new spawns; same agents, same context) |
| 5.3 Per-perspective vote | Each perspective returns: vote (SUPPORT / OPPOSE / CONDITIONAL / ABSTAIN) + 1-2 sentence reason. CONDITIONAL must name the condition. ABSTAIN must name what is missing |
| 5.4 Aggregate | Orchestrator builds a vote tally and transcribes dissenting reasons verbatim (not summarized) |
| 5.5 Consensus level | Computed and stamped on the final verdict: Strong consensus (≥80% SUPPORT), Mixed (50% to under 80% SUPPORT, no OPPOSE), Weak consensus (50% to under 80% SUPPORT with at least one OPPOSE), No consensus (<50% SUPPORT) |
Why this exists: in earlier phases, perspectives produced opinions on the decision; here they produce opinions on the recommendation built from those opinions. These are different objects. A perspective can support its own findings yet oppose the orchestrator's framing of the recommendation.
Honesty rule: dissenting votes are transcribed verbatim. Suppressing or paraphrasing dissent to make consensus look stronger is anti-pattern #13.
Confidence interaction with Phase 4: if Phase 4 already downgraded confidence one tier, and Phase 5 returns weak/no consensus, the verdict tag is downgraded a second tier (LOW or UNKNOWN) and the recommendation is automatically reframed as "Investigate first" with the dissent reasons as the questions to investigate.
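The consensus computation and its interaction with Phase 4 can be sketched as below. Two assumptions are made: the SUPPORT percentage is taken over all votes cast (including CONDITIONAL and ABSTAIN), and exactly 80% counts as Strong. The function names are hypothetical.

```python
from collections import Counter

SECOND_DOWNGRADE = {"MEDIUM": "LOW", "LOW": "UNKNOWN"}


def consensus_level(votes):
    """votes: list of SUPPORT / OPPOSE / CONDITIONAL / ABSTAIN strings."""
    if not votes:
        raise ValueError("no votes cast")
    tally = Counter(votes)
    support = tally["SUPPORT"] / len(votes)
    if support >= 0.8:
        return "Strong consensus"
    if support >= 0.5:
        return "Weak consensus" if tally["OPPOSE"] else "Mixed"
    return "No consensus"


def final_confidence(tag_after_phase4, phase4_downgraded, level):
    """Apply the second downgrade; returns (tag, reframed verdict or None)."""
    if phase4_downgraded and level in ("Weak consensus", "No consensus"):
        return SECOND_DOWNGRADE.get(tag_after_phase4, tag_after_phase4), "Investigate first"
    return tag_after_phase4, None


votes = ["SUPPORT"] * 5 + ["OPPOSE"] + ["CONDITIONAL"] * 2
level = consensus_level(votes)
print(level, final_confidence("MEDIUM", True, level))
```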
Output Format
The final report follows a structured template scaled by stakes (Low omits Pre-Mortem, Inversion, Second-Order; Medium omits Inversion, Second-Order; High includes everything). Sections:
- Stakes Tier (with reasoning)
- Initial Triage (questions + answers, or "skipped")
- Framing Check
- Phase 1 outputs (per perspective)
- Phase 2 Debate Summary (Medium+ only)
- Phase 3 Synthesis (techniques applied per stakes tier, up to 5)
- Tradeoff Map (Medium+ stakes only) — matrix view of dimensions × options × weight × confidence
- Triangulated Findings
- Core Tension — the unresolved disagreement, if any
- Decision Guide (if/then table)
- Verdict (Proceed / Proceed with conditions / Investigate first / Reject)
- Confidence Calibration (Key Assumptions + Conditions Invalidating + Monitoring Signals)
- Phase 4 Validation Report (4.1 Findings + 4.2 Debate + 4.3 Verdict + 4.4 Self-Meta Check + 4.5 Overall Gate)
- Phase 5 Final Recommendation: vote tally + verbatim dissents + Consensus level (qualitative) + Decision Confidence Score (numeric, 5 signals)
- Outstanding questions for the user
Literal template + 2 worked examples (Low and High) in ${CLAUDE_SKILL_DIR}/references/05-output-template.md.
Self-Meta Check
Sub-gate 4.4 of Phase 4: before proceeding to Phase 5, run the 5-question audit:
- Steel-Man genuinely strong? (a thoughtful supporter would feel represented)
- No vague doom? (every Critical/Major finding has trigger + evidence + mitigation)
- Real diversity? (perspectives actually disagree on something)
- Outsider stayed isolated? (no file paths, class names, or absorbed framing)
- Debate produced movement? (≥1 Position change OR ≥2 new pros/contras; or honestly reported as FAILED)
Failed checks must be reported in the output. Silent suppression of a failed check is itself an anti-pattern (failure-to-report).
Anti-Patterns (Summary)
| # | Anti-pattern | Mitigation |
|---|---|---|
| 1 | Contrarianism for its own sake | Disagree only with concrete reason |
| 2 | Nihilism (no verdict) | End with directional verdict |
| 3 | Straw-manning | Quote proposal verbatim before critique |
| 4 | Reverse confirmation bias | Update Position when evidence demands |
| 5 | Vague doom | Trigger + early warning + blast radius required |
| 6 | Personality critique | Engage proposal, not proposer |
| 7 | Objection without alternative | Every Critical/Major contra → mitigation/alternative |
| 8 | Groupthink across perspectives | Outsider as canary; spawn orthogonal Phase 1.5 |
| 9 | Outsider contamination | Restricted tools; reject if file paths appear |
| 10 | Linus caricature | Persona must be pragmatic-brutal WITH analysis |
| 11 | Fake debate | If 0 changes + 0 new findings → mark FAILED |
| 12 | Failure-to-report validation failure | Phase 4 failed sub-checks must appear explicitly in Validation Report; downgrade confidence one tier |
| 13 | Suppressed dissent | Verbatim transcription of CONDITIONAL/OPPOSE reasons; mechanical consensus level computation |
| 14 | Step-back capture | Judge stays meta; if it argues for/against, override and re-prompt |
| 15 | Karpathy/Linus overlap | If both sound the same, voices weren't differentiated; verify the persona-specific output addenda differ |
| 16 | Simplifier nihilism | "Delete everything" without working alternative — Simplifier must propose minimal-viable version |
| 17 | Cost Optimizer false economy | Cutting critical dimensions (security/reliability/correctness) to save cents — flag any cost cut affecting these dimensions |
Full detection signals and corrections in ${CLAUDE_SKILL_DIR}/references/04-anti-patterns.md.
Content Map
| Topic | File |
|---|---|
| 11 perspectives + 1 optional, lenses, prompts, examples | ${CLAUDE_SKILL_DIR}/references/01-perspectives.md |
| 5 adversarial techniques as synthesis phase | ${CLAUDE_SKILL_DIR}/references/02-techniques.md |
| Stakes calibration matrix and edge cases | ${CLAUDE_SKILL_DIR}/references/03-stakes-calibration.md |
| 17 anti-patterns + Self-Meta Check audit | ${CLAUDE_SKILL_DIR}/references/04-anti-patterns.md |
| Output template + 2 worked examples (Low and High) | ${CLAUDE_SKILL_DIR}/references/05-output-template.md |
| Cross-debate protocol, rounds, late perspective | ${CLAUDE_SKILL_DIR}/references/06-cross-debate.md |
| Outsider agent prompt (isolated, no tools) | ${CLAUDE_SKILL_DIR}/prompts/outsider-agent.md |
| Linus Torvalds persona prompt | ${CLAUDE_SKILL_DIR}/prompts/linus-agent.md |
| Karpathy persona prompt | ${CLAUDE_SKILL_DIR}/prompts/karpathy-agent.md |
| Step-back judge persona | ${CLAUDE_SKILL_DIR}/prompts/step-back-judge.md |
Integration with Other Skills
| Skill | Relationship |
|---|---|
| decide | Generates 3-perspective options for a decision; decision-stress-test stress-tests ONE option with 5-12 perspectives + debate. Use decide to choose an option, then decision-stress-test before committing |
| anti-hallucination | Source of the Verification Gate and confidence levels; this skill inherits and applies them inside each perspective |
| review-patterns | Different scope: review-patterns reviews implementation; decision-stress-test reviews the decision before implementation |
| security-review | Different scope: security-review audits existing code; decision-stress-test includes Security as one perspective among 11 when stakes are High |
| prompt-engineer | If the user's input to this skill is too vague to triage, the invoker may use prompt-engineer to refine before spawning perspectives |
Commandments Covered
| # | How |
|---|---|
| I | Adversarial perspectives counter sycophancy — the decision is challenged, not validated |
| II | Per-finding anti-hallucination Verification Gate; no speculative objections |
| V | Stress-tests the decision (understand) before commitment, not the implementation |
| VII | Perspective count is stakes-calibrated (5-12) — no fixed over-spend on low-stakes calls |