decision-stress-test - SKILL.md Agent Skill

name: decision-stress-test
description: |
  Stress-tests a technical decision before commitment, using 5-12 parallel perspectives (Outsider, Adversary, Simplifier, Maintainer, Linus, Cost, Performance, Operator, Karpathy, Security, Product, User) with cross-debate rounds, plus 5 adversarial techniques (Steel-Man, Assumption Audit, Pre-Mortem, Inversion, Second-Order), calibrated by stakes, with anti-hallucination verification on every finding. Use it when: you are about to commit to an architecture/library choice, you want to challenge a proposed solution, you suspect agreement bias, or the request contains "cuestiona esta decisión", "abogado del diablo", "pre-mortem", "estoy seguro de esto". Keywords - stress-test, challenge this decision, devils-advocate, contrarian analysis, steel-man, pre-mortem, before-deciding, before-committing, antes-de-decidir, cuestiona esta decisión, stress-testea
disable-model-invocation: false
argument-hint: ""
effort: xhigh
when_to_use: |
  "cuestiona esta decisión", "stress-test esto", "abogado del diablo", "pre-mortem", "estoy seguro de esto", "challenge this decision", "devil's advocate"

Decision Stress-Test

Multi-perspective adversarial analysis of a technical decision before commitment. Spawns 5-12 perspectives in parallel (depending on stakes), runs an adaptive cross-debate gated by a Step-back Judge (max 3 cycles), and synthesizes with 5 adversarial techniques. Anti-hallucination verification is a gate within the process.

Underlying Principle

"If everyone is thinking alike, then somebody isn't thinking." — Tenth Man Rule

Sycophancy is the silent killer of engineering decisions. Hallucination is its loud cousin. This skill builds structural disagreement into the analysis: at least one perspective (Adversary) is forced to dissent; one (Outsider) cannot be contaminated by context; one (Linus) refuses unexplained complexity by construction. The aim is not contrarianism for its own sake — it is to make sure the decision survives a real challenge before commitment.

When to Use

Trigger	Example
About to commit to architecture / library / framework	"Going to use Redis for sessions"
Want to challenge a proposed solution	"Claude proposed X — does it hold up?"
Suspect agreement bias	"Every time I ask, Claude agrees. Stress-test this."
Pre-PR / pre-merge for non-trivial design	"Before I open the PR, stress-test this design"

When to Skip

Anti-trigger	Why
Mechanical / trivial change	Renames, formatting, fixing a typo — no decision to stress-test
Decision already made and committed	Stress-testing after commitment is rationalization, not analysis
Pure debugging / fix	No alternatives to weigh — find the bug, fix it
Question is exploratory ("what is X?")	Use docs / Context7, not stress-test
Stakes are zero (e.g., personal preference in a one-off script)	Stress-test cost > decision cost
Decision is reversible / low-stakes / want quick scan	Use `decide` (3 perspectives, ~500-800 tokens), which is roughly 3-4× cheaper

Initial Triage

If the input is ambiguous, the invoker (the Lead) makes 1-4 AskUserQuestion calls before Phase 1. Subagents cannot ask the user directly; they include their questions in their output, and the invoker consolidates them.

Question	When to ask
"What is the exact decision (pick one of N)?"	Input names a topic but not a choice
"What alternatives are on the table?"	Only one option mentioned
"What's the reversibility / time horizon?"	Stakes tier unclear
"What constraints (budget, deadline, team size, compliance)?"	Constraints missing

If the conversation already has the context (the user just discussed it), skip the triage to avoid redundancy.

Stakes Calibration

Stakes	Perspectives	Step-back judge	Techniques
Low	5 (Outsider + Adversary + Maintainer + Linus + Simplifier)	OFF (no Phase 2)	Steel-Man + Assumption Audit
Medium	8 (+ Performance + Operator + Cost Optimizer)	ON (max 3 cycles, adaptive)	+ Pre-Mortem
High	11 (+ Security + Product + Karpathy)	ON (max 3 cycles, adaptive)	All 5
High with UX	12 (+ User)	ON (max 3 cycles, adaptive)	All 5

Default-on-uncertainty: Medium. Detailed criteria, edge cases, and 6 worked classifications in ${CLAUDE_SKILL_DIR}/references/03-stakes-calibration.md.
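The calibration table can be read as a lookup from stakes tier to a run plan. The sketch below is one possible encoding, not the skill's actual implementation: the names `StakesPlan` and `plan_for` are hypothetical, and treating any unrecognized tier string as Medium is an assumption derived from the default-on-uncertainty rule.

```python
from dataclasses import dataclass

# Panels copied from the Stakes Calibration table.
LOW = ("Outsider", "Adversary", "Maintainer", "Linus", "Simplifier")
MEDIUM = LOW + ("Performance", "Operator", "Cost Optimizer")
HIGH = MEDIUM + ("Security", "Product", "Karpathy")

TECHNIQUES = ("Steel-Man", "Assumption Audit", "Pre-Mortem",
              "Inversion", "Second-Order")


@dataclass(frozen=True)
class StakesPlan:  # hypothetical container, not part of the skill
    perspectives: tuple
    judge_enabled: bool
    max_cycles: int
    techniques: tuple


def plan_for(stakes=None, has_ux=False):
    """Map a stakes tier to a run plan; unknown or missing tiers become Medium."""
    tier = (stakes or "medium").strip().lower()
    if tier == "low":
        # Low: no Phase 2, so the judge is off and there are no cycles.
        return StakesPlan(LOW, False, 0, TECHNIQUES[:2])
    if tier == "high":
        panel = HIGH + ("User",) if has_ux else HIGH
        return StakesPlan(panel, True, 3, TECHNIQUES)
    # Assumption: anything else falls back to Medium (default-on-uncertainty).
    return StakesPlan(MEDIUM, True, 3, TECHNIQUES[:3])
```

For example, `plan_for("high", has_ux=True)` yields the 12-perspective panel, while `plan_for(None)` yields the Medium plan.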

Framing Check

Before spawning Phase 1, ask explicitly:

Are we solving the right problem? (Sometimes the framed decision is the wrong question.)
Is there an Option C the proposal didn't consider?
Is the implicit "do nothing" path already evaluated?

This is a 30-second pause that catches a non-trivial fraction of bad framings. Output the answers in the report.

Phase 1: Perspectives in Parallel

Spawn N perspectives (per stakes tier) as general-purpose subagents in parallel — single message, N tool calls. Each receives a tightly scoped prompt:

Perspective	Lens	Tools	Prompt source
Outsider	First principles, no project context	Input only — NO Read/Grep/Web	`${CLAUDE_SKILL_DIR}/prompts/outsider-agent.md`
Adversary	Devil's advocate; 5 techniques; Tenth Man	Read, Grep, Context7, WebSearch	inline (see `01-perspectives.md`)
Performance	Throughput, latency, memory	Read, Grep, Context7, WebSearch	inline
Security	OWASP, attack surface, supply chain	Read, Grep, Context7, WebSearch	inline
Maintainer	Tech debt, ergonomics, testability	Read, Grep, Context7	inline
Simplifier	"What can we remove?" YAGNI lens	Read, Grep	inline (see `01-perspectives.md`)
Operator	Deploy, observability, runtime	Read, Grep, Context7, WebSearch	inline
Cost Optimizer	$$$, TCO, opportunity cost, 80/20 alternatives	Read, Context7, WebSearch	inline
Product	Value, opportunity cost, roadmap	Read (docs), WebSearch	inline
Linus Torvalds	Pragmatic-brutal with technical analysis	Read, Grep	`${CLAUDE_SKILL_DIR}/prompts/linus-agent.md`
Karpathy	AI-friendliness, modern stack pragmatism, build-observe-iterate	Read, Grep, Context7, WebSearch	`${CLAUDE_SKILL_DIR}/prompts/karpathy-agent.md`
User (optional)	Public-surface DX/UX	Read, Grep	inline

Each perspective emits the standard output format (Position, Confidence, Pros, Contras with severity, Context I needed, Questions). Full prompt templates and worked examples in ${CLAUDE_SKILL_DIR}/references/01-perspectives.md.
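The Tools column amounts to a per-perspective allowlist. A minimal sketch of that idea follows; the `TOOLS` mapping and `is_allowed` helper are hypothetical names, and the entry for Product simplifies "Read (docs)" to plain `Read`.

```python
# Allowlists transcribed from the Phase 1 table. Tool names are the table's labels.
FULL = frozenset({"Read", "Grep", "Context7", "WebSearch"})

TOOLS = {
    "Outsider": frozenset(),  # input only: no Read/Grep/Web
    "Adversary": FULL,
    "Performance": FULL,
    "Security": FULL,
    "Maintainer": frozenset({"Read", "Grep", "Context7"}),
    "Simplifier": frozenset({"Read", "Grep"}),
    "Operator": FULL,
    "Cost Optimizer": frozenset({"Read", "Context7", "WebSearch"}),
    "Product": frozenset({"Read", "WebSearch"}),  # table restricts Read to docs
    "Linus": frozenset({"Read", "Grep"}),
    "Karpathy": FULL,
    "User": frozenset({"Read", "Grep"}),
}


def is_allowed(perspective, tool):
    """Return True only if the table grants this tool to this perspective."""
    return tool in TOOLS.get(perspective, frozenset())
```

An unknown perspective gets no tools, which keeps the default restrictive.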

Phase 2: Cross-Debate (with Step-back Judge)

For Medium and High stakes, the orchestrator spawns a Step-back Judge alongside the perspectives. The judge is a meta-teammate that does NOT take positions on the decision; it evaluates debate quality after each round and returns one of three verdicts: CONVERGED, PARTIAL, or FULL. This replaces the previous fixed-rounds logic with adaptive deliberation.

Verdict	Meaning	Orchestrator action
CONVERGED	Debate quality is sufficient; positions evidence-based; no major blind spots	Exit Phase 2 → Phase 3
PARTIAL	Specific perspectives missed a specific point	Send targeted re-debate instructions only to named perspectives
FULL	Group missed framing / blind spot / Option C	Broadcast re-debate instructions to all perspectives

Circuit breaker: max 3 cycles. After cycle 3 without CONVERGED → mark MAX_CYCLES_REACHED (Phase 4 downgrades confidence).

Step-back judge constraint: NEVER votes on the decision (Phase 5 excludes it from the tally). Only evaluates debate quality.

Step-back persona prompt at ${CLAUDE_SKILL_DIR}/prompts/step-back-judge.md. Full protocol, cycle thrashing prevention, premature-CONVERGED override in ${CLAUDE_SKILL_DIR}/references/06-cross-debate.md.
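The verdict handling and circuit breaker can be sketched as a bounded loop. This is an illustration under stated assumptions, not the protocol in `06-cross-debate.md`: `judge` and `redebate` are hypothetical callables supplied by the orchestrator, a "cycle" is taken to mean one debate round followed by one judge evaluation, and no re-debate is issued after the third evaluation.

```python
MAX_CYCLES = 3  # circuit breaker from the text


def run_cross_debate(judge, redebate, panel):
    """Drive Phase 2 until CONVERGED or the cycle limit.

    judge(cycle) -> (verdict, named_perspectives)   # hypothetical callable
    redebate(targets, cycle) -> None                # hypothetical callable
    Returns (final_status, cycles_used).
    """
    for cycle in range(1, MAX_CYCLES + 1):
        verdict, named = judge(cycle)
        if verdict == "CONVERGED":
            return "CONVERGED", cycle
        if verdict == "PARTIAL":
            # Targeted re-debate: only the perspectives the judge named.
            targets = [p for p in panel if p in named]
        elif verdict == "FULL":
            # Broadcast re-debate to the whole panel.
            targets = list(panel)
        else:
            raise ValueError(f"unknown verdict: {verdict!r}")
        if cycle < MAX_CYCLES:
            redebate(targets, cycle)
    # Phase 4 is expected to downgrade confidence when this status is set.
    return "MAX_CYCLES_REACHED", MAX_CYCLES
```

The judge never appears in `panel`, which mirrors the constraint that it evaluates debate quality and takes no position.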

Phase 3: Synthesis with 5 Techniques

Apply the 5 adversarial techniques across the post-debate perspective outputs:

Technique	Synthesis function
Steel-Man First	Strongest case FOR the decision built from surviving pros
Assumption Audit	All assumptions, scored by Likelihood × Impact, flagging shared ones as suspect
Pre-Mortem	Failure modes with triangulation count and confidence
Inversion	"What guarantees the worst outcome?" mapped to proposal
Second-Order Effects	What does this enable / prevent at 6-12 months (3rd order if High stakes)

Detail and examples in ${CLAUDE_SKILL_DIR}/references/02-techniques.md.
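As an illustration of the Assumption Audit row, the sketch below scores each assumption by Likelihood × Impact and flags shared ones. Several details are assumptions not stated in the source: the 1-5 scales, reading Likelihood as "likelihood the assumption is wrong", and treating an assumption as shared when two or more perspectives hold it. The `Assumption` class and `audit` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:  # hypothetical record
    text: str
    likelihood_wrong: int  # 1-5, assumed scale
    impact: int            # 1-5, assumed scale
    held_by: tuple         # perspectives relying on this assumption

    @property
    def score(self):
        return self.likelihood_wrong * self.impact

    @property
    def shared(self):
        # Shared assumptions are flagged as suspect (possible groupthink).
        return len(set(self.held_by)) >= 2


def audit(assumptions):
    """Return assumptions ordered from highest to lowest risk score."""
    return sorted(assumptions, key=lambda a: a.score, reverse=True)


# Illustrative inputs only, not taken from a real run.
items = [
    Assumption("Traffic stays within one node's capacity", 2, 5, ("Performance",)),
    Assumption("Team can operate the new dependency", 4, 4, ("Operator", "Maintainer")),
]
ranked = audit(items)
```

Here the second item ranks first (score 16 against 10) and is also flagged as shared.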

Triangulation

When ≥2 perspectives independently surface the same finding, that finding is automatically tagged HIGH confidence and annotated [triangulated by N perspectives]. This is the central anti-sycophancy mechanism: agreement across diverse lenses is signal; agreement within one lens is noise. Caveat: triangulation can amplify shared bias — see Assumption Audit's groupthink flag and Anti-Pattern #8.
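The counting rule can be sketched as grouping findings by a shared key and counting distinct perspectives. The hard part in practice, deciding that two differently worded findings are "the same", is assumed away here by a pre-normalized `finding_key`; the function name `triangulate` is hypothetical.

```python
from collections import defaultdict


def triangulate(findings):
    """findings: iterable of (perspective, finding_key) pairs.

    Returns {finding_key: annotation} for findings raised by >= 2 distinct
    perspectives. Repeats from the same perspective count once, since
    agreement within one lens is treated as noise.
    """
    by_key = defaultdict(set)
    for perspective, key in findings:
        by_key[key].add(perspective)
    return {
        key: f"[triangulated by {len(who)} perspectives]"
        for key, who in by_key.items()
        if len(who) >= 2
    }
```

Using a set per key is the design choice that matters: one perspective repeating itself never triggers the HIGH tag.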

Verification Gate

Before any finding enters the output, the producing perspective must apply the standard Anti-Hallucination decision tree:

Check	Required for
Glob	Claiming a file exists
Grep / Read	Claiming a function or pattern exists
Context7 / WebSearch	Claiming version-specific behaviour, recent CVEs, benchmark numbers
LSP `hover`	Claiming a type signature

Forbidden actions:

Inventing function signatures, file paths, CVE IDs, benchmark numbers
Citing "industry best practice" without source
Asserting something exists when verification was not run

If verification cannot be performed, the finding is tagged UNKNOWN and the user is told what verification is needed. Detail in ${CLAUDE_SKILL_DIR}/references/02-techniques.md (High-Risk Areas table).
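A minimal sketch of the gate, assuming each finding records which checks were actually run. The claim-kind keys (`file_exists`, `symbol_exists`, `external_fact`, `type_signature`) and the `gate` function are hypothetical labels for the four table rows; "Grep / Read" and "Context7 / WebSearch" are read as "either one suffices".

```python
# Accepted checks per claim kind, transcribed from the Verification Gate table.
REQUIRED_CHECKS = {
    "file_exists": {"Glob"},
    "symbol_exists": {"Grep", "Read"},
    "external_fact": {"Context7", "WebSearch"},  # versions, CVEs, benchmarks
    "type_signature": {"LSP hover"},
}


def gate(claim_kind, checks_run):
    """Return ("VERIFIED", None) or ("UNKNOWN", what-is-needed)."""
    accepted = REQUIRED_CHECKS[claim_kind]
    if accepted & set(checks_run):
        return "VERIFIED", None
    # Never assert existence without a check: tag UNKNOWN and say what is missing.
    needed = " or ".join(sorted(accepted))
    return "UNKNOWN", f"needs {needed} verification"
```

Running the wrong check does not count: `gate("symbol_exists", ["Glob"])` still returns UNKNOWN.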

Confidence Per Finding

Tag	Criterion
HIGH	Verified by tool output OR triangulated by ≥2 perspectives
MEDIUM	Partial / related data; informed inference
LOW	Inference only, not verified, not triangulated
UNKNOWN	Not verified; user must answer or run verification

Triangulation can promote a tag from MEDIUM → HIGH automatically. The promoted finding is annotated with the triangulation count. If the cross-debate was marked FAILED, triangulation promotion is suppressed (since the agreement may be groupthink, not signal).
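The promotion rule in the paragraph above can be written as a small function. This sketch models only what that paragraph states (MEDIUM → HIGH, suppressed when the debate is FAILED) and leaves other starting tags untouched, which is a conservative reading rather than a confirmed rule. The name `promote` is hypothetical.

```python
def promote(base_tag, triangulated_by, debate_failed):
    """Apply triangulation promotion to one finding.

    Returns (tag, annotation). The annotation is None when no promotion happens.
    """
    if debate_failed:
        # Agreement after a FAILED debate may be groupthink, not signal.
        return base_tag, None
    if base_tag == "MEDIUM" and triangulated_by >= 2:
        return "HIGH", f"[triangulated by {triangulated_by} perspectives]"
    return base_tag, None
```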

Phase 4: Validation (Blocking Quality Gate)

After Phase 3 synthesis and before Phase 5 (final recommendation vote), the orchestrator runs a structured validation gate. Per Commandment IV ("intention isn't enough"), this is a blocking gate: any failed sub-check must be reported in the output. Silent suppression of a failed sub-check is anti-pattern #12.

Sub-gate	Question	Pass criterion
4.1 Findings Validation	Does every HIGH-tagged finding cite verifiable evidence? Does every UNKNOWN explicitly state what verification is missing?	All HIGH findings have a tool-output citation or `[triangulated by N]`; all UNKNOWN findings have a "needs X verification" annotation
4.2 Debate Validation	Did Phase 2 produce real movement, or was it theatre?	≥1 Position change OR ≥2 new pros/contras OR debate marked FAILED with reason
4.3 Verdict Validation	Is the verdict actionable? Are monitoring signals measurable? Are invalidating conditions concrete?	Verdict has ≥1 concrete next step; ≥2 monitoring signals are measurable (numeric threshold or boolean event); ≥1 invalidating condition is named
4.4 Self-Meta Check	Form audit (the 5-question check below)	All 5 pass OR failures are explicitly listed
4.5 Pass/Fail Gate	Any sub-check failed?	If yes, output emits a `Validation Report` block stating which sub-checks failed and why. The verdict confidence is automatically downgraded one tier (HIGH→MEDIUM, MEDIUM→LOW)

Why this exists: the previous phases produce content; this phase guarantees the content meets a quality bar before reaching the user. Without it, the skill can produce confident-sounding output that hides shaky findings, fake debates, or unmeasurable monitoring signals.

Failure-to-report is itself a failure: if Phase 4 finds an issue and the orchestrator does not surface it in the output, that is a violation of Commandment I (radical honesty) and Commandment IV (blocking gates). Honest failure beats silent pass.
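One way to picture sub-gates 4.2 and 4.5 is sketched below. The function names and the report layout are hypothetical; the source specifies only that failures appear in a `Validation Report` block and that confidence drops one tier. Keeping LOW at LOW is an assumption, since the source lists only HIGH→MEDIUM and MEDIUM→LOW for this gate.

```python
LADDER = ["HIGH", "MEDIUM", "LOW"]


def debate_moved(position_changes, new_points, failed_reason=None):
    """Sub-gate 4.2: real movement, or an honest FAILED mark with a reason."""
    return position_changes >= 1 or new_points >= 2 or bool(failed_reason)


def phase4_gate(sub_checks, confidence):
    """Sub-gate 4.5.

    sub_checks: {name: (passed, reason)}. Returns (confidence, report_text).
    Any failure is reported and downgrades confidence by one tier.
    """
    failures = [(name, why) for name, (passed, why) in sub_checks.items() if not passed]
    if not failures:
        return confidence, ""
    i = LADDER.index(confidence)
    downgraded = LADDER[min(i + 1, len(LADDER) - 1)]  # assumption: LOW stays LOW
    lines = ["Validation Report"] + [f"- {name} FAILED: {why}" for name, why in failures]
    return downgraded, "\n".join(lines)
```

The function has no path that downgrades without also producing report text, so a failure cannot be silently suppressed.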

Phase 5: Final Recommendation with Per-Perspective Vote

After Phase 4 passes (or honestly reports failures), the orchestrator drafts a final recommendation and broadcasts it back to the same perspectives that participated in Phase 1/2. Each perspective casts a vote with a 1-2 sentence reason. This preserves dissent, surfaces real consensus level, and prevents the orchestrator from "deciding alone" after all the upstream work.

Note: The Step-back Judge does NOT vote in Phase 5. It is a meta-evaluator, not a panelist. The vote tally only includes the perspectives that produced positions on the decision (5/8/11/12 according to the stakes tier), EXCLUDING the step-back judge.

Sub-step	What happens
5.1 Draft recommendation	Orchestrator drafts the final recommendation grounded in Phase 3 synthesis and Phase 4 validation outcome
5.2 Broadcast	Orchestrator sends the draft to the N perspectives that participated (no new spawns; same agents, same context)
5.3 Per-perspective vote	Each perspective returns: vote (`SUPPORT` / `OPPOSE` / `CONDITIONAL` / `ABSTAIN`) + 1-2 sentence reason. CONDITIONAL must name the condition. ABSTAIN must name what is missing
5.4 Aggregate	Orchestrator builds a vote tally and transcribes dissenting reasons verbatim (not summarized)
5.5 Consensus level	Computed and stamped on the final verdict: `Strong consensus` (≥80% SUPPORT), `Mixed` (50% to <80% SUPPORT, no OPPOSE), `Weak consensus` (50% to <80% SUPPORT with at least one OPPOSE), `No consensus` (<50% SUPPORT)

Why this exists: in earlier phases, perspectives produced opinions on the decision; here they produce opinions on the recommendation built from those opinions. These are different objects. A perspective can support its own findings yet oppose the orchestrator's framing of the recommendation.

Honesty rule: dissenting votes are transcribed verbatim. Suppressing or paraphrasing dissent to make consensus look stronger is anti-pattern #13.
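The consensus level in sub-step 5.5 is meant to be computed mechanically, which might look like the sketch below. Two points are assumptions: the SUPPORT percentage is taken over all votes cast (including CONDITIONAL and ABSTAIN), and exactly 80% counts as Strong. The function name `consensus_level` is hypothetical, and the caller is expected to leave the Step-back Judge out of `votes`.

```python
from collections import Counter

VALID_VOTES = {"SUPPORT", "OPPOSE", "CONDITIONAL", "ABSTAIN"}


def consensus_level(votes):
    """votes: {perspective: (vote, reason)}. Returns the consensus label."""
    if not votes:
        raise ValueError("no votes cast")
    for name, (vote, reason) in votes.items():
        if vote not in VALID_VOTES:
            raise ValueError(f"{name}: invalid vote {vote!r}")
        if not reason.strip():
            # Every vote needs a reason; CONDITIONAL/ABSTAIN must name what is missing.
            raise ValueError(f"{name}: vote without a reason")
    tally = Counter(vote for vote, _ in votes.values())
    support, total = tally["SUPPORT"], len(votes)
    # Integer comparison avoids floating-point edge cases at the thresholds.
    if 100 * support >= 80 * total:
        return "Strong consensus"
    if 100 * support >= 50 * total:
        return "Weak consensus" if tally["OPPOSE"] else "Mixed"
    return "No consensus"
```

Reasons are validated but never rewritten here; the verbatim text stays available for the tally.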

Confidence interaction with Phase 4: if Phase 4 already downgraded confidence one tier, and Phase 5 returns weak/no consensus, the verdict tag is downgraded a second tier (LOW or UNKNOWN) and the recommendation is automatically reframed as "Investigate first" with the dissent reasons as the questions to investigate.

Output Format

The final report follows a structured template scaled by stakes (Low omits Pre-Mortem, Inversion, Second-Order; Medium omits Inversion, Second-Order; High includes everything). Sections:

Stakes Tier (with reasoning)
Initial Triage (questions + answers, or "skipped")
Framing Check
Phase 1 outputs (per perspective)
Phase 2 Debate Summary (Medium+ only)
Phase 3 Synthesis (5 techniques applied)
Tradeoff Map (Medium+ stakes only) — matrix view of dimensions × options × weight × confidence
Triangulated Findings
Core Tension — the unresolved disagreement, if any
Decision Guide (if/then table)
Verdict (Proceed / Proceed with conditions / Investigate first / Reject)
Confidence Calibration (Key Assumptions + Conditions Invalidating + Monitoring Signals)
Phase 4 Validation Report (4.1 Findings + 4.2 Debate + 4.3 Verdict + 4.4 Self-Meta Check + 4.5 Overall Gate)
Phase 5 Final Recommendation: vote tally + verbatim dissents + Consensus level (qualitative) + Decision Confidence Score (numeric, 5 signals)
Outstanding questions for the user

Literal template + 2 worked examples (Low and High) in ${CLAUDE_SKILL_DIR}/references/05-output-template.md.

Self-Meta Check

Sub-gate 4.4 of Phase 4: before proceeding to Phase 5, run the 5-question audit:

Steel-Man genuinely strong? (a thoughtful supporter would feel represented)
No vague doom? (every Critical/Major finding has trigger + evidence + mitigation)
Real diversity? (perspectives actually disagree on something)
Outsider stayed isolated? (no file paths, class names, or absorbed framing)
Debate produced movement? (≥1 Position change OR ≥2 new pros/contras; or honestly reported as FAILED)

Failed checks must be reported in the output. Silent suppression of a failed check is itself an anti-pattern (failure-to-report).
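Of the five checks, Outsider isolation is the most mechanical, so a rough heuristic for it is sketched here. The regular expression and the extension list are illustrative assumptions, not the skill's detector; it covers file paths only and does not detect class names or absorbed framing.

```python
import re

# Heuristic: slash-separated paths with an extension, or bare filenames with a
# common source/config extension. The extension list is an illustrative guess.
PATH_LIKE = re.compile(
    r"(?:[\w.-]+/)+[\w.-]+\.\w+"
    r"|\b[\w-]+\.(?:py|ts|js|go|rs|java|md|json|ya?ml)\b"
)


def outsider_contaminated(text):
    """Return the sorted, de-duplicated path-like strings found in Outsider output."""
    return sorted(set(PATH_LIKE.findall(text)))
```

An empty result does not prove isolation; a non-empty one is a reason to reject the Outsider output and re-run it.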

Anti-Patterns (Summary)

#	Anti-pattern	Mitigation
1	Contrarianism for its own sake	Disagree only with concrete reason
2	Nihilism (no verdict)	End with directional verdict
3	Straw-manning	Quote proposal verbatim before critique
4	Reverse confirmation bias	Update Position when evidence demands
5	Vague doom	Trigger + early warning + blast radius required
6	Personality critique	Engage proposal, not proposer
7	Objection without alternative	Every Critical/Major contra → mitigation/alternative
8	Groupthink across perspectives	Outsider as canary; spawn orthogonal Phase 1.5
9	Outsider contamination	Restricted tools; reject if file paths appear
10	Linus caricature	Persona must be pragmatic-brutal WITH analysis
11	Fake debate	If 0 changes + 0 new findings → mark FAILED
12	Failure-to-report validation failure	Phase 4 failed sub-checks must appear explicitly in Validation Report; downgrade confidence one tier
13	Suppressed dissent	Verbatim transcription of CONDITIONAL/OPPOSE reasons; mechanical consensus level computation
14	Step-back capture	Judge stays meta; if it argues for/against, override and re-prompt
15	Karpathy/Linus overlap	If both sound the same, voices weren't differentiated; verify the persona-specific output addenda differ
16	Simplifier nihilism	"Delete everything" without working alternative — Simplifier must propose minimal-viable version
17	Cost Optimizer false economy	Cutting critical dimensions (security/reliability/correctness) to save cents — flag any cost cut affecting these dimensions

Full detection signals and corrections in ${CLAUDE_SKILL_DIR}/references/04-anti-patterns.md.

Content Map

Topic	File
11 perspectives + 1 optional, lenses, prompts, examples	`${CLAUDE_SKILL_DIR}/references/01-perspectives.md`
5 adversarial techniques as synthesis phase	`${CLAUDE_SKILL_DIR}/references/02-techniques.md`
Stakes calibration matrix and edge cases	`${CLAUDE_SKILL_DIR}/references/03-stakes-calibration.md`
17 anti-patterns + Self-Meta Check audit	`${CLAUDE_SKILL_DIR}/references/04-anti-patterns.md`
Output template + 2 worked examples (Low and High)	`${CLAUDE_SKILL_DIR}/references/05-output-template.md`
Cross-debate protocol, rounds, late perspective	`${CLAUDE_SKILL_DIR}/references/06-cross-debate.md`
Outsider agent prompt (isolated, no tools)	`${CLAUDE_SKILL_DIR}/prompts/outsider-agent.md`
Linus Torvalds persona prompt	`${CLAUDE_SKILL_DIR}/prompts/linus-agent.md`
Karpathy persona prompt	`${CLAUDE_SKILL_DIR}/prompts/karpathy-agent.md`
Step-back judge persona	`${CLAUDE_SKILL_DIR}/prompts/step-back-judge.md`

Integration with Other Skills

Skill	Relationship
`decide`	Generates 3-perspective options for a decision; `decision-stress-test` stress-tests ONE option with 5-12 perspectives + debate. Use `decide` to choose an option, then `decision-stress-test` before committing
`anti-hallucination`	Source of the Verification Gate and confidence levels; this skill inherits and applies them inside each perspective
`review-patterns`	Different scope: `review-patterns` reviews implementation; `decision-stress-test` reviews the decision before implementation
`security-review`	Different scope: `security-review` audits existing code; `decision-stress-test` includes Security as one perspective among 11 when stakes are High
`prompt-engineer`	If the user's input to this skill is too vague to triage, the invoker may use `prompt-engineer` to refine before spawning perspectives

Commandments Covered

#	How
I	Adversarial perspectives counter sycophancy — the decision is challenged, not validated
II	Per-finding anti-hallucination Verification Gate; no speculative objections
V	Stress-tests the decision (understand) before commitment, not the implementation
VII	Perspective count is stakes-calibrated (5-12) — no fixed over-spend on low-stakes calls