hypothesis-experiment - SKILL.md Agent Skill

name: hypothesis-experiment
description: Run the full hypothesis experiment workflow (Steps 0-10), covering worktree, classify, design review (5 perspectives), human gate, implement, code review (5 perspectives), run, analyze, FINDINGS review (10 perspectives), self-audit, commit + PR. Enforces convergence protocol and hard gates.
argument-hint:
disable-model-invocation: true

Hypothesis Experiment Workflow

You are running the hypothesis experiment workflow for hypothesis $ARGUMENTS.

Canonical source: docs/contributing/hypothesis.md (v2.0). Read it NOW before proceeding.
Standards: docs/contributing/standards/experiments.md (ED-1–ED-6, RCV-1–RCV-6).
Template: docs/contributing/templates/hypothesis.md (FINDINGS.md structure).
Review prompts: See review-prompts.md for exact agent dispatch prompts.

Hard Rules (non-negotiable)

Follow the steps in order. Do not skip steps. Do not reorder.
Human gate at Step 3 is a STOP. Present the design and WAIT. Do not say "I'll proceed unless you stop me."
Convergence = zero CRITICAL + zero IMPORTANT from ALL reviewers in a round. Re-run the ENTIRE round after fixes. No exceptions. No shortcuts.
NEVER trust agent self-reported convergence. Independently verify every finding count. Agents have fabricated "0 CRITICAL, 0 IMPORTANT" when actual review found 3 CRITICAL + 18 IMPORTANT (#390).
Code review BEFORE running experiments (Step 5). Three of the four major bugs in PR #310 would have been caught by reviewing the code before the run.
Max 10 rounds per gate. If not converged by round 10, suspend the experiment.
Cross-gate regression: If Code or FINDINGS Review finds a design flaw, loop back to Step 2. Max 2 regressions total.
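The round and regression limits in the rules above can be tracked with a small counter. The sketch below is illustrative only: `GateBudget` and its method names are hypothetical and are not part of the harness or the convergence-review skill. This section does not say whether round counters reset after a regression, so the sketch leaves them untouched.

```python
from dataclasses import dataclass, field

MAX_ROUNDS_PER_GATE = 10  # hard rule: max 10 rounds per gate
MAX_REGRESSIONS = 2       # hard rule: max 2 cross-gate regressions total


@dataclass
class GateBudget:
    """Hypothetical tracker for review rounds and cross-gate regressions."""

    rounds: dict = field(default_factory=dict)  # gate name -> rounds used so far
    regressions: int = 0

    def start_round(self, gate: str) -> int:
        """Record one more round for `gate`; raise if that exceeds the limit."""
        used = self.rounds.get(gate, 0) + 1
        if used > MAX_ROUNDS_PER_GATE:
            raise RuntimeError(
                f"{gate}: not converged by round {MAX_ROUNDS_PER_GATE}; "
                "suspend the experiment"
            )
        self.rounds[gate] = used
        return used

    def regress_to_design(self) -> int:
        """Record a loop back to Step 2; raise if that exceeds the limit."""
        if self.regressions >= MAX_REGRESSIONS:
            raise RuntimeError(
                f"cross-gate regression limit ({MAX_REGRESSIONS}) reached"
            )
        self.regressions += 1
        return self.regressions
```

Each gate (Design, Code, FINDINGS) would call `start_round` with its own name, so the three round counters stay independent while the regression counter is shared.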

Lessons Learned (encoded from MEMORY.md)

--stderr must come BEFORE other flags in blis_run calls — harness checks position 3 only
CLI defaults vs workload-spec YAML: --rate mode uses CLI defaults (512/512 tokens). Workload YAMLs define own distributions. Capacity estimates MUST match the actual workload.
--total-kv-blocks default is context-dependent: CLI default 1000000 but defaults.yaml overrides to 132139 for llama/H100/TP=2. Check defaults.yaml.
Conservation formula: Canonical INV-1 base formula is 5-term: injected == completed + still_queued + still_running + dropped_unservable + timed_out (cluster runs add gateway/routing/encode buckets — see canonical INV-1 in docs/contributing/standards/invariants.md). parse_blis_output does NOT extract dropped_unservable or timed_out — parse them separately.
Analyzer verdict must match FINDINGS status: If analyze.py produces a different verdict than FINDINGS.md, acknowledge the discrepancy explicitly.
Think-before-coding calibration: Compute parameters analytically from alpha/beta coefficients FIRST, then validate with a tiny run.
Beta coefficients (llama-3.1-8b, H100, TP=2): [6910.42, 17.67, 2.84] -> stepTime = beta0 + beta1*cacheMissTokens + beta2*decodeTokens
Alpha coefficients: [1601.35, 3.51, 1805.54] -> queueDelay = alpha0 + alpha1*inputLen; outputProcessing = alpha2
WorkloadSpec YAML format: id: not client_id:, process: not type: in arrival, aggregate_rate: top-level, rate_fraction: per client, distribution params under params: with std_dev not stdev
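The coefficient entries above lend themselves to a quick analytical calculation before any run. The sketch below reads the step-time formula as a linear combination in the same form as the alpha formula. The function names are hypothetical, and the source does not state units, so results are in whatever unit the coefficients use.

```python
# Coefficients quoted above for llama-3.1-8b, H100, TP=2.
BETA = (6910.42, 17.67, 2.84)
ALPHA = (1601.35, 3.51, 1805.54)


def step_time(cache_miss_tokens: int, decode_tokens: int) -> float:
    """stepTime = beta0 + beta1*cacheMissTokens + beta2*decodeTokens."""
    beta0, beta1, beta2 = BETA
    return beta0 + beta1 * cache_miss_tokens + beta2 * decode_tokens


def queue_delay(input_len: int) -> float:
    """queueDelay = alpha0 + alpha1*inputLen."""
    alpha0, alpha1, _ = ALPHA
    return alpha0 + alpha1 * input_len


def output_processing() -> float:
    """outputProcessing = alpha2 (a constant)."""
    return ALPHA[2]


if __name__ == "__main__":
    # Example: a 512-token prompt with no cached prefix and no decode tokens
    # in the step (512 matches the CLI default input length noted above).
    print(round(step_time(512, 0), 2))
    print(round(queue_delay(512), 2))
    print(output_processing())
```

Estimates like these are only a starting point; the lesson above still calls for validating them with a tiny run.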

Step 0: Create Worktree

Create an isolated workspace FIRST, before any other work.

/superpowers:using-git-worktrees h-$ARGUMENTS

All subsequent steps happen in the worktree. Set your working directory there.

Step 1: Select and Classify

Check coverage gaps: Browse the hypothesis-archive branch for the coverage catalog
Classify the hypothesis:
- Family: Which of the 6 families? (See docs/contributing/standards/experiments.md)
- VV&UQ: Verification, Validation, or UQ?
- Type: Deterministic or Statistical? If statistical: dominance, monotonicity, equivalence, or Pareto?
Write the hypothesis sentence using the family-specific pattern from experiments.md
Add diagnostic clause: "If this fails, it would indicate..."

WARNING: Pose the hypothesis WITHOUT reading the code first. Code-grounded hypotheses test implementation, not behavior.

Step 2: Design + Design Review

2a: Design the Experiment

Follow ED-1 through ED-6 (see docs/contributing/standards/experiments.md):

ED-1: Controlled comparison (vary exactly ONE dimension)
ED-2: Rate awareness (run where effect expected AND where it should vanish)
ED-3: Precondition verification (in script, not just prose)
ED-4: Workload seed independence (3 seeds minimum for statistical: 42, 123, 456)
ED-5: Reproducibility (everything from run.sh alone)
ED-6: Config diff against referenced experiments
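ED-1 can be checked mechanically before anything runs. The sketch below assumes each configuration is held as a flat dictionary of parameters; the helper names and the example keys are hypothetical and do not come from the harness.

```python
_MISSING = object()  # sentinel so an absent key differs from any real value


def differing_keys(baseline: dict, treatment: dict) -> list:
    """Return the sorted parameter names whose values differ between configs."""
    keys = set(baseline) | set(treatment)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != treatment.get(k, _MISSING)
    )


def assert_single_dimension(baseline: dict, treatment: dict) -> str:
    """Raise unless exactly one parameter differs; return its name."""
    diff = differing_keys(baseline, treatment)
    if len(diff) != 1:
        raise ValueError(
            f"ED-1 violated: expected exactly 1 varied dimension, got {diff}"
        )
    return diff[0]


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    base = {"rate": 100, "seed": 42, "policy": "a"}
    treat = {"rate": 100, "seed": 42, "policy": "b"}
    print(assert_single_dimension(base, treat))
```

A parameter that is present in one configuration and absent from the other counts as a difference, which catches cases where one arm silently falls back to a default.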

Compute parameters analytically from alpha/beta coefficients. Do NOT guess.

2b: Design Review (5 perspectives)

Dispatch 5 parallel review agents using the convergence-review skill:

/convergence-review h-design

Alternatively, dispatch manually. See review-prompts.md Section A for exact prompts.

Perspectives: (1) Hypothesis Quality, (2) ED Rigor, (3) Parameter Calibration, (4) Control Completeness, (5) DES/Domain Fit

The convergence-review skill enforces the protocol automatically. If dispatching manually, apply the convergence protocol:

Launch all 5 in parallel as background Task agents (subagent_type="general-purpose", model=REVIEW_MODEL from --model flag, default "haiku")
Collect all findings classified as CRITICAL / IMPORTANT / SUGGESTION
Zero CRITICAL + zero IMPORTANT = converged -> proceed to Step 3
Any CRITICAL or IMPORTANT -> fix all, re-run ENTIRE round (not just failed perspectives)
Independently count findings yourself. Do not trust agent summaries.
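The manual protocol above amounts to counting findings from the raw reviewer output rather than from any summary line. The sketch below assumes a hypothetical output format in which each finding starts a line with a bracketed severity label such as `[IMPORTANT]`; the real format is defined by the prompts in review-prompts.md and may differ.

```python
import re
from collections import Counter

# Assumed format (not confirmed by the source): one finding per line,
# prefixed with a bracketed severity label.
FINDING_RE = re.compile(r"^\s*\[(CRITICAL|IMPORTANT|SUGGESTION)\]", re.MULTILINE)


def count_findings(raw_output: str) -> Counter:
    """Count findings by severity from one reviewer's raw output."""
    return Counter(FINDING_RE.findall(raw_output))


def round_converged(raw_outputs: dict, expected_reviewers: int) -> bool:
    """True only if every expected reviewer reported and none raised
    a CRITICAL or IMPORTANT finding."""
    if len(raw_outputs) != expected_reviewers:
        return False  # a missing reviewer means the round is incomplete
    for text in raw_outputs.values():
        counts = count_findings(text)
        if counts["CRITICAL"] or counts["IMPORTANT"]:
            return False
    return True
```

Because the count comes from the finding lines themselves, a reviewer that ends with "0 CRITICAL, 0 IMPORTANT" while listing a `[CRITICAL]` item still fails the round. SUGGESTION items are counted but do not block convergence.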

Step 3: Human Approval Gate — STOP HERE

Present the experiment design to the user. Include:

Hypothesis sentence + classification (family, VV&UQ, type)
Experiment design summary (configurations, controlled variables, seeds)
Parameter choices with analytical derivation
Planned controls (one per proposed mechanism)
Expected outcomes and diagnostic implications

This is a hard gate. WAIT for explicit human approval before proceeding.

Use the AskUserQuestion tool:

"Do you approve this experiment design?"
Options: "Approve — proceed to implementation", "Revise — I have feedback"

Step 4: Implement

First, copy the shared harness from the archive branch:

mkdir -p hypotheses/lib
git show hypothesis-archive:hypotheses/lib/harness.sh > hypotheses/lib/harness.sh
git show hypothesis-archive:hypotheses/lib/analyze_helpers.py > hypotheses/lib/analyze_helpers.py

Then create hypotheses/h-$ARGUMENTS/run.sh and hypotheses/h-$ARGUMENTS/analyze.py.

Mandatory harness requirements

run.sh MUST:

#!/bin/bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/../lib/harness.sh"
setup_experiment "${1:-}"

Use blis_run for EVERY simulation call (not $BINARY run directly)
Every blis_run needs a timeout tier: $TIMEOUT_QUICK (<100 req), $TIMEOUT_STANDARD (100-500), $TIMEOUT_EXTENDED (>500)
If using --total-kv-blocks, call preflight_kv_check first
Call python3 "$SCRIPT_DIR/analyze.py" ... at the end
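The timeout tiers listed above map directly from request count to a harness variable name. The sketch below shows that mapping; the function name is hypothetical, and the actual timeout values live in harness.sh and are not stated here.

```python
def timeout_tier(num_requests: int) -> str:
    """Return the name of the harness timeout variable for a run size.

    Tiers as listed above: <100 requests, 100-500, and >500.
    """
    if num_requests < 0:
        raise ValueError("num_requests must be non-negative")
    if num_requests < 100:
        return "TIMEOUT_QUICK"
    if num_requests <= 500:
        return "TIMEOUT_STANDARD"
    return "TIMEOUT_EXTENDED"


if __name__ == "__main__":
    for n in (50, 100, 500, 501):
        print(n, timeout_tier(n))
```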

analyze.py MUST:

#!/usr/bin/env python3
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "lib"))
from analyze_helpers import parse_blis_output, check_for_timeout

Use parse_blis_output() for all metric extraction
Check metrics["timed_out"] before computing ratios
Print warnings to stderr, results to stdout
Verify INV-1 conservation (5-term base formula: completed + still_queued + still_running + dropped_unservable + timed_out; add gateway/routing/encode buckets for cluster runs) for every run
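The INV-1 check in the last requirement can be written as a small helper. This is a minimal sketch: the function names and the single `cluster_buckets` total are hypothetical simplifications, and the canonical formula remains the one in docs/contributing/standards/invariants.md. Per the lessons learned, `dropped_unservable` and `timed_out` are not extracted by parse_blis_output, so the caller is assumed to have parsed them separately.

```python
import sys


def conservation_gap(injected: int, *, completed: int, still_queued: int,
                     still_running: int, dropped_unservable: int,
                     timed_out: int, cluster_buckets: int = 0) -> int:
    """Return injected minus the sum of all accounted-for requests.

    `cluster_buckets` is a hypothetical catch-all for the extra
    gateway/routing/encode buckets that cluster runs add.
    """
    accounted = (completed + still_queued + still_running
                 + dropped_unservable + timed_out + cluster_buckets)
    return injected - accounted


def verify_inv1(label: str, injected: int, **terms: int) -> bool:
    """Check INV-1 for one run; warn on stderr and return False on violation."""
    gap = conservation_gap(injected, **terms)
    if gap != 0:
        print(f"WARNING: INV-1 violated for {label}: gap of {gap} request(s)",
              file=sys.stderr)
        return False
    return True
```

The warning goes to stderr, in line with the output rule above, so that stdout carries only results.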

Step 5: Code Review (5 perspectives) — BEFORE running

Every run.sh and analyze.py must be code-reviewed BEFORE running.

Dispatch 5 parallel review agents using the convergence-review skill:

/convergence-review h-code hypotheses/h-$ARGUMENTS/

Alternatively, dispatch manually. See review-prompts.md Section B for exact prompts.

Perspectives: (1) Parser-Output Agreement, (2) CLI Flag Correctness, (3) YAML Field Validation, (4) Config Diff (ED-6), (5) Seed and Determinism

Apply the convergence protocol (same rules as Step 2b).

Cross-gate regression: If this review finds a design flaw, loop back to Step 2.

Step 6: Run Experiments

After Code Review converges:

Deterministic: Single seed sufficient
Statistical: Minimum 3 seeds (42, 123, 456) per configuration
Execute: bash hypotheses/h-$ARGUMENTS/run.sh
Verify reproducibility: running twice produces identical output
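The reproducibility check in the last item can be done by comparing captured output from two runs. The sketch below assumes the output of each run has been captured as bytes (for example by redirecting it to a file and reading it back); the helper names are hypothetical. The digest gives the verdict, and the line finder helps locate a mismatch.

```python
import hashlib
from typing import Optional


def output_digest(data: bytes) -> str:
    """SHA-256 hex digest of one run's captured output."""
    return hashlib.sha256(data).hexdigest()


def runs_identical(first: bytes, second: bytes) -> bool:
    """True if the two captured outputs are byte-for-byte identical."""
    return output_digest(first) == output_digest(second)


def first_differing_line(first: bytes, second: bytes) -> Optional[int]:
    """Return the 1-based number of the first differing line,
    or None if no line differs."""
    a_lines = first.splitlines()
    b_lines = second.splitlines()
    for number, (a, b) in enumerate(zip(a_lines, b_lines), start=1):
        if a != b:
            return number
    if len(a_lines) != len(b_lines):
        # One output is a prefix of the other; the first extra line differs.
        return min(len(a_lines), len(b_lines)) + 1
    return None
```

Recording the digest alongside the results also gives later readers a cheap way to confirm they reproduced the same output.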

Step 7: Analyze and Document

Review analyzer output from Step 6
Trace every causal claim through code (RCV-1: cite file:line)
Compute expected values from first principles for any "surprises" (RCV-2)
Check mechanism AND direction (RCV-3)
Write FINDINGS.md using docs/contributing/templates/hypothesis.md — ALL sections must be non-empty

Step 8: FINDINGS Review (10 perspectives)

Dispatch 10 parallel review agents using the convergence-review skill:

/convergence-review h-findings hypotheses/h-$ARGUMENTS/FINDINGS.md

Alternatively, dispatch manually. See review-prompts.md Section C for exact prompts.

Perspectives: (1) Code Verifier, (2) Experiment Designer, (3) Statistical Rigor, (4) Control Auditor, (5) Standards Compliance, (6) Substance/Logic, (7) DES Mechanism, (8) Reproducibility, (9) Cross-Experiment, (10) User Guidance

Expected: 1-5 rounds (10 perspectives = higher quality bar).

CRITICAL: Independently read every agent's output and count findings yourself. Do NOT rely on agent self-reported totals.

Cross-gate regression: If design flaw found, loop back to Step 2. Max 2 regressions total.

Step 9: Self-Audit (6 dimensions)

This is NOT an agent pass. Stop. Think critically. Answer each question yourself.

Logic bugs in analyzer: Trace through analyze.py mentally. Edge cases? Silent defaults to 0? Integer vs float?
Reproducibility: Would running ./run.sh again produce identical output? Any non-deterministic dependencies?
FINDINGS.md consistency: Does Status match Results data? Does Devil's Advocate actually argue against the conclusion?
Cross-experiment contradictions: Do findings contradict prior experiments or MEMORY.md knowledge?
User guidance: Would a BLIS user know what to do with these findings?
Issue filing completeness: Does every actionable finding have a planned issue?
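The analyzer questions in the first audit item (silent defaults to 0, integer vs float) often show up in ratio computations. The sketch below shows one defensive pattern, assuming metrics are held in dictionaries as in the analyze.py requirements; the function name and the example metric key are hypothetical.

```python
from typing import Optional


def safe_ratio(treatment: dict, baseline: dict, key: str) -> Optional[float]:
    """Return treatment[key] / baseline[key] as a float.

    Returns None when either value is missing or the baseline is zero,
    rather than silently substituting 0.
    """
    numerator = treatment.get(key)
    denominator = baseline.get(key)
    if numerator is None or denominator is None:
        return None  # missing metric: surface it, do not default to 0
    if denominator == 0:
        return None  # undefined ratio
    return float(numerator) / float(denominator)


if __name__ == "__main__":
    # "latency" is an illustrative key, not a confirmed metric name.
    print(safe_ratio({"latency": 3}, {"latency": 2}, "latency"))
    print(safe_ratio({}, {"latency": 2}, "latency"))
```

Returning None forces the caller to handle the missing case explicitly, which is easier to audit than a ratio that quietly evaluates to 0.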

Fix all issues found. Then proceed to Step 10.

Step 10: Verify + Commit + PR

If code fixes were discovered:

go build ./...
go test ./... -count=1
golangci-lint run ./...

All three must pass before committing.

Commit and PR:

/commit-commands:commit-push-pr

PR description must include:

Hypothesis sentence and status
Key findings (1-3 bullets)
Fixes #NNN for any issues addressed

Post-PR: File issues per the taxonomy

Bug: --label bug — code defects discovered
Enhancement: --label enhancement — improvements needed
New hypothesis: --label hypothesis — follow-up experiments (use .github/ISSUE_TEMPLATE/hypothesis.md)
Design limitation: --label design — documented limitations
Standards update: --label standards — new rules/invariants

Issue title format for hypotheses: behavioral prediction ("X should Y"), NOT a task ("test X").

Convergence Protocol Quick Reference

Rule	Detail
Converged	Zero CRITICAL + zero IMPORTANT from ALL reviewers in current round
Not converged	Fix all issues, re-run ENTIRE round
Max rounds	10 per gate (Design, Code, FINDINGS each independent)
SUGGESTION items	Do not block convergence
Agent timeout	5 min per reviewer; if exceeded, check output and restart
Agent failure	Fall back to performing that review directly
Severity doubtful?	If fixing it would change a conclusion → IMPORTANT. If only readability → SUGGESTION
Model for reviewers	Default: haiku (~2-3 min, thorough reviews). Override via `/convergence-review <gate> --model sonnet|opus`