hypothesis-tree - SKILL.md Agent Skill

name: hypothesis-tree
description: >
  Structured hypothesis tree manager for multi-step research investigations.
  Use when investigating "why X fails" questions that branch into sub-hypotheses,
  or when you need to track falsifiable claims with linked evidence and timestamps.
  Subcommands: add, update, review, prune.
auto-activate: false

Hypothesis Tree Manager

Use when building, updating, or reviewing the persistent hypothesis tree for your research project. Maintains a structured markdown file at docs/hypothesis_tree.md where each hypothesis has falsifiable criteria, linked evidence, and a next experiment.

Trigger: When the user types /hypothesis-tree followed by a subcommand.

Iron Law

NO HYPOTHESIS WITHOUT FALSIFIABLE CRITERIA AND A NEXT EXPERIMENT

Arguments

$ARGUMENTS — subcommand and parameters:
- add "H: <statement>" — add a new hypothesis node
- update <ID> --evidence-for|--evidence-against "<description>" --file <path> — attach evidence
- review — display all hypotheses with staleness indicators
- prune — archive hypotheses with no evidence updates in 14+ days

Anti-Rationalization Table

Excuse	Reality
"I'll add evidence later"	Evidence without a timestamp rots; link it now or lose the thread
"The evidence is obvious"	Obvious to you today, opaque to a reviewer in 6 months — cite the file
"This hypothesis is too simple to formalize"	Simple claims break papers; the backprop tier inversion was "obvious" until it wasn't
"I just need a placeholder"	Placeholders become permanent — state the falsifiable criterion or don't add it

Red Flags — STOP and Restart

Adding a hypothesis with no falsifiable criterion
Linking evidence to a file path that does not exist on disk
Claiming evidence supports a hypothesis without reading the actual file contents
Updating confidence level without new evidence (confidence follows evidence, not intuition)
Any hypothesis that can be tested neither with data already in results/ nor with a concrete next experiment

If any red flag triggers: STOP. Re-read the hypothesis. Rewrite it with falsifiable criteria before proceeding.

Tree Node Schema

Each hypothesis in docs/hypothesis_tree.md follows this structure:

### H<N>: <Statement>

- **Status:** ACTIVE | SUPPORTED | REFUTED | ARCHIVED
- **Confidence:** HIGH | MEDIUM | LOW
- **Falsifiable criterion:** <What specific observation would disprove this?>
- **Evidence FOR:**
  - [<description>](<relative path to file>) — <date added>
- **Evidence AGAINST:**
  - [<description>](<relative path to file>) — <date added>
- **Next experiment:** <Concrete action that would strengthen or weaken this hypothesis>
- **Last updated:** <YYYY-MM-DD>
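
As a rough illustration of the schema above, the following Python sketch renders one node as markdown. The `Hypothesis` dataclass and `render_node` function are hypothetical names introduced here, not part of the skill; the sketch assumes the Iron Law is enforced by rejecting an empty criterion or next experiment.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Hypothesis:
    """Hypothetical in-memory form of one node; fields mirror the schema."""
    node_id: str
    statement: str
    criterion: str
    next_experiment: str
    status: str = "ACTIVE"
    confidence: str = "LOW"
    evidence_for: list = field(default_factory=list)
    evidence_against: list = field(default_factory=list)
    last_updated: date = field(default_factory=date.today)


def render_node(h: Hypothesis) -> str:
    """Return the markdown for one node, refusing nodes that break the Iron Law."""
    if not h.criterion.strip() or not h.next_experiment.strip():
        raise ValueError("falsifiable criterion and next experiment are required")
    lines = [
        f"### {h.node_id}: {h.statement}",
        "",
        f"- **Status:** {h.status}",
        f"- **Confidence:** {h.confidence}",
        f"- **Falsifiable criterion:** {h.criterion}",
        "- **Evidence FOR:**",
        *[f"  - {entry}" for entry in h.evidence_for],
        "- **Evidence AGAINST:**",
        *[f"  - {entry}" for entry in h.evidence_against],
        f"- **Next experiment:** {h.next_experiment}",
        f"- **Last updated:** {h.last_updated.isoformat()}",
    ]
    return "\n".join(lines) + "\n"
```

The evidence lists hold already formatted entry strings, so the renderer never has to guess at link syntax.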

Project Context

Result directories: results/evaluation/{model}/ contains per-kernel JSON files
Models: claude-sonnet, gemini-2.5-flash-lite, groq-llama-3.3-70b, together-qwen-3.5
Results: results/ directory (check CLAUDE.md for structure)
Analysis outputs: analysis/ directory (if present)
Check .claude/rules/known-issues.md for known anomalies
Check CLAUDE.md for project-specific counts and baselines

Workflow

Phase 1: Parse Subcommand

Extract the subcommand from $ARGUMENTS:

add — go to Phase 2a
update — go to Phase 2b
review — go to Phase 2c
prune — go to Phase 2d
No argument or help — display usage summary
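
The routing above could be sketched in Python as follows. This assumes $ARGUMENTS arrives as a single shell-style string; the `route` function and the `PHASES` table are hypothetical names, and treating an unknown subcommand like help is an assumption, since the skill does not say what to do in that case.

```python
import shlex

# Subcommand -> workflow phase, as listed in Phase 1.
PHASES = {"add": "2a", "update": "2b", "review": "2c", "prune": "2d"}


def route(arguments: str):
    """Split the argument string and return (phase, remaining tokens)."""
    tokens = shlex.split(arguments)  # keeps quoted statements as one token
    if not tokens or tokens[0] == "help":
        return ("usage", [])
    if tokens[0] not in PHASES:
        return ("usage", [])  # assumption: unknown subcommands show usage
    return (PHASES[tokens[0]], tokens[1:])
```

For example, `route('add "H: x causes y"')` yields phase `2a` with the quoted statement kept as one token.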

Phase 2a: Add Hypothesis

Parse the hypothesis statement from arguments
Verification gate: Confirm the statement includes:
- A falsifiable criterion (what would disprove it?)
- A next experiment (what concrete action tests it?)
- If either is missing, prompt the user — do NOT add without them
Read docs/hypothesis_tree.md (create if it doesn't exist)
Assign the next sequential ID (H1, H2, H3...)
Write the new node using the Tree Node Schema above
Set Status=ACTIVE, Confidence=LOW (new hypotheses start at LOW)
Set Last updated to today's date

Verification gate: Read back the file and confirm the node was written correctly.
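
One possible way to assign the next sequential ID is sketched below. It assumes node headings always follow the `### H<N>:` pattern from the schema; `next_id` is a hypothetical helper. It takes the highest existing number rather than the node count, so that gaps or archived nodes do not cause an ID to be reused.

```python
import re

# Matches schema headings such as "### H3: <Statement>".
HEADING = re.compile(r"^### H(\d+):", re.MULTILINE)


def next_id(tree_text: str) -> str:
    """Return the next unused hypothesis ID for the given file contents."""
    numbers = [int(n) for n in HEADING.findall(tree_text)]
    return f"H{max(numbers, default=0) + 1}"
```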

Phase 2b: Update Hypothesis

Parse the hypothesis ID and evidence from arguments
Read docs/hypothesis_tree.md — confirm the ID exists
If --file <path> is provided:
- Verification gate: Check the file exists on disk (ls or Glob)
- Verification gate: Read the file and confirm it contains data that backs the described evidence, whether that evidence is for or against the hypothesis
- If the file doesn't exist or doesn't support the claim: STOP and report to user
Append the evidence entry with today's date
If the optional --confidence <level> flag is provided, set Confidence to HIGH, MEDIUM, or LOW, but only when the newly attached evidence justifies the change
Update "Last updated" to today's date
If evidence now clearly supports or refutes the hypothesis, suggest changing Status

Verification gate: Read back the updated node and confirm evidence was appended correctly.
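
The file-existence gate and the evidence line format might look like this in Python. This is a sketch under assumptions: `evidence_entry` is a hypothetical helper, and evidence paths are taken to be relative to the project root, which the schema does not state. Confirming that the file contents actually back the claim still requires reading the file; code can only check that it exists.

```python
from datetime import date
from pathlib import Path


def evidence_entry(description: str, rel_path: str, root: Path, today: date) -> str:
    """Build one evidence bullet, refusing paths that do not exist on disk."""
    if not (root / rel_path).is_file():
        raise FileNotFoundError(f"evidence file not found: {rel_path}")
    # "\u2014" is the dash separator used between link and date in the schema.
    return f"  - [{description}]({rel_path}) \u2014 {today.isoformat()}"
```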

Phase 2c: Review All Hypotheses

Read docs/hypothesis_tree.md
For each hypothesis, compute:
- Days since last update
- Staleness indicator: FRESH (<7 days), AGING (7-13 days), STALE (14+ days, matching the prune threshold)
- Evidence balance: count of FOR vs AGAINST entries
Display summary table:

=== HYPOTHESIS TREE REVIEW ===
ID    Status     Confidence  Evidence (F/A)  Staleness   Statement
H1    ACTIVE     HIGH        4/1             FRESH       Per-kernel difficulty is not...
H2    ACTIVE     LOW         1/0             STALE       Augmentation level affects...
H3    REFUTED    —           2/3             FRESH       Model size predicts pass rate...

Flag any ACTIVE hypothesis that is STALE — these need attention
Flag any hypothesis with 0 evidence entries — needs data or should be pruned

Verification gate: Cross-check that every file path referenced in evidence entries still exists on disk. Report any broken links.
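
A minimal Python sketch of the staleness label and the broken-link check follows. Both function names are hypothetical. Day 14 is counted as STALE here so that the label lines up with the 14+ day rule used by prune; that boundary choice is an assumption. Evidence paths are again assumed to be relative to a project root.

```python
import re
from datetime import date
from pathlib import Path

# Captures the path from indented evidence bullets: "  - [desc](path) ...".
LINK = re.compile(r"^[ \t]+- \[[^\]]*\]\(([^)]+)\)", re.MULTILINE)


def staleness(last_updated: date, today: date) -> str:
    """Label a node by the number of days since its last update."""
    days = (today - last_updated).days
    if days < 7:
        return "FRESH"
    if days < 14:
        return "AGING"
    return "STALE"


def broken_links(tree_text: str, root: Path) -> list:
    """Return every evidence path in the tree that is missing under root."""
    return [p for p in LINK.findall(tree_text) if not (root / p).exists()]
```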

Phase 2d: Prune Stale Hypotheses

Read docs/hypothesis_tree.md
Identify hypotheses with:
- No evidence updates in 14+ days AND Status=ACTIVE
- Zero evidence entries AND created 7+ days ago (a node with no evidence has never been updated, so its Last updated date is its creation date)
Display candidates for archival — do NOT auto-archive
Wait for user confirmation before changing any Status to ARCHIVED
For confirmed archives, move to an ## Archived section at the bottom of the file

Verification gate: Read back the file and confirm archived hypotheses were moved correctly.
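
The two candidate rules above can be sketched as a pure function that only lists candidates and never archives anything itself. The dict keys and `prune_candidates` are hypothetical. The sketch assumes Last updated stands in for both "last evidence update" and, for nodes with no evidence, the creation date; skipping nodes that are already ARCHIVED is also an assumption.

```python
from datetime import date


def prune_candidates(nodes, today: date) -> list:
    """Return IDs to show the user for confirmation; nothing is modified."""
    candidates = []
    for node in nodes:
        if node["status"] == "ARCHIVED":
            continue  # assumption: already archived nodes are not re-listed
        age = (today - node["last_updated"]).days
        stale_active = node["status"] == "ACTIVE" and age >= 14
        empty_and_old = node["evidence_count"] == 0 and age >= 7
        if stale_active or empty_and_old:
            candidates.append(node["id"])
    return candidates
```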

Phase 3: Report

After any subcommand, display:

What was changed (added/updated/archived)
Current tree statistics: total hypotheses, active, supported, refuted, archived
Any recommended next actions (stale hypotheses, broken evidence links)
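
The tree statistics could be gathered with a sketch like the one below, which counts the Status lines written by the schema. `tree_stats` is a hypothetical helper and assumes every node has exactly one well-formed Status line.

```python
import re
from collections import Counter

STATUSES = ("ACTIVE", "SUPPORTED", "REFUTED", "ARCHIVED")
# Matches schema lines such as "- **Status:** ACTIVE".
STATUS_LINE = re.compile(
    r"^- \*\*Status:\*\* (ACTIVE|SUPPORTED|REFUTED|ARCHIVED)[ \t]*$",
    re.MULTILINE,
)


def tree_stats(tree_text: str) -> dict:
    """Count hypotheses per status, plus a total, from the file contents."""
    counts = Counter(STATUS_LINE.findall(tree_text))
    stats = {status.lower(): counts.get(status, 0) for status in STATUSES}
    stats["total"] = sum(counts.values())
    return stats
```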