gtd

---
name: gtd
description: Warrant-first research GTD system. Manages the capture-clarify-organize-reflect-engage cycle for causal inference research. Scaffolds hypotheses/, insights/, decisions/ directories. Interrogates conjectures, files results, tracks binding decisions, checks pipeline freshness, drives the courtroom checklist.
allowed-tools: Read, Write, Edit, Bash, AskUserQuestion
argument-hint: '[init | conjecture | insight | decide | pipeline | status | courtroom]'
---

/gtd — Warrant-First Research GTD

Philosophy

Research with an AI thinking partner is iterated dialogue between human judgment and agent throughput, where every cycle either strengthens a warrant or kills a claim. The harness is whatever machinery makes that dialogue fast, honest, and recoverable.

Four elements:

| Element | Definition | Role of dialogue |
|---|---|---|
| Frame | A question worth asking | Interrogated by dialogue |
| Work | A way to interrogate it | Supervised by dialogue (agents make this cheap) |
| Warrant | A way to know you've earned the answer | Built by dialogue; this is the product |
| Dialogue | The substrate across all three | Human and agent argue their way to claims that hold |

The binding constraint has shifted. Pre-agents, Work was binding (coding, cleaning, drafting). Agents make Work cheap. The binding constraint is now Frame and Warrant — what to ask, and whether you've earned the answer. Design the harness around that reallocation.

Commands

`/gtd init`

Creates the directory structure in the current project:

hypotheses/INDEX.md    — DAG of testable claims
insights/INDEX.md      — Atomic findings with provenance
decisions/INDEX.md     — Binding commitments that constrain the pipeline
dashboard.html         — Visual status (serves from localhost)
scripts/build_dashboard_data.py — Regenerates dashboard_data.json

Then asks: "What's the first claim you want to test?"
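
A minimal sketch of the scaffolding step, assuming the three INDEX files start from the headers shown under File Formats and INDEX.md Formats. The function name `init_project` and the stub contents are illustrative assumptions; dashboard.html and build_dashboard_data.py are left out because their contents are not specified here.

```python
from pathlib import Path

# Paths come from the list above; the stub contents are assumptions
# based on the INDEX formats described later in this document.
SCAFFOLD = {
    "hypotheses/INDEX.md": "# Hypothesis DAG\n",
    "insights/INDEX.md": "# Insights Log\n",
    "decisions/INDEX.md": "| ID | Decision | Date | Rationale |\n|---|---|---|---|\n",
}


def init_project(root):
    """Create missing scaffold files under root; never overwrite existing ones."""
    created = []
    for rel, stub in SCAFFOLD.items():
        path = Path(root) / rel
        if path.exists():
            continue  # keep whatever the project already has
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(stub, encoding="utf-8")
        created.append(rel)
    # The pipeline scripts directory is created empty.
    (Path(root) / "scripts").mkdir(exist_ok=True)
    return created
```

Skipping files that already exist keeps the step safe to rerun in a project that is already underway.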

`/gtd conjecture`

The clarify step. Adversarial interrogation:

1. You state something you believe.
2. I run the courtroom checklist:
   - Estimand: What parameter are you trying to learn?
   - Population: On whom?
   - Variation: What source of variation identifies it?
   - Mechanism: What's the treatment assignment process?
   - Falsification: What specific result would kill this?
   - Sub-claims: Can this decompose into independently testable pieces?
3. We agree on the precise statement.
4. I write hypotheses/HXX_slug.md and update INDEX.

`/gtd insight`

File a result:

What did we find? (One sentence, exact numbers.)
Which hypothesis does it speak to?
What pipeline script produced it? (Must be pipeline, not ad hoc.)
Is the figure fresh? (Script timestamp vs. output timestamp.)

Writes insights/YYYY-MM-DD_slug.md, updates the linked hypothesis, regenerates dashboard_data.json.

`/gtd decide`

Commit a binding design choice:

What's the decision?
Why? (One sentence.)
What does it constrain downstream?

Writes to decisions/INDEX.md. Updates CLAUDE.md if the decision persists across sessions.

`/gtd pipeline`

Check freshness:

For each output, is the source script newer? → stale.
Does every figure trace to a pipeline script? → orphans flagged.
When was the pipeline last verified?

Runs python3 scripts/build_dashboard_data.py and reports.

`/gtd status`

Quick orientation: hypothesis DAG, pipeline freshness, next actions.

`/gtd courtroom`

Walk through the DiD checklist stage by stage:

1. Show Bite: the event was real
2. Event Studies: dynamic effects, pre-trends = 0
3. Falsification: placebo finds nothing
4. Main Results: headline ATT
5. Mechanisms: why, heterogeneity

For each: present the exhibit, interrogate it, confirm or flag. Populates the manuscript view as we go. After completion, draft the narrative from confirmed material in the chosen voice.

The Courtroom (DiD Checklist)

Every quasi-experimental study presents its case. The courtroom is the general form — not just DiD but any design that requires:

1. A first-order effect to exist (show bite)
2. A credible counterfactual (event study / pre-trends)
3. Falsification of confounders (placebo period)
4. The estimate itself (main results)
5. Understanding of why (mechanisms)

Two cross-cutting standards apply to ALL stages:

Beautiful — figures and tables communicate clearly
Verified — pipeline reproducibility, referee2 audits, number consistency

File Formats

Hypothesis (`hypotheses/HXX_slug.md`)

---
id: H01a
status: conjecture | testing | confirmed | rejected | complicated
parent: H01
date_proposed: 2026-05-19
---

## Claim
[One sentence, testable.]

## Courtroom
- Estimand: [what parameter]
- Population: [on whom]
- Variation: [what identifies it]
- Falsification: [what kills it]

## Evidence
- [links to insights, added as they accumulate]

Insight (`insights/YYYY-MM-DD_slug.md`)

---
date: 2026-04-10
updates: H01a
result: confirmed | rejected | complicated
stage: [2, 4]           # optional — courtroom stage(s) this speaks to. Overrides keyword matching.
script: scripts/r/05_estimate_did.R
output: output/figures/event_study.pdf
---

## Finding
[The fact. Numbers. Script path. What it means for the hypothesis.]

## Key Numbers
[Table with point estimate, SE, CI, p-value, N]

## Context
[Specification details, baseline, relative magnitude]
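
A minimal sketch of reading the frontmatter of an insight file, assuming flat `key: value` lines between two `---` delimiters as in the template above. The function name `parse_frontmatter` is hypothetical, values are returned as plain strings, and a real project might prefer a YAML library.

```python
def parse_frontmatter(text):
    """Return the key/value pairs between the first two '---' lines as strings."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            # Drop a trailing " # comment", such as the one on the stage field.
            fields[key.strip()] = value.split(" #", 1)[0].strip()
    return {}  # no closing delimiter: treat the file as malformed
```

Returning an empty dict for a missing or unclosed block lets a caller flag the file instead of silently reading partial metadata.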

Decision (`decisions/INDEX.md`)

Table format. One row per binding decision:

| ID | Decision | Date | Rationale |
|---|---|---|---|
| D01 | Primary estimator is TWFE with district and week FE | 2026-04-01 | Sufficient pre-periods; no staggered-timing bias |

Status Transitions

conjecture → testing:      First pipeline script assigned to test this hypothesis
conjecture → rejected:     "Kills it" condition met immediately
testing → confirmed:       Positive evidence + falsification passes (Stages 2-4 confirmed)
testing → rejected:        Evidence contradicts + falsification confirms the negative
testing → complicated:     Evidence mixed OR falsification fails
complicated → confirmed:   Complication resolved (new evidence or new design)
complicated → rejected:    Further investigation confirms failure

Rules:

A hypothesis CANNOT move to confirmed without passing falsification (Stage 3)
A hypothesis CAN move directly from conjecture to rejected (if "kills it" condition met immediately)
complicated is NOT terminal — it requires resolution
A parent hypothesis's status is capped by its worst child (if any child is complicated, the parent is at most testing)
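
The transitions and the falsification rule above can be sketched as a small lookup table. This is an illustration, not the skill's implementation: the names `ALLOWED` and `can_transition` are hypothetical, and treating confirmed and rejected as terminal is an assumption (the rules only say that complicated is not terminal). The parent/child roll-up is left out because the full ordering of statuses is not spelled out.

```python
# Transition table copied from the list above, plus the direct
# conjecture -> rejected move that the rules allow.
ALLOWED = {
    "conjecture": {"testing", "rejected"},
    "testing": {"confirmed", "rejected", "complicated"},
    "complicated": {"confirmed", "rejected"},
    "confirmed": set(),  # assumption: terminal
    "rejected": set(),   # assumption: terminal
}


def can_transition(current, target, falsification_passed=False):
    """True if the move is in the table and, for 'confirmed', Stage 3 has passed."""
    if target not in ALLOWED.get(current, set()):
        return False
    if target == "confirmed" and not falsification_passed:
        return False
    return True
```

For example, `can_transition("testing", "confirmed")` is False until `falsification_passed=True` is supplied, which mirrors the Stage 3 rule.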

Pipeline Levels

| Level | Name | Contains | Example |
|---|---|---|---|
| 1 | Cleaning | Raw → clean; format standardization | `00_clean_survey.py` |
| 2 | Derived | Clean → derived variables; joins, constructs | `02_build_panel.py` |
| 3 | Classification | Derived → treatment/control assignment | `03_classify_treated.py` |
| 4 | Figures | Descriptive outputs, maps, timelines | `04_descriptive_figures.R` |
| 5 | Estimation | Causal inference; the main results | `05_estimate_did.R` |

Rules:

A level-N script may only read outputs from levels < N
Numbering within level is sequential (00, 01, 02...)
Language suffix indicates the tool (.py, .R, .do)
Every output in output/figures/ must map to exactly one pipeline script
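
One way to check the last rule is to compare the figures on disk against a record of which script writes which output. The sketch below assumes such a record exists as a dict; that `manifest` structure and the name `audit_figures` are hypothetical, since this document does not define how script-to-output links are stored.

```python
def audit_figures(figures, manifest):
    """Classify each figure by how many pipeline scripts claim it.

    figures:  iterable of output paths found in output/figures/
    manifest: dict mapping script path -> list of output paths it writes
              (a hypothetical structure, used here for illustration only)
    """
    owners = {}
    for script, outputs in manifest.items():
        for out in outputs:
            owners.setdefault(out, []).append(script)
    # Orphans: figures that no pipeline script claims.
    orphans = sorted(f for f in figures if f not in owners)
    # Contested: figures claimed by more than one script.
    contested = {f: sorted(owners[f]) for f in figures if len(owners.get(f, [])) > 1}
    return orphans, contested
```

An empty `orphans` list and an empty `contested` dict together mean every figure maps to exactly one script.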

Freshness

Freshness is computed dynamically by comparing file modification times:

output.mtime >= script.mtime → FRESH (output generated after script was last modified)
output.mtime < script.mtime → STALE (script changed since output was generated)
Output does not exist → MISSING

Freshness is NEVER treated as a stored fact. It is always computed at runtime by build_dashboard_data.py. If an insight's frontmatter carries a fresh field, that value is only a snapshot from filing time; the dashboard recomputes it rather than trusting the snapshot.
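
The three rules above can be sketched with the standard library. This is a minimal illustration rather than the contents of build_dashboard_data.py: the function name `freshness` is hypothetical, and the script file is assumed to exist.

```python
import os


def freshness(script_path, output_path):
    """Return 'FRESH', 'STALE', or 'MISSING' by comparing modification times."""
    if not os.path.exists(output_path):
        return "MISSING"
    # Assumes script_path exists; os.path.getmtime raises OSError otherwise.
    if os.path.getmtime(output_path) >= os.path.getmtime(script_path):
        return "FRESH"
    return "STALE"
```

Because the result depends only on the files on disk at call time, there is nothing to cache and nothing to go out of date.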

INDEX.md Formats

`hypotheses/INDEX.md` — Hierarchical DAG

# Hypothesis DAG

## H01 — Main Claim
Status: **testing**
One sentence description.

### H01a — Sub-claim
Status: **confirmed** (date)
One sentence description.

Two levels: parent hypotheses (##) and children (###). Each entry has bold status inline.

`insights/INDEX.md` — Table

# Insights Log

| Date | Finding | Hypothesis | Status |
|---|---|---|---|
| 2026-04-15 | [Placebo is null](file.md) | H01a | confirmed |
| 2026-04-10 | [Urban ATT = 2.3pp](file.md) | H01a | confirmed |

Most recent first. Links to individual insight files.

Courtroom → Dashboard Flow

When /gtd courtroom confirms a stage:

The relevant insight(s) are filed (if not already)
The linked hypothesis status may update
build_dashboard_data.py regenerates the JSON
Dashboard Courtroom tab shows the stage as confirmed (green)
Dashboard Manuscript tab allows the confirmed material to appear

When /gtd courtroom flags a stage as complicated:

An insight is filed with result: complicated
The linked hypothesis moves to complicated
Dashboard Courtroom tab shows the stage with a yellow indicator
Manuscript tab moves that material to "Unearned"

Hooks

Only add hooks for failures that are silently wrong (produce plausible but incorrect output).

Do hook: Classification file changes but the county file is not rebuilt → wrong treatment set → wrong ATT → wrong numbers presented. Silent failure. Hook it.

Don't hook: Missing figure → LaTeX won't compile. Visible failure. Don't hook it.

Starter hook (adapt paths to your project; the command greps the tool-call JSON that Claude Code passes to hooks on stdin):

{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write",
      "hooks": [{
        "type": "command",
        "command": "if grep -q 'LINCHPIN_FILE_NAME'; then echo '⚠️ PIPELINE DEPENDENCY: Rebuild downstream'; fi"
      }]
    }]
  }
}

Dashboard

The dashboard (dashboard.html) reads from dashboard_data.json generated by scripts/build_dashboard_data.py. It shows:

Status — pipeline freshness, hypothesis summary, latest finding, next actions
Courtroom — 5-stage checklist with expandable evidence panels
Pipeline — scripts grouped by level with freshness indicators
Hypotheses — claim DAG with color-coded status
Decisions — binding commitments table
Figures — all outputs: pipeline vs. orphaned, fresh vs. stale
Manuscript — only confirmed claims with fresh evidence appear here; unearned claims are listed separately

Serve with: cd project_root && python3 -m http.server 8080

GTD Mapping

| GTD Stage | Research Equivalent | Mechanism |
|---|---|---|
| Capture | Ideas emerge through dialogue | The chat itself |
| Clarify | Courtroom checklist + interrogation | `/gtd conjecture` or `/gtd courtroom` |
| Organize | Commit to directory | `hypotheses/`, `decisions/`, `CLAUDE.md` |
| Reflect | Dashboard review | `dashboard.html` |
| Engage | Run the pipeline | `scripts/` → `output/` |

Principles

The pipeline is the source of truth. A figure only counts if it traces to a numbered pipeline script.
Freshness is visible. You should never wonder whether an output is current.
Decisions bind. Once committed, they constrain downstream work across sessions.
Hypotheses are falsifiable. Every one has a "kills it" condition written before the test.
The conversation is the inbox. It generates ideas. The directory captures them.
Warrant is the product. Not the coefficient — the structure that earns the right to assert it.
Verification is cheap and constant. Not a quality gate at the end.
