rigorous-experiments - SKILL.md Agent Skill

name: rigorous-experiments description: This skill should be used when designing, running, validating, or auditing statistical experiments on personal or observational time-series data (health metrics, speech/text corpora, behavioral logs, diaries, n-of-1 self-tracking). It enforces pre-registration, exact permutation tests, FDR discipline, data-validation gates, adversarial code review, and cross-validation with external models. Triggers on "design an experiment", "test this hypothesis on my data", "is this correlation real", "audit these findings", "pre-register", "validate this dataset", or any n-of-1 / quantified-self analysis request.

Rigorous Experiments

Run statistical experiments on observational/personal time-series data that survive scrutiny. Distilled from a 54-experiment n-of-1 program in which sampled permutation tests, missing-data artifacts, app-categorization bugs and collinear mechanisms repeatedly manufactured — and then destroyed — "findings". Every rule here exists because its absence once produced a wrong conclusion.

Modes

Pick the mode matching the request; chain them for a full study.

Mode	When	Reference
design	New hypothesis or study	`references/design.md`
conduct	Implementing + running the experiment	`references/statistics.md`
validate-data	Before trusting ANY new data source	`references/data-validation.md`
cross-validate	Findings worth defending; code review; external model review (e.g. GPT Pro)	`references/cross-validation.md`
investigate-leads	A sweep/run produced leads (p<0.06, not FDR-confirmed)	`references/lead-investigation.md`
audit	Re-examining past claims, registries of findings	`references/statistics.md` §Audit

Non-negotiable core (all modes)

Pre-register before computing. Hypotheses, exact tests, family size m, and the acceptance threshold go in the script docstring BEFORE the first run. Post-hoc tests are reported as descriptive, never promoted.
Exact permutation, never sampled, on small n. A session sequence of n=19 has 18 circular shifts: the minimum honest p is ~1/19≈0.05. Sampling 2000 shifts with replacement fabricates precision (this killed a flagship "q=0.028" finding). Use scripts/perm_stats.py.
Permute over the full calendar, not the compressed series. Shifting a gap-compressed series breaks the timeline; keep missingness as NaN masks re-applied per shift. Event indicators must be pure 0/1 with no gaps — missingness lives only in the outcome series.
BH with FIXED family size m, a LITERAL CONSTANT declared at design time — never len(tests) (that defeats pre-registration; the linter rejects it). Assert the run matches the declared m. Confirmatory families small and separate from exploratory sweeps; pooling everything into one BH buries true effects, cherry-picking families manufactures them. Plain BH assumes independent/positively-dependent tests; for strongly dependent lag families use BH-Yekutieli or maxT resampling.
Stationarity check before correlating trending series. Exact circular shift on a trending series is "exactly, reproducibly wrong": report prewhitened-r (AR1 residuals) and stationary bootstrap alongside.
Stratify before pooling (Simpson check): within group (e.g. therapy/coaching) and within regime (pre/post known breaks). A pooled r=−0.25 once hid therapy −0.64 vs coaching +0.53.
Controls can re-describe a finding, not just kill it. When a control collapses an effect, check collinearity of control and predictor — r(self-focus, session-length)=0.79 meant "mechanism ambiguous", not "effect fake". Report the decomposition.
Honest statuses: confirmed (q<0.10 exact) ≠ lead (p<0.06) ≠ null ≠ descriptive. Status flips are recorded, never silently edited. Nulls with adequate power are findings. Robust ≠ significant: a lead surviving leave-one-out at small n is still underpowered — a candidate for prospective test, not a finding. 8b. Series scope is part of the test. A lagged "[t+1]" means the next unit in the series the hypothesis is about, not the next pooled row; define scope before lagging (it once flipped a sign). When recomputing a prior result, reproduce a stored artifact on that scope first.
Privacy: raw text/audio never enters output files or external uploads — statistics, rates and embedding-derived scores only.
Plain-language reporting: every statistic carries its practical meaning inline; define r/p/q/n once per report; no untranslated jargon calques. Narrative first, numbers as support.

Workflow (full study)

validate-data gate on any new source (see reference — the checklist has caught: zero-vs-missing conflation, dedup semantics, substring category bugs, rolling purge windows, timezone conventions).
design: pre-registered hypotheses + family + power sanity.
conduct: implement with scripts/perm_stats.py; run; write results JSON with tests, statuses, and caveats including known limitations.
cross-validate: adversarial code review (e.g. Codex read-only) BEFORE trusting results; fix findings; re-run. For major claims, external model review with a privacy-screened archive.
investigate-leads on anything that surfaced as a lead (not at the same scale — the triage battery: LOO, directionality, detrend-vs-step, within-cycle, prewhiten+bootstrap; consolidate same-direction leads into one composite). Mark diagnostic runs descriptive_only: true.
Verdicts in honest prose (mixed/rejected allowed); report; registry update with status provenance.

Viewing results

Launch the bundled explorer over any directory of results JSONs:

python3 scripts/explorer.py <results_dir> [--port 8799] [--pattern "exp*.json"] [--sort newest|oldest]

Generates explorer.html in the directory, starts (or reuses) a loopback http server on the port, and opens the browser: experiment list with confirmed/lead badges, filter, sortable test tables color-coded by status, verdicts, caveats, raw JSON. The page fetches result files live — re-running experiments updates the view; re-run the script only when new result files appear. Serve over localhost, never file:// (CDN fonts) and never on a non-loopback interface (results may contain personal statistics).

Evals

Run python3 evals/run_evals.py (from the skill directory) to lint an experiment script/results pair against the standards (pre-registration present, fixed literal m, exact perm usage, caveats, no raw text in outputs). A diagnostic/triage run that intentionally mints no new tests sets descriptive_only: true in its results JSON to satisfy the "has tests" check. Eval cases in evals/cases/ document expected pass/fail examples.