rigorous-experiments

Stars: 268

This skill should be used when designing, running, validating, or auditing statistical experiments on personal or observational time-series data (health metrics, speech/text corpora, behavioral logs, diaries, n-of-1 self-tracking). It enforces pre-registration, exact permutation tests, FDR discipline, data-validation gates, adversarial code review, and cross-validation with external models. Triggers on "design an experiment", "test this hypothesis on my data", "is this correlation real", "audit these findings", "pre-register", "validate this dataset", or any n-of-1 / quantified-self analysis request.

By glebis · Updated 6/8/2026


Rigorous Experiments

Run statistical experiments on observational/personal time-series data that survive scrutiny. Distilled from a 54-experiment n-of-1 program in which sampled permutation tests, missing-data artifacts, app-categorization bugs and collinear mechanisms repeatedly manufactured — and then destroyed — "findings". Every rule here exists because its absence once produced a wrong conclusion.

Modes

Pick the mode matching the request; chain them for a full study.

| Mode | When | Reference |
| --- | --- | --- |
| design | New hypothesis or study | references/design.md |
| conduct | Implementing + running the experiment | references/statistics.md |
| validate-data | Before trusting ANY new data source | references/data-validation.md |
| cross-validate | Findings worth defending; code review; external model review (e.g. GPT Pro) | references/cross-validation.md |
| investigate-leads | A sweep/run produced leads (p<0.06, not FDR-confirmed) | references/lead-investigation.md |
| audit | Re-examining past claims, registries of findings | references/statistics.md §Audit |

Non-negotiable core (all modes)

  1. Pre-register before computing. Hypotheses, exact tests, family size m, and the acceptance threshold go in the script docstring BEFORE the first run. Post-hoc tests are reported as descriptive, never promoted.
  2. Exact permutation, never sampled, on small n. A session sequence of n=19 has 18 non-trivial circular shifts (19 counting the identity), so the minimum attainable p is 1/19 ≈ 0.053. Sampling 2000 shifts with replacement fabricates precision (this killed a flagship "q=0.028" finding). Use scripts/perm_stats.py.
  3. Permute over the full calendar, not the compressed series. Shifting a gap-compressed series breaks the timeline; keep missingness as NaN masks re-applied per shift. Event indicators must be pure 0/1 with no gaps — missingness lives only in the outcome series.
  4. BH with FIXED family size m, a LITERAL CONSTANT declared at design time, never len(tests) (that defeats pre-registration; the linter rejects it). Assert the run matches the declared m. Keep confirmatory families small and separate from exploratory sweeps: pooling everything into one BH buries true effects, and cherry-picking families manufactures them. Plain BH assumes independent or positively dependent tests; for strongly dependent lag families use Benjamini-Yekutieli (BY) or maxT resampling.
  5. Stationarity check before correlating trending series. Exact circular shift on a trending series is "exactly, reproducibly wrong": report prewhitened-r (AR1 residuals) and stationary bootstrap alongside.
  6. Stratify before pooling (Simpson check): within group (e.g. therapy/coaching) and within regime (pre/post known breaks). A pooled r=−0.25 once hid therapy −0.64 vs coaching +0.53.
  7. Controls can re-describe a finding, not just kill it. When a control collapses an effect, check collinearity of control and predictor — r(self-focus, session-length)=0.79 meant "mechanism ambiguous", not "effect fake". Report the decomposition.
  8. Honest statuses: confirmed (q<0.10 exact) ≠ lead (p<0.06) ≠ null ≠ descriptive. Status flips are recorded, never silently edited. Nulls with adequate power are findings. Robust ≠ significant: a lead surviving leave-one-out at small n is still underpowered, so it is a candidate for a prospective test, not a finding.
  8b. Series scope is part of the test. A lagged "[t+1]" means the next unit in the series the hypothesis is about, not the next pooled row; define scope before lagging (it once flipped a sign). When recomputing a prior result, reproduce a stored artifact on that scope first.
  9. Privacy: raw text/audio never enters output files or external uploads — statistics, rates and embedding-derived scores only.
  10. Plain-language reporting: every statistic carries its practical meaning inline; define r/p/q/n once per report; no untranslated jargon calques. Narrative first, numbers as support.
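
Rules 2 and 3 can be illustrated with a minimal sketch of an exact circular-shift test over the full calendar. This is not the contents of scripts/perm_stats.py, which are not shown here; the function names, the choice to shift the event indicator while the outcome's NaN mask stays fixed on the calendar, and the two-sided Pearson statistic are all assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def exact_circular_shift_test(event, outcome):
    """Enumerate every circular shift once (no sampling).

    event   : 0/1 indicator for every calendar day, no gaps (assumed layout).
    outcome : floats over the same calendar; NaN marks a missing day.
    Returns (observed_r, p), where p counts all n shifts including the identity.
    """
    event = list(event)
    if len(event) != len(outcome):
        raise ValueError("event and outcome must cover the same calendar")
    if any(e not in (0, 1) for e in event):
        raise ValueError("event indicator must be pure 0/1 with no gaps")

    n = len(event)
    # Missingness lives only in the outcome; the mask is re-applied per shift.
    observed = [i for i, y in enumerate(outcome) if not math.isnan(y)]
    ys = [outcome[i] for i in observed]

    stats = []
    for k in range(n):
        shifted = event[-k:] + event[:-k] if k else event
        stats.append(pearson_r([shifted[i] for i in observed], ys))

    r_obs = stats[0]  # k = 0 is the unshifted, observed alignment
    extreme = sum(abs(s) >= abs(r_obs) - 1e-12 for s in stats)
    return r_obs, extreme / n
```

Because the identity shift is always counted, the smallest p this sketch can return on a calendar of length n is 1/n, which is the floor rule 2 describes.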

Workflow (full study)

  1. validate-data gate on any new source (see reference — the checklist has caught: zero-vs-missing conflation, dedup semantics, substring category bugs, rolling purge windows, timezone conventions).
  2. design: pre-registered hypotheses + family + power sanity.
  3. conduct: implement with scripts/perm_stats.py; run; write results JSON with tests, statuses, and caveats including known limitations.
  4. cross-validate: adversarial code review (e.g. Codex read-only) BEFORE trusting results; fix findings; re-run. For major claims, external model review with a privacy-screened archive.
  5. investigate-leads on anything that surfaced as a lead (not at the same scale — the triage battery: LOO, directionality, detrend-vs-step, within-cycle, prewhiten+bootstrap; consolidate same-direction leads into one composite). Mark diagnostic runs descriptive_only: true.
  6. Verdicts in honest prose (mixed/rejected allowed); report; registry update with status provenance.
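
The conduct step and the status rules can be sketched as follows. This is a hedged illustration only: the constant name M_DECLARED, the function names, the result field names and the example p-values are hypothetical, not taken from the skill's scripts. The thresholds (q<0.10 for confirmed, p<0.06 for lead) are the ones stated in the core rules.

```python
M_DECLARED = 4  # hypothetical family size, written as a literal at design time

def bh_qvalues(pvalues, m):
    """Benjamini-Hochberg adjusted p-values for a family of declared size m."""
    if len(pvalues) != m:
        raise ValueError(
            f"run has {len(pvalues)} tests but m={m} was pre-registered"
        )
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p down so each q is the minimum over ranks >= its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

def status(p, q, q_confirm=0.10, p_lead=0.06):
    """Map (p, q) to the status vocabulary; 'descriptive' is assigned elsewhere."""
    if q < q_confirm:
        return "confirmed"
    if p < p_lead:
        return "lead"
    return "null"

# Made-up p-values, for illustration only.
example_p = [0.004, 0.055, 0.30, 0.50]
example_q = bh_qvalues(example_p, M_DECLARED)
example_results = {
    "m": M_DECLARED,
    "tests": [
        {"p": p, "q": q, "status": status(p, q)}
        for p, q in zip(example_p, example_q)
    ],
    "caveats": ["illustrative numbers, not real data"],
}
```

Passing m explicitly and refusing to run when the number of tests differs is one way to make the "assert the run matches the declared m" rule fail loudly rather than drift.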

Viewing results

Launch the bundled explorer over any directory of results JSONs:

python3 scripts/explorer.py <results_dir> [--port 8799] [--pattern "exp*.json"] [--sort newest|oldest]

Generates explorer.html in the directory, starts (or reuses) a loopback http server on the port, and opens the browser: experiment list with confirmed/lead badges, filter, sortable test tables color-coded by status, verdicts, caveats, raw JSON. The page fetches result files live — re-running experiments updates the view; re-run the script only when new result files appear. Serve over localhost, never file:// (CDN fonts) and never on a non-loopback interface (results may contain personal statistics).
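
The loopback-only constraint can be illustrated with Python's standard http.server. This is a sketch of the binding rule, not the code of scripts/explorer.py; the function name make_loopback_server is hypothetical.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

def make_loopback_server(results_dir, port=8799):
    """Serve results_dir over HTTP, reachable only from this machine."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=results_dir)
    # Binding to 127.0.0.1 (not "" or 0.0.0.0) keeps the server off the network.
    return ThreadingHTTPServer(("127.0.0.1", port), handler)

# Usage (blocks until interrupted):
#   server = make_loopback_server("results")
#   server.serve_forever()
```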

Evals

Run python3 evals/run_evals.py (from the skill directory) to lint an experiment script/results pair against the standards (pre-registration present, fixed literal m, exact perm usage, caveats, no raw text in outputs). A diagnostic/triage run that intentionally mints no new tests sets descriptive_only: true in its results JSON to satisfy the "has tests" check. Eval cases in evals/cases/ document expected pass/fail examples.
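
As an illustration of what a "fixed literal m" check could look like, the sketch below uses Python's ast module. It is not the implementation in evals/run_evals.py, which is not shown here; the function name and the assumption that the family size is stored in a module-level variable called M are hypothetical.

```python
import ast

def family_size_is_literal(source, name="M"):
    """True if `name` is assigned exactly once at module level to an int literal."""
    tree = ast.parse(source)
    assignments = [
        node
        for node in tree.body
        if isinstance(node, ast.Assign)
        and any(isinstance(t, ast.Name) and t.id == name for t in node.targets)
    ]
    if len(assignments) != 1:
        return False  # missing, or reassigned after the first run
    value = assignments[0].value
    # len(tests) parses as ast.Call, so it fails this check.
    return isinstance(value, ast.Constant) and type(value.value) is int
```

Under these assumptions, a script containing `M = 6` passes, while `M = len(tests)` or a second assignment to `M` fails.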

Install via CLI
npx skills add https://github.com/glebis/claude-skills --skill rigorous-experiments
Repository Details
Forks: 36
Branch: main
Path: SKILL.md