replication-driven-research


Use when starting empirical analysis, creating a data pipeline, generating results, or when data or model specifications change. Enforces end-to-end reproducibility — every number in the paper must be regenerable from raw data by a script with a fixed seed. Replaces TDD for the research domain.

By brycewang-stanford. Updated 6/3/2026


Replication-Driven Research

Overview

This skill is the research-domain analogue of test-driven development. The core philosophy is the same as TDD: evidence before claims, automated verification, invalidation on input change. No result is valid until the pipeline runs end-to-end without error. No number enters the paper until a script regenerates it from raw data. This is the single most important discipline in the superpapers plugin, and every analysis skill defers to it.

When to Use

  • Start of a new empirical project
  • First time running analysis code in a project
  • Before generating any table or figure for a paper
  • After any change to raw data, sample selection, or model specification
  • Before declaring a result "final" or "verified"
  • When the user asks "is this reproducible?"
  • When reviewer comments require re-running the analysis

Canonical Directory Structure

The canonical project layout is:

project-root/
├── data/
│   ├── raw/                    # raw downloads, never manually edited
│   ├── processed/              # cleaned data, output of scripts
│   └── manifest.md             # documents every dataset (source, URL, date, variables)
├── code/
│   ├── 01_collect.R            # or .py — fetches raw data
│   ├── 02_clean.R              # raw → processed
│   ├── 03_analyze.R            # processed → results
│   └── 04_figures.R            # processed → figures
├── output/
│   ├── tables/                 # .tex files generated by scripts
│   ├── figures/                # .pdf vector files generated by scripts
│   └── logs/                   # execution logs with timestamp + seed
├── paper/
│   ├── paper.tex               # main document
│   ├── references.bib          # bibliography
│   └── sections/               # split sections if needed
└── CLAUDE.superpapers.md       # project settings

On first invocation in a project, propose this structure to the user. The user can accept, adapt, or refuse. If the user refuses or has a different layout, the skill still works but flags deviations when encountered so they are not silent.
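
A read-only check for deviations from this layout might look like the sketch below. The directory list is copied from the tree above; the function name `missing_dirs` is a hypothetical helper, and nothing is created, since scaffolding waits for user confirmation.

```python
from pathlib import Path

# Directories taken from the canonical layout above.
CANONICAL_DIRS = [
    "data/raw",
    "data/processed",
    "code",
    "output/tables",
    "output/figures",
    "output/logs",
    "paper",
]


def missing_dirs(project_root):
    """Return canonical directories absent under project_root (read-only check)."""
    root = Path(project_root)
    return [d for d in CANONICAL_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Report deviations so they are flagged rather than silent.
    for d in missing_dirs("."):
        print(f"deviation from canonical layout: {d}/ not found")
```

An empty list means the project already matches the canonical layout; any other result is something to raise with the user, not to fix automatically.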

Mandatory Steps

  1. Verify or scaffold structure. If any of data/raw/, code/, or output/ is missing, propose scaffolding. Wait for user confirmation before creating any directories or files at the project root.

  2. Document every dataset in data/manifest.md. Required fields per dataset: name, source (URL or API endpoint), description, collection date, variables used, frequency, period covered, license or usage notes. See the Manifest Format section below.

  3. Every result must have a generating script. No exceptions. Pasting numbers from a console or notebook into the paper is forbidden. Tables enter the paper via \input{output/tables/...}; figures enter via \includegraphics{output/figures/...}.

  4. Fix the seed in every script that uses randomness. Document the seed in the script header. Use the project-level default seed from CLAUDE.superpapers.md — read from the current working directory, or walk up parent directories until found — unless the script explicitly overrides it and explains why. If the file is absent, ask the user for a project-level default on first need.

  5. Run the pipeline end-to-end before declaring any result verified. Use a top-level run_all.sh (or Makefile) that executes scripts in the correct order. Verify exit code 0 and that every expected output file exists and is non-empty.

  6. Log every run. Each end-to-end execution writes to output/logs/YYYY-MM-DD_HH-MM-SS.log with: timestamp, seed, relevant package versions, total runtime, input file hashes, exit status. See the Execution Log Format section below.

  7. Invalidate on input change. If data/raw or any script in code/ changes, all downstream outputs are stale. Re-run the full pipeline. Do not trust cached tables or figures after an input change.
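
Steps 4 and 5 can be sketched in Python as follows. This is a minimal illustration, not the plugin's own implementation: `find_settings` and `run_pipeline` are hypothetical names, the scripts are assumed to be `.py` files with a two-digit numeric prefix, and the seed is not parsed because the format of CLAUDE.superpapers.md is not specified here.

```python
import subprocess
import sys
from pathlib import Path

SETTINGS_NAME = "CLAUDE.superpapers.md"


def find_settings(start="."):
    """Return the nearest settings file at or above start, else None."""
    here = Path(start).resolve()
    for folder in [here, *here.parents]:
        candidate = folder / SETTINGS_NAME
        if candidate.is_file():
            return candidate
    return None  # caller should ask the user for a default seed


def run_pipeline(code_dir, interpreter=sys.executable):
    """Run numbered scripts in filename order; stop at the first failure.

    Returns (script_name, exit_code) for each script that was started.
    """
    results = []
    # Assumption: scripts are named like 01_collect.py, 02_clean.py, ...
    for script in sorted(Path(code_dir).glob("[0-9][0-9]_*.py")):
        done = subprocess.run([interpreter, str(script)])
        results.append((script.name, done.returncode))
        if done.returncode != 0:
            break  # later scripts would run on stale or missing inputs
    return results
```

A run counts as verified only if every returned exit code is 0 and the expected output files exist and are non-empty.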

Manifest Format

Each dataset entry in data/manifest.md follows this format:

## unemployment_br

- **Source:** IBGE — PNADC Trimestral
- **URL:** https://sidra.ibge.gov.br/tabela/4099
- **Collected:** 2026-03-15
- **Variables:** unemployment_rate, quarter, state
- **Frequency:** Quarterly
- **Period:** 2012Q1 – 2025Q4
- **Collected by:** code/01_collect.R
- **License:** IBGE open data

The manifest is the single source of truth for what data is in the project. If a dataset exists in data/raw/ but not in the manifest, fix the manifest before proceeding.
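
A manifest in this format can be checked mechanically. The sketch below assumes each dataset starts with a `## name` heading and each field is a `- **Label:** value` line, exactly as in the example entry; the field labels are copied from that example, and `missing_fields` is a hypothetical helper name.

```python
import re

# Field labels copied from the example entry above.
REQUIRED_FIELDS = ["Source", "URL", "Collected", "Variables",
                   "Frequency", "Period", "Collected by", "License"]

HEADING = re.compile(r"^##\s+(\S.*)$")
FIELD = re.compile(r"^-\s+\*\*(.+?):\*\*\s*(.*)$")


def missing_fields(manifest_text):
    """Map each dataset name to the required fields its entry lacks."""
    entries = {}
    current = None
    for line in manifest_text.splitlines():
        line = line.strip()
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1).strip()
            entries[current] = set()
            continue
        field = FIELD.match(line)
        # A field with an empty value is treated as missing.
        if field and current is not None and field.group(2).strip():
            entries[current].add(field.group(1).strip())
    return {
        name: [f for f in REQUIRED_FIELDS if f not in found]
        for name, found in entries.items()
        if any(f not in found for f in REQUIRED_FIELDS)
    }
```

An empty dictionary means every entry is complete. The check does not map files in data/raw/ to entries, because the link between a dataset name and its raw file names is not defined by the format.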

Execution Log Format

Each end-to-end run writes a new log file with one field per line:

timestamp: 2026-04-10T14:23:45-03:00
seed: 20260410
R version: 4.4.1
renv lockfile hash: abc123...
inputs:
  data/raw/pnadc.csv: sha256:def456...
scripts:
  code/01_collect.R: OK (4.2s)
  code/02_clean.R: OK (12.8s)
  code/03_analyze.R: OK (45.1s)
outputs:
  output/tables/tab_descriptives.tex: created
  output/tables/tab_main.tex: created
  output/figures/fig_trend.pdf: created
exit: 0
total runtime: 62.1s

The log directory is append-only: each run adds a new file, and existing files are never edited or deleted. Keep every log file, because together they are the audit trail for every published result.
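
Writing such a log could be sketched as below. This covers only a subset of the fields (timestamp, seed, input hashes, exit status, runtime) and leaves out package versions and per-script status; `sha256_of` and `write_log` are hypothetical helper names.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large raw files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_log(log_dir, seed, input_paths, exit_code, runtime_s, now=None):
    """Write one log file per run; refuse to overwrite an existing one."""
    now = now or datetime.now().astimezone()
    lines = [f"timestamp: {now.isoformat(timespec='seconds')}",
             f"seed: {seed}",
             "inputs:"]
    lines += [f"  {Path(p).as_posix()}: sha256:{sha256_of(p)}"
              for p in input_paths]
    lines += [f"exit: {exit_code}", f"total runtime: {runtime_s:.1f}s"]
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"{now:%Y-%m-%d_%H-%M-%S}.log"
    with open(log_path, "x") as handle:  # "x" mode fails if the file exists
        handle.write("\n".join(lines) + "\n")
    return log_path
```

Opening the file in exclusive-creation mode is one way to keep the audit trail from being overwritten by accident. Comparing the recorded input hashes against the current files is also a simple way to detect that outputs have gone stale.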

Anti-Patterns

  • Copying a number from console output into the paper
  • "It worked yesterday" without re-running after a change to data or code
  • Results with no traceable script from raw data
  • Undocumented manual steps in the pipeline (opening a GUI, editing a spreadsheet, clicking a button)
  • Using set.seed(Sys.time()), random.seed(None), or no seed at all
  • Ignoring warnings emitted by the pipeline run
  • Running only one script and trusting cached outputs from other scripts
  • Editing raw data files manually to "fix" anomalies
  • Committing output/ files that are not produced by the current pipeline
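
As a contrast to the seed anti-pattern, here is a minimal Python sketch of a stochastic step with a fixed, documented seed. The seed value is the one shown in the log example; `bootstrap_mean` is a hypothetical function used only for illustration.

```python
import random

SEED = 20260410  # documented at the top of the script; value from the log example


def bootstrap_mean(values, n_draws, seed):
    """Mean of bootstrap-resampled means, using a private seeded generator."""
    rng = random.Random(seed)  # local generator: no hidden global state
    means = []
    for _ in range(n_draws):
        sample = rng.choices(values, k=len(values))
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    first = bootstrap_mean(data, 200, SEED)
    second = bootstrap_mean(data, 200, SEED)
    print(first == second)  # same seed, same inputs, same result
```

Passing the seed explicitly, instead of seeding the global generator, keeps the result independent of whatever other code has drawn random numbers earlier in the run.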

Verification Before Completion

  • data/raw/ contains only downloaded files, never hand-edited
  • data/manifest.md documents every dataset in data/raw/ and data/processed/
  • Every .tex table and .pdf figure in output/ has a script in code/ that generates it
  • A top-level runner (run_all.sh or Makefile) exists and runs end-to-end with exit code 0
  • Seeds are fixed and documented at the top of every stochastic script
  • An execution log exists in output/logs/ for the latest run
  • No numeric values are hardcoded in paper.tex; tables are pulled in via \input{} and figures via \includegraphics{}
  • No orphaned outputs in output/ that the current pipeline does not produce
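
Part of this checklist can be automated. The sketch below assumes the pipeline can supply a list of the output paths it is expected to produce; that list, and the name `check_outputs`, are assumptions for illustration. It reports missing, empty, and orphaned files.

```python
from pathlib import Path


def check_outputs(output_dir, expected):
    """Compare files under output_dir with the paths the pipeline should produce.

    expected holds paths relative to output_dir, e.g. "tables/tab_main.tex".
    Files under logs/ are skipped because logs accumulate across runs.
    """
    root = Path(output_dir)
    expected = set(expected)
    missing = sorted(p for p in expected if not (root / p).is_file())
    empty = sorted(p for p in expected
                   if (root / p).is_file() and (root / p).stat().st_size == 0)
    present = {f.relative_to(root).as_posix()
               for f in root.rglob("*")
               if f.is_file() and f.relative_to(root).parts[0] != "logs"}
    orphaned = sorted(present - expected)
    return {"missing": missing, "empty": empty, "orphaned": orphaned}
```

All three lists should be empty before a result is declared verified. The remaining items, such as hand-edited raw data or hardcoded numbers in paper.tex, still need a manual review.
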

Install via CLI

npx skills add https://github.com/brycewang-stanford/Auto-Empirical-Research-Skills --skill replication-driven-research