replication-driven-research - SKILL.md Agent Skill

name: replication-driven-research
description: Use when starting empirical analysis, creating a data pipeline, generating results, or when data or model specifications change. Enforces end-to-end reproducibility; every number in the paper must be regenerable from raw data by a script with a fixed seed. Replaces TDD for the research domain.

Replication-Driven Research

Overview

This skill is the research-domain analogue of test-driven development (TDD) and shares its core philosophy: evidence before claims, automated verification, invalidation on input change. No result is valid until the pipeline runs end-to-end without error. No number enters the paper until a script regenerates it from raw data. This is the single most important discipline in the superpapers plugin, and every analysis skill defers to it.

When to Use

Start of a new empirical project
First time running analysis code in a project
Before generating any table or figure for a paper
After any change to raw data, sample selection, or model specification
Before declaring a result "final" or "verified"
When the user asks "is this reproducible?"
When reviewer comments require re-running the analysis

Canonical Directory Structure

The canonical project layout is:

project-root/
├── data/
│   ├── raw/                    # raw downloads, never manually edited
│   ├── processed/              # cleaned data, output of scripts
│   └── manifest.md             # documents every dataset (source, URL, date, variables)
├── code/
│   ├── 01_collect.R            # or .py — fetches raw data
│   ├── 02_clean.R              # raw → processed
│   ├── 03_analyze.R            # processed → results
│   └── 04_figures.R            # processed → figures
├── output/
│   ├── tables/                 # .tex files generated by scripts
│   ├── figures/                # .pdf vector files generated by scripts
│   └── logs/                   # execution logs with timestamp + seed
├── paper/
│   ├── paper.tex               # main document
│   ├── references.bib          # bibliography
│   └── sections/               # split sections if needed
└── CLAUDE.superpapers.md       # project settings

On first invocation in a project, propose this structure to the user. The user can accept, adapt, or refuse. If the user refuses or has a different layout, the skill still works but flags deviations when encountered so they are not silent.
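
One way to flag layout deviations is a small check that lists which canonical paths are missing. The sketch below is illustrative only: the function name `layout_deviations` and the two constants are hypothetical, and the path lists simply mirror the tree above.

```python
from pathlib import Path

# Paths taken from the canonical layout above; adapt them if the user
# has agreed on a different structure.
EXPECTED_DIRS = [
    "data/raw", "data/processed", "code",
    "output/tables", "output/figures", "output/logs", "paper",
]
EXPECTED_FILES = ["data/manifest.md", "CLAUDE.superpapers.md"]


def layout_deviations(project_root):
    """Return the canonical paths that are missing under project_root."""
    root = Path(project_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

The function only reports; it creates nothing, so it can run before the user has confirmed any scaffolding.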

Mandatory Steps

Verify or scaffold structure. If data/raw, code/, and output/ do not exist, propose scaffolding. Wait for user confirmation before creating any directories or files at the project root.
Document every dataset in data/manifest.md. Required fields per dataset: name, source (URL or API endpoint), description, collection date, variables used, frequency, period covered, license or usage notes. See the Manifest Format section below.
Every result must have a generating script. No exceptions. Pasting numbers from a console or notebook into the paper is forbidden. Tables enter the paper via \input{output/tables/...}; figures enter via \includegraphics{output/figures/...}.
Fix the seed in every script that uses randomness. Document the seed in the script header. Use the project-level default seed from CLAUDE.superpapers.md — read from the current working directory, or walk up parent directories until found — unless the script explicitly overrides it and explains why. If the file is absent, ask the user for a project-level default on first need.
Run the pipeline end-to-end before declaring any result verified. Use a top-level run_all.sh (or Makefile) that executes scripts in the correct order. Verify exit code 0 and that every expected output file exists and is non-empty.
Log every run. Each end-to-end execution writes to output/logs/YYYY-MM-DD_HH-MM-SS.log with: timestamp, seed, relevant package versions, total runtime, input file hashes, exit status. See the Execution Log Format section below.
Invalidate on input change. If data/raw or any script in code/ changes, all downstream outputs are stale. Re-run the full pipeline. Do not trust cached tables or figures after an input change.
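
The seed lookup in the steps above (current working directory first, then each parent) could be sketched as follows. The skill does not specify how CLAUDE.superpapers.md stores the seed, so the `seed: <integer>` line parsed here is an assumption, and both function names are hypothetical.

```python
import re
from pathlib import Path

SETTINGS_NAME = "CLAUDE.superpapers.md"


def find_settings(start):
    """Return the nearest settings file at or above `start`, or None."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / SETTINGS_NAME
        if candidate.is_file():
            return candidate
    return None


def read_default_seed(settings_path):
    """Parse a `seed: <integer>` line (assumed format); None if absent."""
    text = Path(settings_path).read_text(encoding="utf-8")
    match = re.search(r"^\s*seed:\s*(\d+)\s*$", text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None
```

If `find_settings` returns None, or the file has no seed line, the step above applies: ask the user for a project-level default rather than inventing one.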

Manifest Format

Each dataset entry in data/manifest.md follows this format:

## unemployment_br

- **Source:** IBGE — PNADC Trimestral
- **Description:** Quarterly unemployment rate by state
- **URL:** https://sidra.ibge.gov.br/tabela/4099
- **Collected:** 2026-03-15
- **Variables:** unemployment_rate, quarter, state
- **Frequency:** Quarterly
- **Period:** 2012Q1 – 2025Q4
- **Collected by:** code/01_collect.R
- **License:** IBGE open data

The manifest is the single source of truth for what data is in the project. If a dataset exists in data/raw/ but not in the manifest, fix the manifest before proceeding.
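
The rule above can be checked mechanically once raw files can be matched to manifest entries. The skill does not say how to match them, so this sketch assumes a raw file named `<name>.<ext>` is documented under the heading `## <name>`; that convention and both function names are hypothetical.

```python
import re
from pathlib import Path


def manifest_names(manifest_path):
    """Collect dataset names from `## name` headings in the manifest."""
    text = Path(manifest_path).read_text(encoding="utf-8")
    return set(re.findall(r"^##\s+(\S+)\s*$", text, flags=re.MULTILINE))


def undocumented_raw_files(project_root):
    """List files in data/raw/ whose stem has no manifest heading.

    Assumption (not from the skill): data/raw/<name>.<ext> is documented
    under the heading `## <name>`.
    """
    root = Path(project_root)
    documented = manifest_names(root / "data" / "manifest.md")
    raw_dir = root / "data" / "raw"
    raw_files = sorted(p for p in raw_dir.iterdir() if p.is_file())
    return [p.name for p in raw_files if p.stem not in documented]
```

A non-empty result means the manifest needs fixing before the analysis proceeds.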

Execution Log Format

Each end-to-end run writes a log file with one field per line:

timestamp: 2026-04-10T14:23:45-03:00
seed: 20260410
R version: 4.4.1
renv lockfile hash: abc123...
inputs:
  data/raw/pnadc.csv: sha256:def456...
scripts:
  code/01_collect.R: OK (4.2s)
  code/02_clean.R: OK (12.8s)
  code/03_analyze.R: OK (45.1s)
outputs:
  output/tables/tab_descriptives.tex: created
  output/tables/tab_main.tex: created
  output/figures/fig_trend.pdf: created
exit: 0
total runtime: 62.1s

Logs are append-only: never edit or delete a past log. Together they form the audit trail for every published result.
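
A minimal sketch of producing such a log from Python, using only the standard library. It covers the timestamp, seed, input hashes, exit status, and runtime; the package-version, script, and output sections are omitted for brevity. The function names are hypothetical.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large raw files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def format_log_entry(seed, inputs, exit_code, runtime_seconds, now=None):
    """Build the timestamp, seed, inputs, exit, and runtime fields."""
    now = now or datetime.now().astimezone()
    lines = [
        f"timestamp: {now.isoformat(timespec='seconds')}",
        f"seed: {seed}",
        "inputs:",
    ]
    lines += [f"  {path}: sha256:{sha256_of(path)}" for path in inputs]
    lines += [f"exit: {exit_code}", f"total runtime: {runtime_seconds:.1f}s"]
    return "\n".join(lines) + "\n"


def write_log(log_dir, entry, now=None):
    """Write the entry to a new timestamped file; refuse to overwrite."""
    now = now or datetime.now().astimezone()
    log_path = Path(log_dir) / now.strftime("%Y-%m-%d_%H-%M-%S.log")
    # Mode "x" raises FileExistsError instead of replacing an earlier log.
    with open(log_path, "x", encoding="utf-8") as handle:
        handle.write(entry)
    return log_path
```

Opening the file in exclusive-creation mode is one simple way to keep past logs untouched.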

Anti-Patterns

Copying a number from console output into the paper
"It worked yesterday" without re-running after a change to data or code
Results with no traceable script from raw data
Undocumented manual steps in the pipeline (opening a GUI, editing a spreadsheet, clicking a button)
Using set.seed(Sys.time()), random.seed(None), or no seed at all
Ignoring warnings emitted by the pipeline run
Running only one script and trusting cached outputs from other scripts
Editing raw data files manually to "fix" anomalies
Committing output/ files that are not produced by the current pipeline

Verification Before Completion

data/raw/ contains only downloaded files, never hand-edited
data/manifest.md documents every dataset in data/raw/ and data/processed/
Every .tex table and .pdf figure in output/ has a script in code/ that generates it
A top-level runner (run_all.sh or Makefile) exists and runs end-to-end with exit code 0
Seeds are fixed and documented at the top of every stochastic script
An execution log exists in output/logs/ for the latest run
No numeric values hardcoded in paper.tex; all results pulled via \input{} or \includegraphics{}
No orphaned outputs in output/ that the current pipeline does not produce
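
Several items in this checklist can be automated. The sketch below is a Python stand-in for what run_all.sh or a Makefile would do, not a replacement the skill prescribes: it runs the steps in order, stops at the first non-zero exit code, checks that each expected output exists and is non-empty, and lists orphaned files. The function names and the idea of passing an explicit list of expected outputs are assumptions.

```python
import subprocess
from pathlib import Path


def run_pipeline(commands, expected_outputs):
    """Run each command in order; return (ok, problems)."""
    problems = []
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            problems.append(f"{' '.join(command)}: exit {result.returncode}")
            return False, problems  # stop at the first failing step
    for output in map(Path, expected_outputs):
        if not output.is_file():
            problems.append(f"{output}: missing")
        elif output.stat().st_size == 0:
            problems.append(f"{output}: empty")
    return not problems, problems


def orphaned_outputs(output_dir, expected_outputs):
    """List files under tables/ and figures/ that are not expected."""
    expected = {Path(p).resolve() for p in expected_outputs}
    found = []
    for sub in ("tables", "figures"):
        folder = Path(output_dir) / sub
        if folder.is_dir():
            found += [p for p in folder.rglob("*") if p.is_file()]
    return sorted(str(p) for p in found if p.resolve() not in expected)
```

With R scripts, a command would look like `["Rscript", "code/01_collect.R"]`; output/logs/ is deliberately excluded from the orphan check because logs are kept across runs.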