name: check-reproducibility
description: Simulate a fresh-clone reproduction of the entire pipeline and diff the new outputs against the committed ones. Catches drift before paper submission or release.
disable-model-invocation: true
argument-hint: ""
allowed-tools: ["Bash", "Read", "Grep", "Glob"]
Check Reproducibility
Run the entire pipeline as if from a fresh clone, then diff the new output/ against the committed output/. Any drift is a reproducibility failure.
When to Use
- Before submitting a paper
- Before tagging a release
- Before merging a major branch
- When onboarding a new collaborator
- After any non-trivial Stata version upgrade
Steps
Pre-flight checks:
- Working tree is clean (git status shows no uncommitted changes); otherwise the diff is meaningless. If dirty, ask the user to commit/stash first.
- data/raw/ is non-empty, OR a RAW_DATA_RESTORE_CMD is configured (e.g., a make restore-raw target or a download URL documented in data/README.md).
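The pre-flight checks above can be sketched as two small shell functions. This is a minimal sketch, not the contents of scripts/check_reproducibility.sh: the function names are hypothetical, and it assumes RAW_DATA_RESTORE_CMD is exposed as an environment variable, which the text does not specify.

```shell
# Hypothetical helpers; the names are illustrative, not part of the skill.

# Succeeds when git reports no staged, unstaged, or untracked changes.
# Ignored files (such as data/raw/) do not count as changes here.
tree_is_clean() {
  [ -z "$(git status --porcelain)" ]
}

# Succeeds when the raw-data directory has at least one entry, or when a
# restore command is configured (ASSUMPTION: as an environment variable).
# Optional argument: the raw-data directory, defaulting to data/raw.
raw_data_available() {
  if [ -d "${1:-data/raw}" ] && [ -n "$(ls -A "${1:-data/raw}")" ]; then
    return 0
  fi
  [ -n "${RAW_DATA_RESTORE_CMD:-}" ]
}

# Intended use (not executed here):
#   tree_is_clean      || { echo "commit or stash first"; exit 1; }
#   raw_data_available || { echo "no raw data and no restore command"; exit 1; }
```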
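Snapshot current outputs:
    rm -rf /tmp/output_snapshot && cp -r output /tmp/output_snapshot
Removing any stale snapshot first matters: if /tmp/output_snapshot already exists, cp -r nests the copy inside it and the later diff compares the wrong directories.
Clean the worktree:
    bash scripts/check_reproducibility.sh --clean-only
This wraps git clean -dfx -e data/raw -e .claude/state to wipe everything else. data/raw/ is gitignored, and -x deletes ignored files too, so it is the -e data/raw exclusion that preserves it.
Re-run the pipeline:
    bash scripts/run_pipeline.sh
Capture the exit code; if it is non-zero, the pipeline itself failed and reproducibility cannot be assessed.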
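Capturing the exit code (and the timing that the report asks for) could look like the wrapper below. It is a sketch under assumptions: run_and_report is a hypothetical name, and date +%s is assumed to be available, which holds for GNU and BSD date.

```shell
# Hypothetical wrapper: runs a command, then prints its exit code and
# wall-clock duration in whole seconds. Returns the command's exit code.
run_and_report() {
  start=$(date +%s)
  status=0
  "$@" || status=$?   # record a failure instead of aborting the caller
  end=$(date +%s)
  echo "exit=${status} seconds=$((end - start))"
  return "$status"
}

# Intended use (not executed here):
#   run_and_report bash scripts/run_pipeline.sh \
#     || echo "pipeline failed; reproducibility cannot be assessed"
```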
Diff:
    diff -r /tmp/output_snapshot output | head -200
For binary files (PDF, PNG), diff will report differences but not show them. Compare the .csv companions of any flagged tables; those are text and can be diffed cell-by-cell.
Categorize drift:
- Numerical drift in .csv tables → FAIL (the analysis is non-reproducible; investigate seed, sample order, package versions)
- Visual drift in .pdf/.png figures → typically WARN (could be font rendering, scheme, or an actual difference; open both and compare)
- Timestamp metadata only → PASS (cosmetic; many tools embed timestamps)
- No drift → PASS
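The categories above can be turned into a first-pass triage by file extension. This is a rough sketch with hypothetical function names. Extension alone cannot tell timestamp-only drift from real drift, so a FAIL or WARN from it still calls for reading the actual diff. Treating unknown file types as WARN is an assumption, not something the categories specify.

```shell
# Hypothetical helper: map one drifted file to a category by extension.
classify_drift() {
  case "$1" in
    *.csv)       echo FAIL ;;   # numerical tables
    *.pdf|*.png) echo WARN ;;   # figures: open both and compare
    *)           echo WARN ;;   # ASSUMPTION: unknown types get a manual look
  esac
}

# Reads one category per line on stdin and prints the worst one seen.
# With no input at all (no drift) it prints PASS.
overall_verdict() {
  verdict=PASS
  while read -r category; do
    case "$category" in
      FAIL) verdict=FAIL ;;
      WARN) if [ "$verdict" != FAIL ]; then verdict=WARN; fi ;;
    esac
  done
  echo "$verdict"
}
```

Feeding overall_verdict one classify_drift result per differing file gives a provisional verdict for the report.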
Restore the snapshot if the drift is acceptable (otherwise leave the new outputs and investigate):
    rm -rf output && mv /tmp/output_snapshot output
Report:
- Stages that ran + timings
- Files that differ + diff category
- Verdict: PASS / WARN / FAIL
- If FAIL: top suspects (seeded randomness, package version drift, undeclared input)
Examples
- /check-reproducibility → Runs the full check on the current working tree.
- /check-reproducibility after upgrading Stata to a new version → Reveals any version-sensitive results.
Troubleshooting
- Pipeline fails on re-run: the most common cause is a do-file that references a file which exists locally but was never committed. Add it to git or document it in data/README.md.
- Numerical drift on bootstrap-based SE: bootstrap is reproducible only if set seed is at the top, ONCE, and bootstrap itself doesn't reseed internally. Check the do-file.
- Different cluster count from reghdfe singleton drop: singleton drop depends on the order observations were merged in; if merge order is not deterministic, this can drift. Add an explicit sort before estimation.
- Working tree dirty: commit or stash before running this skill.
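For the first item above, one way to spot candidate files before the clean step deletes them is to list everything that exists locally but is not tracked. A minimal sketch, assuming git is on the PATH and the command runs from the repository root; list_uncommitted_files is a hypothetical name.

```shell
# Hypothetical helper: list files that exist locally but are not tracked by
# git, including gitignored ones, skipping the raw-data directory.
list_uncommitted_files() {
  {
    git ls-files --others --exclude-standard            # untracked, not ignored
    git ls-files --others --ignored --exclude-standard  # untracked and ignored
  } | sed '/^data\/raw\//d' | sort -u
}
```

Any path in this list that a do-file reads is a likely cause of a failed re-run.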
Notes
- This skill is destructive: the clean step deletes every untracked or ignored file except data/raw/ and .claude/state. Triple-check that the clean step succeeded with data/raw/ intact.
- Long-running: full pipeline + diff. Run when you have time, not as a quick check.
- If the pipeline takes hours, consider running stage-by-stage diffs instead (compare output/tables/<stage> between runs).
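A stage-by-stage diff could be sketched as below. The helper name and the PASS/DRIFT labels are hypothetical. It assumes both runs keep their tables under tables/<stage> inside the given root directories, and it reports DRIFT when a stage directory is missing on either side.

```shell
# Hypothetical helper: compare one stage's tables between two output roots.
# Arguments: snapshot root, new root, stage name.
diff_stage() {
  if diff -r "$1/tables/$3" "$2/tables/$3" >/dev/null 2>&1; then
    echo "$3: PASS"
  else
    echo "$3: DRIFT"   # files differ, or a directory is missing
  fi
}

# Intended use (not executed here), with the snapshot from the steps above:
#   diff_stage /tmp/output_snapshot output <stage>
```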