name: ml-feedback-ladder
description: Use when planning how to verify an ML experiment cheaply before expensive runs - design the R0-R7 ladder from local sanity checks to full study, with promotion/stop criteria.
ML Feedback Ladder
Overview
In normal software, a passing test suite can mean the code works. In ML research, passing tests show only that the code PATH might run - they do NOT show that the method works. You need STAGED EMPIRICAL verification, ordered cheapest-to-expensive, where every cheap check GATES the expensive cluster/GPU jobs that follow it.
This skill OWNS the canonical ladder. Design the rungs for ONE specific experiment with your human partner before launching anything.
Core principle: Cheap checks gate expensive jobs. Never spend a slow rung to find a bug a fast rung would have caught.
Upstream: the experiment, metric, and protocol come from superpowers-ml:ml-experiment-design.
Downstream: the ladder you design here becomes verification steps in superpowers-ml:writing-plans, and the final rung hands off to superpowers-ml:ml-result-review.
The Ladder
Each rung names what it CHECKS, the ARTIFACT that proves it passed, and rough COST. Cost is relative - a rung is "expensive" if it consumes a scheduled GPU/cluster job.
| Rung | Checks | Proof artifact | Cost |
|---|---|---|---|
| R0 | Experiment card / protocol defined: question, locked primary metric, baseline, decision rule | The experiment card itself | minutes, no compute |
| R1 | Code / import / config / static sanity: it imports, config parses, paths resolve, seeds set | Clean import + config dump + linter | seconds, dev node |
| R2 | Shape / dtype / device / one-batch forward+backward: loss is finite, gradients flow | Logged shapes/dtypes/device + one non-NaN loss + non-zero grads | seconds-minutes, dev node |
| R3 | Tiny overfit: a handful of examples driven to ~zero loss (or memorized) | Loss curve collapsing to near-zero on the tiny set | minutes, dev node / 1 GPU |
| R4 | Real launcher smoke run: the ACTUAL launch path (local GPU or cluster smoke job) starts, checkpoints, logs, resumes - on tiny data/steps | Launcher exits 0, checkpoint written, logs/metrics emitted | one short job |
| R5 | Short pilot / early signal: real data, real config, truncated budget; metric is moving the right way and is stable | Early metric curve vs. baseline on the locked metric | a fraction of a full run |
| R6 | Full run / full study: the locked protocol at full budget, seeds/sweeps as specified | Complete metrics across all planned seeds/conditions | the expensive job(s) |
| R7 | Result review / decision memo: compare to baseline under the locked primary metric, decide | Decision memo (handed to superpowers-ml:ml-result-review) | analysis time |
R0-R3 should run on your dev node in well under an hour. R4-R6 consume scheduled jobs - protect them. R7 costs analysis time, not compute.
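The R2 and R3 checks can be sketched without any framework. This is a minimal illustration, assuming a toy linear model trained with plain gradient descent as a stand-in for your real model. The function `tiny_overfit`, its parameters, and the pass thresholds are hypothetical names and values chosen for this example, not part of the ladder itself.

```python
import math

def tiny_overfit(xs, ys, lr=0.1, steps=500):
    """Fit y = w*x + b on a handful of points; return (losses, first_grads)."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses, first_grads = [], None
    for _ in range(steps):
        # Forward pass: mean squared error on the tiny set.
        preds = [w * x + b for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # Backward pass: analytic gradients of the MSE w.r.t. w and b.
        gw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        gb = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        if first_grads is None:
            first_grads = (gw, gb)  # keep the first step's grads for the R2 check
        losses.append(loss)
        w -= lr * gw
        b -= lr * gb
    return losses, first_grads

# A handful of examples generated from y = 2x + 1 (illustrative data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
losses, first_grads = tiny_overfit(xs, ys)

# R2: the first loss is finite and at least one gradient is non-zero.
r2_green = math.isfinite(losses[0]) and any(g != 0.0 for g in first_grads)
# R3: the loss collapses on the tiny set (threshold is an assumption).
r3_green = losses[-1] < 1e-3 * losses[0]
```

With a real model the same two booleans would come from your framework's loss and gradient tensors; the point of the sketch is that both artifacts are cheap to produce and cheap to assert on.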
Promotion and Stop Criteria
State the gate between EACH adjacent rung before you launch. A rung promotes ONLY when its proof artifact exists and is green.
- R0 -> R1: card has a single locked primary metric and an explicit decision rule. No metric, no launch.
- R1 -> R2: imports clean, config parses, paths/seeds resolved.
- R2 -> R3: one batch forward+backward, loss finite, gradients non-zero on the right devices.
- R3 -> R4: tiny set overfits. If it CANNOT overfit a handful of examples, the model/loss/data wiring is broken - fix before any GPU job.
- R4 -> R5: real launcher runs end-to-end on tiny budget, checkpoints, resumes, logs the metric.
- R5 -> R6: early signal is stable and not obviously worse than baseline. Promote to the expensive full run only here.
- R6 -> R7: all planned seeds/conditions complete; metrics intact, no silent failures.
STOP rule at every rung: if the proof artifact is missing or red, do NOT spend the next rung. Fix the cheap thing first.
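The gates and the STOP rule can be expressed as a small check over recorded artifacts. This is a sketch under assumptions: the `artifacts` dictionary, `highest_green`, and `may_launch` are hypothetical names, and how you decide an artifact is "green" is left to your own checks.

```python
RUNGS = ["R0", "R1", "R2", "R3", "R4", "R5", "R6", "R7"]

def highest_green(artifacts):
    """Return the last rung reached before the first missing or red artifact.

    `artifacts` maps a rung name to True (artifact exists and is green) or
    False (red); a rung that is absent from the dict counts as missing.
    """
    reached = None
    for rung in RUNGS:
        if artifacts.get(rung) is not True:
            break  # STOP rule: do not spend the next rung
        reached = rung
    return reached

def may_launch(rung, artifacts):
    """A rung may run only if every rung before it is green."""
    idx = RUNGS.index(rung)
    return all(artifacts.get(r) is True for r in RUNGS[:idx])

# Example: R3 is red, and someone ran R4 anyway.
status = {"R0": True, "R1": True, "R2": True, "R3": False, "R4": True}
```

In this example `highest_green(status)` is `"R2"` and `may_launch("R4", status)` is `False`: a green R4 does not count while R3 is red.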
Policy
State these plainly and hold to them:
- A cheap rung PASSING is a PRECONDITION, not proof of final success. R3 overfitting tells you the plumbing works; it tells you NOTHING about whether the method beats the baseline.
- A cheap rung FAILING means do NOT spend the expensive rung. Diagnose and fix at the lowest rung that reproduces the problem.
- Early signal (R5) may REJECT an obviously bad run - kill it, save the budget. Early signal must NOT claim victory. Only R6/R7 under the locked primary metric can support a "beats baseline" claim.
- Never skip a rung to "save time." A skipped fast rung is paid back as a burned slow job.
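The asymmetry of the R5 policy (a pilot may reject, but may not claim victory) can be encoded in the return type of the decision itself. This is a hedged sketch: `pilot_decision` and `margin` are hypothetical, and the size of the margin that counts as "obviously worse" is an assumption you would fix on the experiment card. Stability of the curve is not checked here.

```python
def pilot_decision(pilot, baseline, margin, higher_is_better=True):
    """Return "kill" or "promote" for an R5 pilot; never a victory claim.

    `pilot` and `baseline` are values of the locked primary metric at the
    same truncated budget. `margin` is how far behind the baseline the pilot
    must fall to count as obviously bad.
    """
    # Signed gap, oriented so that a positive value means the pilot is ahead.
    gap = (pilot - baseline) if higher_is_better else (baseline - pilot)
    if gap < -margin:
        return "kill"  # obviously bad: save the budget
    # Even a large positive gap only earns promotion to R6, not a claim.
    return "promote"
```

Note that there is deliberately no third outcome such as "win": a pilot that looks far ahead of the baseline still only returns `"promote"`.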
Scheduler-Agnostic
The launcher at R4+ is whatever your cluster uses. Slurm is ONE example (e.g. a small sbatch smoke job), not an assumption - the ladder is identical for a bare torchrun, a Ray/Kubernetes submission, or a plain SSH-to-GPU script. Design the rungs around YOUR launch path; do not hard-code a scheduler.
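One way to keep the R4 check scheduler-agnostic is to treat the launcher as an opaque command prefix. The sketch below assumes a blocking launch path (the command returns when the job finishes) and uses a one-line Python script as a stand-in for the training job; `smoke_run`, `launch_prefix`, and the checkpoint file name are hypothetical. It covers only "exits 0" and "checkpoint written", not resume or metric logging.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def smoke_run(launch_prefix, job_cmd, checkpoint_path, timeout_s=600):
    """Run job_cmd behind launch_prefix; green only if exit 0 and checkpoint exists."""
    result = subprocess.run(
        list(launch_prefix) + list(job_cmd),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode == 0 and Path(checkpoint_path).exists()

# Stand-in "training job": writes a checkpoint file to the path it is given.
workdir = Path(tempfile.mkdtemp())
ckpt = workdir / "ckpt.txt"
job = [
    sys.executable, "-c",
    "import sys, pathlib; pathlib.Path(sys.argv[1]).write_text('step=1')",
    str(ckpt),
]

# An empty prefix means "run directly"; on a cluster the prefix would be
# whatever your own launch path requires.
r4_green = smoke_run([], job, ckpt)
```

Because the prefix is just a list of arguments, swapping schedulers changes one value and leaves the pass criteria untouched.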
Reporting
Report progress as the highest GREEN rung, and separate what is supported from what is not:
Verified through R3 (tiny overfit). Not yet verified by smoke run, pilot, or full study.
Never report a method as beating a baseline without R6 (or an equivalent full evaluation) under the locked primary metric.
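The report line can be generated from the highest green rung so the wording stays consistent. This is a minimal sketch: the short labels for R0-R2 and the helper names are assumptions made for illustration, and R7 is omitted because it is the review of R6 rather than a further verification step.

```python
LABELS = [
    ("R0", "experiment card"),
    ("R1", "static sanity"),
    ("R2", "one-batch check"),
    ("R3", "tiny overfit"),
    ("R4", "smoke run"),
    ("R5", "pilot"),
    ("R6", "full study"),
]

def join_or(items):
    """Join labels as 'a', 'a or b', or 'a, b, or c'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} or {items[1]}"
    return ", ".join(items[:-1]) + f", or {items[-1]}"

def report(highest):
    """Build the status line for the highest green rung (R0-R6)."""
    names = [rung for rung, _ in LABELS]
    i = names.index(highest)
    line = f"Verified through {highest} ({LABELS[i][1]})."
    remaining = [label for _, label in LABELS[i + 1:]]
    if remaining:
        line += f" Not yet verified by {join_or(remaining)}."
    return line

def may_claim_beats_baseline(highest):
    """Only a completed full study can support a 'beats baseline' claim."""
    return highest == "R6"
```

For example, `report("R3")` reproduces the sample line above, and `may_claim_beats_baseline("R5")` is `False`.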