ml-result-review - SKILL.md Agent Skill

name: ml-result-review description: Use when concluding an ML experiment - produce a sober conclusion that separates confirmed results from exploratory observations and states the highest verified rung.

ML Result Review

Overview

You are concluding an ML experiment, not selling it. Passing tests showed the code path runs; they never showed the method works. Your job here is a SOBER conclusion that resists over-claiming, written so your human partner can decide on real evidence.

Core principle: Confirmed, exploratory, and failed are different things. Keep them apart.

This is downstream of superpowers-ml:ml-feedback-ladder (which owns the rungs) and superpowers-ml:verification-before-completion (evidence before claims). Cite the highest rung you actually reached, never the one you hoped for.

What "Confirmed" Means

A result is CONFIRMED only when the required full or equivalent evaluation completed under the declared primary metric and met the locked success criterion.

Met the locked criterion under the locked metric = confirmed.
Better-looking but not the locked test = exploratory, not confirmed.
Operational interruptions (preemption, requeue, node failure) are not scientific failures. A requeued run that still completed the locked evaluation is confirmed. Note the interruption; do not downgrade the result for it.

Never claim the method beats a baseline without the full or equivalent evaluation under the locked primary metric.

The Review Format

Write the conclusion under these explicit headings, in order:

CONFIRMED

What met the locked success criterion under the primary metric. One line per claim, each tied to the evaluation that proved it. If nothing is confirmed, write "None."

EXPLORATORY

Interesting observations that were NOT the locked test: secondary metrics, partial runs, eyeballed curves, single seeds. Label clearly so no one mistakes these for results.

FAILED / INCOMPLETE

Runs that did not finish the locked evaluation, and WHY. Distinguish scientific failure (method/config wrong) from operational interruption (infra). An operational interruption belongs here only if it actually prevented the evaluation from completing.

HIGHEST VERIFIED RUNG

State the top rung (R0-R7) actually reached and the artifact that proves it (log path, metric value, checkpoint, decision memo). Use the canonical wording from superpowers-ml:ml-feedback-ladder.

EVIDENCE GAPS

What remains unproven. What a skeptic would still doubt. Which rungs are not yet climbed.

RECOMMENDED NEXT

The single next experiment that closes the most important gap. Name its target rung. This is what you hand your human partner to decide on.

Force The Supported / Not-Supported Split

Every conclusion must separate what the evidence supports from what it does not. Model the language exactly:

Supported:         implementation trains stably through R5 pilot.
Not supported yet: variant beats baseline at full scale.
Exploratory:       early loss looked better, but this was not the
                   locked primary metric.

If you cannot phrase a claim as "Supported:" with its evidence, it is not supported. Demote it.

Reporting Line

Close with the standard rung statement, naming the highest GREEN rung and what is still unverified above it, e.g.:

Verified through R5 (short pilot). Not yet verified by full study.

Red Flags - STOP

"It works" / "the method wins" without the locked full evaluation
Quoting a secondary or early metric as if it were the locked one
Treating a preemption or requeue as a scientific failure
Claiming a rung you did not reach, or omitting the artifact for it
An empty EVIDENCE GAPS section (there are always gaps)
A conclusion with no RECOMMENDED NEXT

The Bottom Line

Separate confirmed from exploratory from failed. State the highest rung you actually climbed, with its artifact. Name what is still unproven.

A sober "not supported yet" is a real result. An over-claim is a liability.