result-validation-loop - SKILL.md Agent Skill

name: result-validation-loop description: Validate results through statistical testing, ROPE judgment, reproducibility re-runs, and final synthesis version: 1.0.0 category: experiment-execution type: tactic orchestrates:

result-collection
statistical-testing
reproducibility-verification
execution-synthesis dependencies: sops:
- execution-synthesis
- reproducibility-verification
- result-collection
- statistical-testing

Tactic: Result Validation Loop

Orchestration Pattern

FUNCTION result_validation_loop(raw_results, experiment_design):
    // Phase 1: Collect and structure
    structured = SPAWN result-collection(raw_results)
    VALIDATE structured.complete
    
    // Phase 2: Statistical testing
    stats = SPAWN statistical-testing(structured, experiment_design.hypotheses)
    
    // Phase 3: ROPE Judgment
    rope = experiment_design.rope  // pre-registered ROPE bounds
    
    IF stats.posterior_in_rope > 0.95:
        judgment = "ACCEPT_NULL"  // practically equivalent
    ELIF stats.posterior_above_rope > 0.95:
        judgment = "REJECT_NULL"  // meaningful effect detected
    ELSE:
        judgment = "UNDECIDED"  // need more data
    END
    
    // Phase 4: Reproducibility verification
    IF judgment != "UNDECIDED":
        repro = SPAWN reproducibility-verification(
            experiment_design,
            n_reruns = 3,
            seeds = [42, 123, 7]
        )
        
        IF repro.icc < 0.5:
            judgment = "NOT_REPRODUCIBLE"
        ELIF repro.icc < 0.75:
            judgment = judgment + "_PARTIAL_REPRO"
        ELSE:
            judgment = judgment + "_REPRODUCIBLE"
        END
    ELSE:
        // Undecided — still run reproducibility to check if issue is noise
        repro = SPAWN reproducibility-verification(
            experiment_design,
            n_reruns = 5,  // more runs for undecided cases
            seeds = [42, 123, 7, 256, 999]
        )
        
        IF repro.variance_explained_by_seed > 0.5:
            judgment = "HIGH_VARIANCE_ACROSS_SEEDS"
        END
    END
    
    // Phase 5: Synthesis
    report = SPAWN execution-synthesis({
        structured_results: structured,
        statistical_tests: stats,
        judgment: judgment,
        reproducibility: repro,
        experiment_design: experiment_design
    })
    
    RETURN report
END

Decision Criteria

Condition	Action
Results incomplete (missing tasks)	Report gaps, analyze available data
P(in ROPE) > 95%	Accept null (no practical difference)
P(above ROPE) > 95%	Reject null (meaningful effect)
Neither threshold met	Undecided — recommend more data
ICC > 0.75	Results reproducible
ICC 0.5-0.75	Partially reproducible — flag
ICC < 0.5	Not reproducible — investigate sources of variance
High seed-dependent variance	Report instability, recommend investigation

Quality Gates

Before producing final synthesis:

All statistical tests report effect sizes (not just p-values)
Confidence intervals are provided for all estimates
ROPE was defined before analysis (not post-hoc)
At least 3 reproducibility re-runs completed
Limitations are explicitly stated
Next steps are actionable

Available SOPs

Optional, no fixed order; the final leaf is always a sop.

SOP	When to use
execution-synthesis	Synthesize complete execution report from all results, tests, and reproducibility data
reproducibility-verification	Verify result reproducibility via re-runs with different seeds and ICC comparison
result-collection	Collect experiment outputs — metrics, logs, artifacts — into structured result set
statistical-testing	Execute statistical tests — bootstrap, permutation, Bayesian ROPE — on experiment results