bio-machine-learning-model-validation


Validates predictive models on omics and biomedical data with nested cross-validation, group/batch/temporal-aware splits, the full data-leakage taxonomy, probability calibration, decision-curve net benefit, optimism correction, sample-size planning, and TRIPOD+AI reporting. Use when estimating model performance honestly, choosing a CV scheme, detecting leakage, or judging whether reported discrimination means the model is actually useful. For feature selection itself see machine-learning/biomarker-discovery; for confirmatory-trial inference see clinical-biostatistics/trial-reporting.

By GPTomics. Updated 6/6/2026

name: bio-machine-learning-model-validation
description: Validates predictive models on omics and biomedical data with nested cross-validation, group/batch/temporal-aware splits, the full data-leakage taxonomy, probability calibration, decision-curve net benefit, optimism correction, sample-size planning, and TRIPOD+AI reporting. Use when estimating model performance honestly, choosing a CV scheme, detecting leakage, or judging whether reported discrimination means the model is actually useful. For feature selection itself see machine-learning/biomarker-discovery; for confirmatory-trial inference see clinical-biostatistics/trial-reporting.
tool_type: python
primary_tool: sklearn

Version Compatibility

Reference examples tested with: numpy 1.26+, scikit-learn 1.4+ (note 1.6/1.8 API changes below).

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures

scikit-learn drift to watch: CalibratedClassifierCV(cv='prefit') was deprecated in 1.6 and removed in 1.8 (it now raises; wrap a fitted model in sklearn.frozen.FrozenEstimator instead); ensemble default became 'auto' in 1.6; method='temperature' was added in 1.8. If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
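
One way to absorb the cv='prefit' change without pinning a version is to probe for sklearn.frozen at import time. The sketch below is illustrative only: calibrate_prefit is a hypothetical helper name, and the synthetic data stands in for a real train/calibration split.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

try:                                   # scikit-learn >= 1.6
    from sklearn.frozen import FrozenEstimator
except ImportError:                    # older releases: legacy cv='prefit' path
    FrozenEstimator = None


def calibrate_prefit(fitted_model, X_cal, y_cal, method="sigmoid"):
    """Hypothetical helper: recalibrate an already-fitted model on held-out data."""
    if FrozenEstimator is not None:
        cal = CalibratedClassifierCV(FrozenEstimator(fitted_model), method=method)
    else:
        cal = CalibratedClassifierCV(fitted_model, cv="prefit", method=method)
    return cal.fit(X_cal, y_cal)


# Synthetic stand-in data: first half trains the model, second half calibrates it.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
base = LogisticRegression(max_iter=1000).fit(X[:200], y[:200])
calibrated = calibrate_prefit(base, X[200:], y[200:])
proba = calibrated.predict_proba(X[200:])
```

The base model is never refit in either branch; only the calibrator is learned from the calibration rows.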

Model Validation for Biomedical and Omics Data

"Validate my omics classifier honestly" -> Keep every data-dependent step inside the resampling loop, never use the same data to both choose and grade, and report calibration and net benefit, not just AUC.

  • Nested CV: GridSearchCV (inner) wrapped by cross_val_score (outer)
  • Group/structured: StratifiedGroupKFold, TimeSeriesSplit
  • Calibration: calibration_curve, brier_score_loss, CalibratedClassifierCV

The Single Most Important Modern Insight -- A Reported Number Is an Honest Estimate Only If Nothing Leaked and Nothing Was Graded on What Was Chosen

A reported performance number is a claim about a data-generating process that will never recur. Almost every inflated result in ML-for-biology traces to one of two root causes: information from the test distribution leaked into model construction, or the same data was used to both choose and grade a decision. A clean train/test split is necessary but nowhere near sufficient -- the leakage has usually already contaminated the test set (a scaler fit on all data, a duplicate patient, ComBat run across the split). Leakage causes a reproducibility crisis across ML-based science (Kapoor 2023), and the bias is largest exactly when the true signal is weakest -- the omics regime.

A second, equally load-bearing insight: discrimination (AUC/C) and calibration (do predicted probabilities match observed frequencies) are orthogonal. AUC is invariant to any strictly increasing transform of the score, so it is blind to calibration. For any decision that uses the probability itself, calibration -- not AUC -- is the property that matters, and it is the one routinely ignored (Van Calster 2019, "the Achilles heel").
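
A small simulation makes the orthogonality concrete. This is a sketch on synthetic data (the cubing transform and sample size are arbitrary choices): the distorted scores rank samples identically, so AUC does not move, while the Brier score gets worse.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=5000)   # true event probabilities
y = rng.binomial(1, p_true)                   # outcomes drawn from them

# Strictly increasing on (0, 1): ranking is preserved, probabilities are not.
p_distorted = p_true ** 3

auc_true = roc_auc_score(y, p_true)
auc_distorted = roc_auc_score(y, p_distorted)
brier_true = brier_score_loss(y, p_true)
brier_distorted = brier_score_loss(y, p_distorted)

print(f"AUC   true={auc_true:.3f}  distorted={auc_distorted:.3f}")
print(f"Brier true={brier_true:.3f}  distorted={brier_distorted:.3f}")
```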

Leakage Taxonomy

| Leakage type | How it happens in omics | Symptom | Prevention |
|---|---|---|---|
| Preprocessing (most common, most missed) | z-scoring, quantile/library normalization, ComBat/SVA, PCA, kNN/MICE imputation, VST fit on the full dataset before splitting | Test performance suspiciously close to train; collapses on external data | Fit every transform inside the CV fold via a Pipeline |
| Feature selection (severe special case) | top-k DE genes / highest-variance / univariate filter chosen on all samples, then CV only the classifier | Near-perfect CV from pure noise; unstable selected set | Selection lives in the CV fold (Ambroise 2002) |
| Target / label | a feature is a proxy for or downstream of the outcome (post-diagnosis labs, treatment-derived fields, a collection-site that tracks case/control) | One feature dominates implausibly; fails when removed | Audit temporal/causal admissibility; exclude post-outcome variables |
| Group / patient / replicate | same patient, tumor, organoid, or technical replicate in train and test; KFold scatters them | Inflated metrics that vanish under leave-one-group-out | Split by the highest independent unit (GroupKFold/StratifiedGroupKFold) |
| Batch | batch correlated with outcome and not respected in the split, or ComBat across the train/test boundary | Model discriminates batches not biology; external batch destroys it | Block the split by batch; never run unsupervised correction across the split |
| Temporal | random-splitting time-ordered data; future-period statistics standardize the past | Backtest beats prospective deployment | Time-based split (TimeSeriesSplit); never shuffle first |
| Duplicate / homolog | near-identical samples, augmented copies, public-dataset overlap, homologous sequences across the split | Memorization passes as generalization | Deduplicate / cluster-then-split before CV |
| Test-reuse / threshold | repeatedly peeking to pick features, thresholds, "best epoch"; choosing the classification threshold on the test set | Irreproducible SOTA; fragile config | One locked test set; all tuning + thresholds inside nested CV |
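
The feature-selection row can be reproduced on pure noise. The sketch below is illustrative (dimensions, k, and seeds are arbitrary): the labels carry no signal, yet selecting features on all samples before cross-validating the classifier typically yields a high AUC, while the same selector inside a Pipeline stays near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))      # pure noise: no feature relates to y
y = np.repeat([0, 1], 30)
cv = StratifiedKFold(5, shuffle=True, random_state=0)

# Leaky: the selector sees every sample, including future test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_auc = cross_val_score(LogisticRegression(max_iter=5000), X_leaky, y,
                            cv=cv, scoring="roc_auc").mean()

# Honest: the selector is refit on the training portion of each fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LogisticRegression(max_iter=5000))])
honest_auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()

print(f"leaky AUC={leaky_auc:.2f}  honest AUC={honest_auc:.2f}")
```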

Decision Tree by Scenario

| Scenario / generalization question | Recommended scheme | Why |
|---|---|---|
| "A new sample like training" (and any tuning occurs) | Nested CV: inner GridSearchCV, outer cross_val_score, Pipeline inside | Tuning and grading on the same CV is optimistic (Cawley-Talbot 2010) |
| "A new patient" (repeated measures) | GroupKFold/StratifiedGroupKFold by patient/donor | The unit of independence is not the row |
| "A new hospital/site" (transportability) | Leave-one-site-out (internal-external CV) | Approximates external validation |
| "Next year" (time-ordered) | TimeSeriesSplit forward-chaining | Random folds leak the future |
| Small n (dozens), need a stable estimate | RepeatedStratifiedKFold (5x10) with an interval | A single CV is one high-variance draw |
| Probabilities will drive a decision | Add calibration + decision-curve net benefit | AUC is blind to calibration and utility |
| Final evidence for a clinical model | External/temporal validation + TRIPOD+AI report | Internal CV cannot detect a whole-dataset confound |
| Choosing the features themselves | -> machine-learning/biomarker-discovery | Selection is its own discipline (run it inside the fold) |
| Confirmatory trial inference (HR, p-value) | -> clinical-biostatistics/trial-reporting | Estimand is a treatment effect, not a prediction |
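
The leave-one-site-out row can be sketched with LeaveOneGroupOut. The site labels and synthetic data below are assumptions for illustration; in practice the grouping vector would come from the cohort metadata.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
site = np.repeat(["site_A", "site_B", "site_C"], 100)   # hypothetical site labels

logo = LeaveOneGroupOut()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# One score per held-out site; the scaler is refit on the remaining sites each time.
site_auc = cross_val_score(model, X, y, cv=logo, groups=site, scoring="roc_auc")

# Record which site each fold held out, and confirm no site spans train and test.
held_out = []
for train_idx, test_idx in logo.split(X, y, groups=site):
    assert set(site[train_idx]).isdisjoint(site[test_idx])
    held_out.append(str(np.unique(site[test_idx])[0]))

for name, auc in zip(held_out, site_auc):
    print(f"held out {name}: AUC={auc:.3f}")
```

Reporting the per-site scores separately, rather than only their mean, shows whether one site drives the result.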

Nested Cross-Validation

Goal: Estimate the performance of the whole procedure (tuning + fit) without optimistic bias.

Approach: The inner loop does all tuning, feature selection, and threshold choice; the outer loop grades the winning configuration once on a fold it never touched. The reported number is the aggregate over outer folds; it answers "if I run this pipeline on new data, what do I get?"

from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([('scaler', StandardScaler()),
                 ('select', SelectKBest(f_classif)),         # re-fit per inner fold -> no leakage
                 ('clf', LogisticRegression(max_iter=5000))])
grid = {'select__k': [10, 50, 200], 'clf__C': [0.01, 0.1, 1]}

inner = StratifiedKFold(5, shuffle=True, random_state=0)
outer = StratifiedKFold(5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, grid, cv=inner, scoring='roc_auc')
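# X: (n_samples, n_features) array, y: binary labels -- both assumed already loaded
scores = cross_val_score(search, X, y, cv=outer, scoring='roc_auc')   # approximately unbiased for the whole procedure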
print(f'Nested AUC: {scores.mean():.3f} +/- {scores.std():.3f}')

Nested CV is needed whenever model selection happens -- even informal "tried three options, kept the best." Flat CV with tuning is a known reviewer red flag (Varma-Simon 2006).

Group-Aware, Structured, and Small-Sample CV

Goal: Match the CV scheme to the real unit of independence and get a variance-aware estimate.

Approach: Pass a grouping vector so no group spans folds; for tiny n, repeat stratified k-fold and report the spread, not a bare number. Standard KFold assumes i.i.d. rows, which biomedical data almost never satisfy.

from sklearn.model_selection import StratifiedGroupKFold, RepeatedStratifiedKFold, cross_val_score

groups = meta['patient_id'].values                          # meta: per-sample metadata DataFrame (assumed); multiple samples per patient
gcv = StratifiedGroupKFold(n_splits=5)                      # group-disjoint AND class-balanced
g_auc = cross_val_score(pipe, X, y, cv=gcv, groups=groups, scoring='roc_auc')

rcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
r_auc = cross_val_score(pipe, X, y, cv=rcv, scoring='roc_auc')   # report an interval

Leave-one-out is high-variance and degenerate for ranking metrics (AUC is undefined on a size-1 test fold) -- prefer repeated stratified k-fold. The .632+ bootstrap (Efron-Tibshirani 1997) is a defensible alternative: it was designed to correct the plain .632 estimator's optimism for zero-apparent-error learners, though it can still be biased for heavily overfit models. For internal validation of a single fixed model, bootstrap optimism correction is the cleaner choice.
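
Reporting the spread from repeated k-fold could look like the following sketch. The synthetic data, the model, and the percentile summary are illustrative assumptions; the interval describes split-to-split variability, not a formal confidence interval.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=80, n_features=50, n_informative=5,
                           random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=5000))

rcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
fold_auc = cross_val_score(model, X, y, cv=rcv, scoring="roc_auc")   # 50 scores

# Splits are yielded repeat by repeat, so each row below is one full 5-fold pass.
repeat_auc = fold_auc.reshape(10, 5).mean(axis=1)
lo, hi = np.percentile(repeat_auc, [2.5, 97.5])

print(f"AUC {repeat_auc.mean():.3f} (across-repeat range {lo:.3f} to {hi:.3f})")
```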

Calibration vs Discrimination

Goal: Verify that predicted probabilities mean what they say, not just that they rank correctly.

Approach: Plot a reliability curve, score it with the proper Brier score, and recalibrate on a held-out fold if needed. AUC measures only ranking; the calibration slope (<1 signals overfitting) and the reliability curve localize the failure.

from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.frozen import FrozenEstimator                  # sklearn >=1.6

prob_true, prob_pred = calibration_curve(y_test, p_test, n_bins=10, strategy='quantile')
brier = brier_score_loss(y_test, p_test)                    # proper score: calibration + refinement

# Recalibrate a fitted model on a disjoint calibration fold (cv='prefit' deprecated in 1.6, removed in 1.8):
calibrated = CalibratedClassifierCV(FrozenEstimator(fitted_model), method='isotonic')
calibrated.fit(X_cal, y_cal)                                # X_cal disjoint from train and test

Calibration cautions: use strategy='quantile' (equal-mass bins) under imbalance; do not report a single Expected Calibration Error as ground truth -- equal-width ECE is biased and reports error even for perfectly calibrated models (Roelofs 2022). Use Platt (method='sigmoid') for small calibration sets, isotonic for hundreds-plus points. Recalibrating on the test set is leakage.
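
The calibration slope can be estimated by regressing the outcome on the logit of the predicted probability. The sketch below is one possible implementation, not a reference one: calibration_slope is a hypothetical helper, and a very large C is used so the default L2 penalty is negligible (an approximation to an unpenalized fit).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def calibration_slope(y_true, p_pred, eps=1e-12):
    """Hypothetical helper: slope of a logistic regression of y on logit(p)."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    # Large C makes the L2 penalty negligible (approximately unpenalized).
    fit = LogisticRegression(C=1e6, max_iter=1000).fit(logit, y_true)
    return fit.coef_[0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.5, size=20000)       # true log-odds
y = rng.binomial(1, sigmoid(z))

slope_calibrated = calibration_slope(y, sigmoid(z))          # well calibrated
slope_overconfident = calibration_slope(y, sigmoid(2 * z))   # too extreme

print(f"calibrated slope={slope_calibrated:.2f}, "
      f"overconfident slope={slope_overconfident:.2f}")
```

Doubling the log-odds mimics an overfit model whose predictions are too extreme, which is what a slope below 1 flags.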

Net benefit / Decision Curve Analysis (Vickers-Elkin 2006): net_benefit = TP/n - (FP/n)*(pt/(1-pt)), where the threshold probability pt encodes the relative harm of a false positive. Plot it against treat-all and treat-none references; a model is clinically useful only where it sits above both. DCA requires good calibration to be valid and is the bridge from statistical performance to clinical usefulness -- a model can have high AUC yet zero net benefit at every plausible threshold.
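
The net-benefit formula above translates directly into a few lines of NumPy. This is a minimal sketch: the function names and the simulated risks are assumptions, and plotting is left out.

```python
import numpy as np


def net_benefit(y_true, p_pred, pt):
    """Net benefit of treating everyone whose predicted risk is >= pt."""
    y_true = np.asarray(y_true)
    treat = np.asarray(p_pred) >= pt
    n = y_true.size
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - (fp / n) * (pt / (1 - pt))


def net_benefit_treat_all(y_true, pt):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prev = np.mean(y_true)
    return prev - (1 - prev) * (pt / (1 - pt))


# Simulated, well-calibrated risks (illustrative only).
rng = np.random.default_rng(0)
p = rng.beta(2, 6, size=5000)
y = rng.binomial(1, p)

thresholds = np.arange(0.05, 0.55, 0.05)
nb_model = np.array([net_benefit(y, p, pt) for pt in thresholds])
nb_all = np.array([net_benefit_treat_all(y, pt) for pt in thresholds])
nb_none = np.zeros_like(thresholds)          # treat-none is always zero

# The model is useful only at thresholds where it beats both references.
useful = (nb_model > nb_all) & (nb_model > nb_none)
```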

Metric Selection in Imbalanced Data

| Metric | Use | Trap |
|---|---|---|
| Accuracy | Almost never headline it under imbalance | At 5% prevalence, "always negative" scores 95% |
| AUC / C | Discrimination, prevalence-independent | Blind to calibration; not a usefulness measure |
| AUPRC (average precision) | Rare-positive problems | Baseline is the prevalence, not 0.5 -- state it (Saito 2015) |
| Brier / log-loss | When probabilities are used | Proper; not comparable across prevalences without scaling |
| MCC | Balanced single-threshold summary | Still threshold-dependent (Chicco 2020) |
| F1 | Retrieval-style problems | Ignores true negatives; assumes a cost ratio |
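
The accuracy and AUPRC traps are easy to confirm numerically. The sketch below uses simulated labels at roughly 5% prevalence and uninformative random scores (both illustrative assumptions).

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n, prevalence = 20000, 0.05
y = rng.binomial(1, prevalence, size=n)
random_scores = rng.uniform(size=n)          # a classifier with no signal

observed_prev = y.mean()
acc_always_negative = accuracy_score(y, np.zeros(n, dtype=int))
auc_random = roc_auc_score(y, random_scores)
ap_random = average_precision_score(y, random_scores)

print(f"prevalence={observed_prev:.3f}")
print(f"accuracy of 'always negative'={acc_always_negative:.3f}")
print(f"random-score AUC={auc_random:.3f}, AUPRC={ap_random:.3f}")
```

A no-signal model lands near 0.5 on AUC but near the prevalence on AUPRC, which is why the AUPRC baseline has to be stated alongside the score.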

The multiple-threshold problem: reporting the best F1/accuracy over thresholds is optimistic, and choosing that threshold on the test set is leakage. Pick the operating point on a separate fold (or by net benefit), then report the locked-threshold metric once; prefer threshold-free curves (ROC, PR, calibration) plus one pre-specified operating point.
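
Locking the operating point before touching the test set might look like this. It is a sketch under assumptions: synthetic data, a single validation split instead of nested CV, and F1 as the pre-specified selection rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.3, stratify=y_dev, random_state=1)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Choose the threshold on the validation split only.
p_val = model.predict_proba(X_val)[:, 1]
candidates = np.linspace(0.05, 0.95, 19)
val_f1 = [f1_score(y_val, (p_val >= t).astype(int), zero_division=0)
          for t in candidates]
locked_threshold = float(candidates[int(np.argmax(val_f1))])

# Apply the locked threshold to the test set exactly once.
p_test = model.predict_proba(X_test)[:, 1]
test_f1 = f1_score(y_test, (p_test >= locked_threshold).astype(int),
                   zero_division=0)
print(f"locked threshold={locked_threshold:.2f}, test F1={test_f1:.3f}")
```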

External Validation, Optimism, Sample Size, and TRIPOD+AI

  • Internal vs external. Internal validation (bootstrap optimism correction, repeated/nested CV; Steyerberg 2001) estimates reproducibility on new patients from the same source; external validation (different time, place, setting) estimates transportability and is the usual point of failure -- calibration degrades first (slope <1, intercept shift). When multiple cohorts exist, internal-external CV (leave-one-cluster-out) is the recommended scheme.
  • Optimism and shrinkage. Apparent performance overstates the future; the gap (optimism) grows with more predictors, more flexibility, smaller n. Remedy with a uniform shrinkage factor (the bootstrap calibration slope) or penalized estimation. A development-data calibration slope <1 is the optimism signal.
  • Sample size. The "10 events per variable" heuristic (Peduzzi 1996) is obsolete; the standard is Riley et al.'s minimum-sample-size framework (2019, Stat Med Parts I-II), which sizes for shrinkage >=0.9 and precise risk estimation (pmsampsize). For p>>n omics these formulas are out of regime, which is precisely why heavy penalization + nested validation, not unpenalized multivariable fits, are mandatory.
  • Reporting. TRIPOD+AI (Collins 2024, BMJ 385:e078378) supersedes TRIPOD 2015 and is the 2024+ target for any biomedical predictive-model claim -- it demands data-splitting and leakage controls, calibration (not just discrimination), fairness/subgroup performance, and uncertainty. PROBAST+AI is the companion risk-of-bias appraisal.
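
Bootstrap optimism correction, mentioned in the bullets above, can be sketched as follows. The data are synthetic, 100 resamples is an arbitrary choice, and make_model is a hypothetical factory; the key point is that the entire modeling procedure is repeated inside each bootstrap sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_model():
    """Hypothetical factory: returns a fresh, unfitted copy of the full procedure."""
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))


X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)

# Apparent performance: fit and evaluate on the same data.
apparent = roc_auc_score(y, make_model().fit(X, y).predict_proba(X)[:, 1])

optimism = []
for _ in range(100):
    idx = rng.integers(0, len(y), size=len(y))       # sample rows with replacement
    if np.unique(y[idx]).size < 2:                   # AUC needs both classes
        continue
    boot = make_model().fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)             # how much this fit flatters itself

corrected = apparent - np.mean(optimism)
print(f"apparent AUC={apparent:.3f}, optimism-corrected AUC={corrected:.3f}")
```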

Per-Method Failure Modes

Preprocessing fit before the split

  • Trigger: StandardScaler().fit_transform(X) (or ComBat, PCA, imputation) on all data, then CV.
  • Mechanism: The fitted parameters encode the test rows.
  • Symptom: Train-test gap suspiciously small; performance drops on external data.
  • Fix: Put every transform in the Pipeline so fit only sees training folds.

Threshold or best-of-many chosen on the test set

  • Trigger: Reporting the best F1 over thresholds, or the best of several CV runs.
  • Mechanism: Each peek leaks; over many tries the test set becomes a training set.
  • Symptom: Irreproducible "SOTA"; a fresh test set disappoints.
  • Fix: Lock one test set, pre-specify metric and threshold rule, choose thresholds on a separate fold.

SMOTE/resampling to fix imbalance breaks calibration

  • Trigger: Oversampling/SMOTE for a risk model.
  • Mechanism: Changing training prevalence inflates minority-class probabilities; no AUC gain (van den Goorbergh 2022).
  • Symptom: Good AUC, badly miscalibrated risks.
  • Fix: Do not resample for probability models; move the threshold on a calibrated model. If resampled, use imblearn.pipeline.Pipeline (train-fold only).
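
The calibration damage is visible even with plain random oversampling, which is used here in place of SMOTE so the sketch needs only NumPy and scikit-learn. The data and the roughly 5% prevalence are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=10, n_informative=4,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y,
                                          random_state=0)

# Model fit on the training data at its natural prevalence.
plain = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

# Random oversampling of the minority class to 50/50 (training rows only).
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = rng.choice(minority, size=majority.size - minority.size, replace=True)
idx = np.concatenate([majority, minority, extra])
oversampled = LogisticRegression(max_iter=5000).fit(X_tr[idx], y_tr[idx])

prevalence = y_te.mean()
mean_risk_plain = plain.predict_proba(X_te)[:, 1].mean()
mean_risk_oversampled = oversampled.predict_proba(X_te)[:, 1].mean()

print(f"test prevalence={prevalence:.3f}")
print(f"mean predicted risk: plain={mean_risk_plain:.3f}, "
      f"oversampled={mean_risk_oversampled:.3f}")
```

The mean predicted risk of the plain model tracks the test prevalence, while the oversampled model overestimates risk across the board.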

LOO for a ranking metric

  • Trigger: Leave-one-out with AUC.
  • Mechanism: AUC is undefined within a size-1 fold; pooling OOF predictions then scoring once is not equivalent to averaging fold scores for non-decomposable metrics.
  • Symptom: Unstable or misleading AUC.
  • Fix: Use repeated stratified k-fold; reserve cross_val_predict for visuals, not the headline metric.

Quantitative Thresholds

| Threshold | Source | Rationale |
|---|---|---|
| Selection/preprocessing inside every fold; nested CV for tuning | Ambroise 2002; Varma-Simon 2006 | Same-data tune-and-grade is optimistic |
| Repeated 5-fold x ~10, report the spread | field standard | A single CV is one high-variance draw at small n |
| Calibration slope ~1; <1 means overfitting | Van Calster 2019 | Basis of shrinkage |
| Sample size from Riley framework (shrinkage >=0.9) | Riley 2019 | "10 EPV" is obsolete |
| Report per TRIPOD+AI | Collins 2024 | 2024+ standard: discrimination + calibration + fairness |

Common Errors

| Error / symptom | Cause | Solution |
|---|---|---|
| cv='prefit' warns or errors | Deprecated in 1.6, removed in 1.8 (now raises) | Wrap the fitted model in FrozenEstimator |
| Calibration looks perfect on the test set | Calibrated on the evaluation data | Calibrate on a disjoint fold |
| Scaler/selector fit outside the Pipeline | Preprocessing leakage | Move into the Pipeline passed to cross_val_score/GridSearchCV |
| groups= ignored | Not threaded to the splitter | Pass groups= to cross_validate/GridSearchCV.fit |
| cross_val_predict used as the headline AUC | Non-decomposable metric over pooled OOF | Average per-fold scores instead |

References

  • Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. 1996. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49:1373-1379.
  • Efron B, Tibshirani R. 1997. Improvements on cross-validation: the .632+ bootstrap method. J Am Stat Assoc 92:548-560.
  • Steyerberg EW, Harrell FE, Borsboom GJ, et al. 2001. Internal validation of predictive models. J Clin Epidemiol 54:774-781.
  • Ambroise C, McLachlan GJ. 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99:6562-6566.
  • Vickers AJ, Elkin EB. 2006. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 26:565-574.
  • Varma S, Simon R. 2006. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7:91.
  • Cawley GC, Talbot NLC. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11:2079-2107.
  • Saito T, Rehmsmeier M. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10:e0118432.
  • Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. 2019. Calibration: the Achilles heel of predictive analytics. BMC Med 17:230.
  • Riley RD, Snell KIE, Ensor J, et al. 2019. Minimum sample size for developing a multivariable prediction model: Parts I-II. Stat Med 38:1262-1296.
  • Chicco D, Jurman G. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:6.
  • Roelofs R, Cain N, Shlens J, Mozer MC. 2022. Mitigating bias in calibration error estimation. Proc AISTATS PMLR 151:4036-4054.
  • van den Goorbergh R, van Smeden M, Timmerman D, Van Calster B. 2022. The harm of class imbalance corrections for risk prediction models. J Am Med Inform Assoc 29:1525-1534.
  • Whalen S, Schreiber J, Noble WS, Pollard KS. 2022. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 23:169-181.
  • Kapoor S, Narayanan A. 2023. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4:100804.
  • Collins GS, Moons KGM, Dhiman P, et al. 2024. TRIPOD+AI statement. BMJ 385:e078378.

Related Skills

  • machine-learning/biomarker-discovery - Feature selection run inside the CV fold
  • machine-learning/omics-classifiers - Model training, calibration directions, and imbalance handling
  • machine-learning/survival-analysis - Validation metrics for time-to-event models
  • experimental-design/batch-design - Designing out batch-outcome confounding before analysis
  • experimental-design/multiple-testing - FDR control for high-dimensional testing
  • clinical-biostatistics/trial-reporting - Confirmatory-trial reporting and the prediction-vs-inference boundary
Install via CLI
npx skills add https://github.com/GPTomics/bioSkills --skill bio-machine-learning-model-validation