name: bio-machine-learning-model-validation
description: Validates predictive models on omics and biomedical data with nested cross-validation, group/batch/temporal-aware splits, the full data-leakage taxonomy, probability calibration, decision-curve net benefit, optimism correction, sample-size planning, and TRIPOD+AI reporting. Use when estimating model performance honestly, choosing a CV scheme, detecting leakage, or judging whether reported discrimination means the model is actually useful. For feature selection itself see machine-learning/biomarker-discovery; for confirmatory-trial inference see clinical-biostatistics/trial-reporting.
tool_type: python
primary_tool: sklearn
Version Compatibility
Reference examples tested with: numpy 1.26+, scikit-learn 1.4+ (note 1.6/1.8 API changes below).
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
scikit-learn drift to watch: CalibratedClassifierCV(cv='prefit') was deprecated in 1.6 and removed in 1.8 (it now raises; wrap a fitted model in sklearn.frozen.FrozenEstimator instead); ensemble default became 'auto' in 1.6; method='temperature' was added in 1.8. If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Model Validation for Biomedical and Omics Data
"Validate my omics classifier honestly" -> Keep every data-dependent step inside the resampling loop, never use the same data to both choose and grade, and report calibration and net benefit, not just AUC.
- Nested CV: GridSearchCV (inner) wrapped by cross_val_score (outer)
- Group/structured: StratifiedGroupKFold, TimeSeriesSplit
- Calibration: calibration_curve, brier_score_loss, CalibratedClassifierCV
The Single Most Important Modern Insight -- A Reported Number Is an Honest Estimate Only If Nothing Leaked and Nothing Was Graded on What Was Chosen
A reported performance number is a claim about a data-generating process that will never recur. Almost every inflated result in ML-for-biology traces to one of two root causes: information from the test distribution leaked into model construction, or the same data was used to both choose and grade a decision. A clean train/test split is necessary but nowhere near sufficient -- the leakage has usually already contaminated the test set (a scaler fit on all data, a duplicate patient, ComBat run across the split). Leakage causes a reproducibility crisis across ML-based science (Kapoor 2023), and the bias is largest exactly when the true signal is weakest -- the omics regime.
A second, equally load-bearing insight: discrimination (AUC/C) and calibration (do predicted probabilities match observed frequencies) are orthogonal. AUC is invariant to any monotone transform of the score, so it is blind to calibration. For any decision that uses the probability itself, calibration -- not AUC -- is the property that matters, and it is the one routinely ignored (Van Calster 2019, "the Achilles heel").
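The orthogonality claim can be checked on simulated data. The sketch below assumes synthetic risks drawn uniformly and an arbitrary cubic distortion (both illustrative choices, not from any real cohort): the distortion preserves the ranking, so AUC does not move, while the Brier score gets worse.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=5000)   # synthetic true risks (assumption)
y = rng.binomial(1, p_true)                   # outcomes drawn from those risks

p_calibrated = p_true        # predicted probability equals the true risk
p_distorted = p_true ** 3    # strictly increasing transform: identical ranking

auc_cal = roc_auc_score(y, p_calibrated)
auc_dis = roc_auc_score(y, p_distorted)
brier_cal = brier_score_loss(y, p_calibrated)
brier_dis = brier_score_loss(y, p_distorted)

print(f'AUC   calibrated {auc_cal:.3f} | distorted {auc_dis:.3f}')
print(f'Brier calibrated {brier_cal:.3f} | distorted {brier_dis:.3f}')
```

Any decision rule that reads the probability itself (for example "treat if risk exceeds 20%") would behave differently under the two score vectors even though their AUC is identical.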
Leakage Taxonomy
| Leakage type | How it happens in omics | Symptom | Prevention |
|---|---|---|---|
| Preprocessing (most common, most missed) | z-scoring, quantile/library normalization, ComBat/SVA, PCA, kNN/MICE imputation, VST fit on the full dataset before splitting | Test performance suspiciously close to train; collapses on external data | Fit every transform inside the CV fold via a Pipeline |
| Feature selection (severe special case) | top-k DE genes / highest-variance / univariate filter chosen on all samples, then CV only the classifier | Near-perfect CV from pure noise; unstable selected set | Selection lives in the CV fold (Ambroise 2002) |
| Target / label | a feature is a proxy for or downstream of the outcome (post-diagnosis labs, treatment-derived fields, a collection-site that tracks case/control) | One feature dominates implausibly; fails when removed | Audit temporal/causal admissibility; exclude post-outcome variables |
| Group / patient / replicate | same patient, tumor, organoid, or technical replicate in train and test; KFold scatters them | Inflated metrics that vanish under leave-one-group-out | Split by the highest independent unit (GroupKFold/StratifiedGroupKFold) |
| Batch | batch correlated with outcome and not respected in the split, or ComBat across the train/test boundary | Model discriminates batches not biology; external batch destroys it | Block the split by batch; never run unsupervised correction across the split |
| Temporal | random-splitting time-ordered data; future-period statistics standardize the past | Backtest beats prospective deployment | Time-based split (TimeSeriesSplit); never shuffle first |
| Duplicate / homolog | near-identical samples, augmented copies, public-dataset overlap, homologous sequences across the split | Memorization passes as generalization | Deduplicate / cluster-then-split before CV |
| Test-reuse / threshold | repeatedly peeking to pick features, thresholds, "best epoch"; choosing the classification threshold on the test set | Irreproducible SOTA; fragile config | One locked test set; all tuning + thresholds inside nested CV |
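The feature-selection row is the easiest one to reproduce. The sketch below is a minimal simulation under stated assumptions (a pure-noise matrix of 60 samples by 5000 features, top-20 selection, logistic regression): selecting on all samples before CV yields a high AUC from noise, while the same selection placed inside a Pipeline stays near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))    # pure-noise "expression matrix" (synthetic)
y = np.array([0, 1] * 30)          # labels unrelated to X by construction
cv = StratifiedKFold(5, shuffle=True, random_state=0)

# Leaky: choose the top 20 features using ALL samples, then cross-validate only the classifier
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_clf = Pipeline([('scaler', StandardScaler()),
                      ('clf', LogisticRegression(max_iter=5000))])
leaky_auc = cross_val_score(leaky_clf, X_leaky, y, cv=cv, scoring='roc_auc').mean()

# Honest: selection is a Pipeline step, so it is re-fit on each training fold
honest = Pipeline([('scaler', StandardScaler()),
                   ('select', SelectKBest(f_classif, k=20)),
                   ('clf', LogisticRegression(max_iter=5000))])
honest_auc = cross_val_score(honest, X, y, cv=cv, scoring='roc_auc').mean()

print(f'leaky AUC {leaky_auc:.2f} vs honest AUC {honest_auc:.2f}')
```

The only difference between the two estimates is where the selector is fit, which is why the prevention column asks for every data-dependent step to live inside the fold.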
Decision Tree by Scenario
| Scenario / generalization question | Recommended scheme | Why |
|---|---|---|
| "A new sample like training" (and any tuning occurs) | Nested CV: inner GridSearchCV, outer cross_val_score, Pipeline inside | Tuning and grading on the same CV is optimistic (Cawley-Talbot 2010) |
| "A new patient" (repeated measures) | GroupKFold/StratifiedGroupKFold by patient/donor | The unit of independence is not the row |
| "A new hospital/site" (transportability) | Leave-one-site-out (internal-external CV) | Approximates external validation |
| "Next year" (time-ordered) | TimeSeriesSplit forward-chaining | Random folds leak the future |
| Small n (dozens), need a stable estimate | RepeatedStratifiedKFold (5x10) with an interval | A single CV is one high-variance draw |
| Probabilities will drive a decision | Add calibration + decision-curve net benefit | AUC is blind to calibration and utility |
| Final evidence for a clinical model | External/temporal validation + TRIPOD+AI report | Internal CV cannot detect a whole-dataset confound |
| Choosing the features themselves | -> machine-learning/biomarker-discovery | Selection is its own discipline (run it inside the fold) |
| Confirmatory trial inference (HR, p-value) | -> clinical-biostatistics/trial-reporting | Estimand is a treatment effect, not a prediction |
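Two of the schemes in the table are easy to sanity-check before any model is fit. The sketch below assumes rows already sorted by collection time and a hypothetical four-site label vector; it verifies that TimeSeriesSplit never trains on the future and that LeaveOneGroupOut holds out exactly one whole site per fold.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, TimeSeriesSplit

n = 120
X = np.arange(n).reshape(-1, 1)                   # rows sorted by collection time (assumption)
site = np.repeat(['A', 'B', 'C', 'D'], n // 4)    # hypothetical site labels

# Forward chaining: every test block lies strictly after its training block
tscv = TimeSeriesSplit(n_splits=4)
forward_ok = all(train.max() < test.min() for train, test in tscv.split(X))

# Leave-one-site-out: each fold's test set contains a single site
logo = LeaveOneGroupOut()
held_out = [set(site[test]) for _, test in logo.split(X, groups=site)]

print('forward chaining respected:', forward_ok)
print('held-out sites per fold:', held_out)
```

Either splitter can be passed as cv= to cross_val_score; LeaveOneGroupOut additionally needs groups=site in that call.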
Nested Cross-Validation
Goal: Estimate the performance of the whole procedure (tuning + fit) without optimistic bias.
Approach: The inner loop does all tuning, feature selection, and threshold choice; the outer loop grades the winning configuration once on a fold it never touched. The reported number is the aggregate over outer folds; it answers "if I run this pipeline on new data, what do I get?"
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('scaler', StandardScaler()),
                 ('select', SelectKBest(f_classif)),  # re-fit per inner fold -> no leakage
                 ('clf', LogisticRegression(max_iter=5000))])
grid = {'select__k': [10, 50, 200], 'clf__C': [0.01, 0.1, 1]}
inner = StratifiedKFold(5, shuffle=True, random_state=0)
outer = StratifiedKFold(5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, grid, cv=inner, scoring='roc_auc')
scores = cross_val_score(search, X, y, cv=outer, scoring='roc_auc') # unbiased estimate
print(f'Nested AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
Nested CV is needed whenever model selection happens -- even informal "tried three options, kept the best." Flat CV with tuning is a known reviewer red flag (Varma-Simon 2006).
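Because the outer loop grades a procedure rather than one fixed configuration, it helps to inspect what the inner search chose in each outer fold. The following is a sketch on synthetic data (make_classification stands in for an omics matrix; the reduced grid is an assumption to keep it fast) using cross_validate with return_estimator=True.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an omics matrix (assumption: 120 samples, 300 features)
X, y = make_classification(n_samples=120, n_features=300, n_informative=10,
                           random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('select', SelectKBest(f_classif)),
                 ('clf', LogisticRegression(max_iter=5000))])
grid = {'select__k': [10, 50], 'clf__C': [0.1, 1]}
inner = StratifiedKFold(5, shuffle=True, random_state=0)
outer = StratifiedKFold(5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, grid, cv=inner, scoring='roc_auc')

# return_estimator=True keeps the fitted GridSearchCV from each outer fold
res = cross_validate(search, X, y, cv=outer, scoring='roc_auc', return_estimator=True)
outer_scores = res['test_score']
chosen = [est.best_params_ for est in res['estimator']]   # may differ across folds

for score, params in zip(outer_scores, chosen):
    print(f'outer AUC {score:.3f}  chosen {params}')
```

Disagreement between folds about the winning configuration is not a bug; it is a sign that the selection itself is unstable, and the outer score already accounts for it.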
Group-Aware, Structured, and Small-Sample CV
Goal: Match the CV scheme to the real unit of independence and get a variance-aware estimate.
Approach: Pass a grouping vector so no group spans folds; for tiny n, repeat stratified k-fold and report the spread, not a bare number. Standard KFold assumes i.i.d. rows, which biomedical data almost never satisfy.
from sklearn.model_selection import StratifiedGroupKFold, RepeatedStratifiedKFold, cross_val_score
groups = meta['patient_id'].values # multiple samples per patient
gcv = StratifiedGroupKFold(n_splits=5) # group-disjoint AND class-balanced
g_auc = cross_val_score(pipe, X, y, cv=gcv, groups=groups, scoring='roc_auc')
rcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
r_auc = cross_val_score(pipe, X, y, cv=rcv, scoring='roc_auc') # report an interval
Leave-one-out is high-variance and degenerate for ranking metrics (AUC is undefined on a size-1 test fold) -- prefer repeated stratified k-fold. The .632+ bootstrap (Efron-Tibshirani 1997) is a defensible alternative: it was designed to correct the optimism of the plain .632 estimator for zero-apparent-error learners, though it can still be biased for heavily overfit models. For internal validation of a single fixed model, bootstrap optimism correction is the cleaner choice.
Calibration vs Discrimination
Goal: Verify that predicted probabilities mean what they say, not just that they rank correctly.
Approach: Plot a reliability curve, score it with the proper Brier score, and recalibrate on a held-out fold if needed. AUC measures only ranking; the calibration slope (<1 signals overfitting) and the reliability curve localize the failure.
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.frozen import FrozenEstimator # sklearn >=1.6
prob_true, prob_pred = calibration_curve(y_test, p_test, n_bins=10, strategy='quantile')
brier = brier_score_loss(y_test, p_test) # proper score: calibration + refinement
# Recalibrate a fitted model on a disjoint calibration fold (cv='prefit' deprecated in 1.6, removed in 1.8):
calibrated = CalibratedClassifierCV(FrozenEstimator(fitted_model), method='isotonic')
calibrated.fit(X_cal, y_cal) # X_cal disjoint from train and test
Calibration cautions: use strategy='quantile' (equal-mass bins) under imbalance; do not report a single Expected Calibration Error as ground truth -- equal-width ECE is biased and reports error even for perfectly calibrated models (Roelofs 2022). Use Platt (method='sigmoid') for small calibration sets, isotonic for hundreds-plus points. Recalibrating on the test set is leakage.
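The calibration slope mentioned above can be estimated by logistic recalibration, that is, regressing the outcome on the logit of the predicted probability. The sketch below is one way to do it under stated assumptions: the helper name is hypothetical, a very large C is used to approximate an unpenalized fit across scikit-learn versions, and the intercept comes from the joint fit rather than from a slope-fixed calibration-in-the-large model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_slope_intercept(y_true, p_hat, eps=1e-6):
    """Regress the outcome on logit(p_hat); slope < 1 suggests predictions that are too extreme."""
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    # C=1e9 makes the L2 penalty negligible (assumption: stands in for an unpenalized fit)
    model = LogisticRegression(C=1e9, max_iter=1000).fit(logit, y_true)
    return float(model.coef_[0, 0]), float(model.intercept_[0])

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=20000)           # synthetic true risks
y = rng.binomial(1, p_true)
logit_true = np.log(p_true / (1 - p_true))
p_overconfident = 1 / (1 + np.exp(-2 * logit_true))    # logits doubled -> too extreme

slope_ok, _ = calibration_slope_intercept(y, p_true)
slope_over, _ = calibration_slope_intercept(y, p_overconfident)
print(f'slope for calibrated risks: {slope_ok:.2f}')
print(f'slope for overconfident risks: {slope_over:.2f}')
```

Doubling the logits is a stylized model of overfitting; in this construction the recovered slope should sit near one half.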
Net benefit / Decision Curve Analysis (Vickers-Elkin 2006): net_benefit = TP/n - (FP/n)*(pt/(1-pt)), where the threshold probability pt encodes the relative harm of a false positive. Plot it against treat-all and treat-none references; a model is clinically useful only where it sits above both. DCA requires good calibration to be valid and is the bridge from statistical performance to clinical usefulness -- a model can have high AUC yet zero net benefit at every plausible threshold.
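The net-benefit formula translates directly into a few lines of NumPy. The sketch below uses hypothetical function names and a ten-patient toy vector chosen so the arithmetic can be checked by hand; the treat-all reference follows from the same formula with everyone classified positive, and treat-none is zero by definition.

```python
import numpy as np

def net_benefit(y_true, p_hat, pt):
    """Net benefit of treating everyone with predicted risk >= pt."""
    y_true = np.asarray(y_true)
    treat = np.asarray(p_hat) >= pt
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - (fp / n) * (pt / (1 - pt))

def net_benefit_treat_all(y_true, pt):
    """Reference strategy: treat every patient regardless of the model."""
    prevalence = np.mean(y_true)
    return prevalence - (1 - prevalence) * (pt / (1 - pt))

# Toy data (illustrative only): 3 events among 10 patients
y = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0])
p = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.4, 0.3, 0.2, 0.1, 0.05])

for pt in (0.2, 0.5):
    nb_model = net_benefit(y, p, pt)
    nb_all = net_benefit_treat_all(y, pt)
    print(f'pt={pt}: model {nb_model:.3f}, treat-all {nb_all:.3f}, treat-none 0.000')
```

At pt = 0.5 the model treats three patients (2 TP, 1 FP), giving 2/10 - (1/10)(1) = 0.1, while treat-all gives 0.3 - 0.7 = -0.4. A decision curve is this calculation repeated over a range of plausible pt values.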
Metric Selection in Imbalanced Data
| Metric | Use | Trap |
|---|---|---|
| Accuracy | Almost never headline it under imbalance | At 5% prevalence, "always negative" scores 95% |
| AUC / C | Discrimination, prevalence-independent | Blind to calibration; not a usefulness measure |
| AUPRC (average precision) | Rare-positive problems | Baseline is the prevalence, not 0.5 -- state it (Saito 2015) |
| Brier / log-loss | When probabilities are used | Proper; not comparable across prevalences without scaling |
| MCC | Balanced single-threshold summary | Still threshold-dependent (Chicco 2020) |
| F1 | Retrieval-style problems | Ignores true negatives; assumes a cost ratio |
The multiple-threshold problem: reporting the best F1/accuracy over thresholds is optimistic, and choosing that threshold on the test set is leakage. Pick the operating point on a separate fold (or by net benefit), then report the locked-threshold metric once; prefer threshold-free curves (ROC, PR, calibration) plus one pre-specified operating point.
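Two of the traps in the table above can be reproduced in a few lines. The sketch assumes a synthetic label vector at 5% prevalence: a classifier that never predicts a case reaches 95% accuracy, and scores with no signal at all produce an AUPRC close to the prevalence rather than 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, balanced_accuracy_score

rng = np.random.default_rng(0)
y = np.zeros(20000, dtype=int)
y[:1000] = 1                                    # 5% prevalence (synthetic)

always_negative = np.zeros_like(y)
acc = accuracy_score(y, always_negative)        # rewards ignoring every case
bal_acc = balanced_accuracy_score(y, always_negative)

random_scores = rng.uniform(size=y.size)        # a "classifier" with no signal
auprc_random = average_precision_score(y, random_scores)

print(f'always-negative accuracy {acc:.2f}, balanced accuracy {bal_acc:.2f}')
print(f'no-signal AUPRC {auprc_random:.3f} at prevalence {y.mean():.3f}')
```

Stating the prevalence next to any reported AUPRC lets a reader judge the lift over this baseline.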
External Validation, Optimism, Sample Size, and TRIPOD+AI
- Internal vs external. Internal validation (bootstrap optimism correction, repeated/nested CV) estimates reproducibility on new patients from the same source; external validation (different time, place, setting) estimates transportability and is the usual point of failure -- calibration degrades first (slope <1, intercept shift). Internal-external CV (leave-one-cluster-out) is the recommendation when multiple cohorts exist (Steyerberg 2001).
- Optimism and shrinkage. Apparent performance overstates the future; the gap (optimism) grows with more predictors, more flexibility, smaller n. Remedy with a uniform shrinkage factor (the bootstrap calibration slope) or penalized estimation. A development-data calibration slope <1 is the optimism signal.
- Sample size. The "10 events per variable" heuristic (Peduzzi 1996) is obsolete; the standard is Riley et al.'s minimum-sample-size framework (2019, Stat Med Parts I-II), which sizes for shrinkage >=0.9 and precise risk estimation (pmsampsize). For p>>n omics these formulas are out of regime, which is precisely why heavy penalization + nested validation, not unpenalized multivariable fits, are mandatory.
- Reporting. TRIPOD+AI (Collins 2024, BMJ 385:e078378) supersedes TRIPOD 2015 and is the 2024+ target for any biomedical predictive-model claim -- it demands data-splitting and leakage controls, calibration (not just discrimination), fairness/subgroup performance, and uncertainty. PROBAST+AI is the companion risk-of-bias appraisal.
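Bootstrap optimism correction, named above as an internal-validation option, can be sketched as follows. This is a minimal illustration on synthetic data, not a full implementation: fit_model is a hypothetical placeholder that must contain the entire modelling procedure (any selection and tuning included), and 100 resamples is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=100, n_features=30, n_informative=5,
                           random_state=0)     # synthetic development cohort

def fit_model(X_fit, y_fit):
    # The whole modelling procedure must be repeated here (selection, tuning, fit)
    return LogisticRegression(max_iter=5000).fit(X_fit, y_fit)

# Apparent performance: model fit and scored on the same data
apparent = roc_auc_score(y, fit_model(X, y).predict_proba(X)[:, 1])

rng = np.random.default_rng(0)
optimism = []
for _ in range(100):
    idx = rng.integers(0, len(y), size=len(y))        # bootstrap resample with replacement
    if len(np.unique(y[idx])) < 2:
        continue                                      # skip degenerate single-class resamples
    boot = fit_model(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)              # how much each refit flatters itself

mean_optimism = float(np.mean(optimism))
corrected = apparent - mean_optimism
print(f'apparent {apparent:.3f}, optimism {mean_optimism:.3f}, corrected {corrected:.3f}')
```

The same loop works for the calibration slope in place of AUC, which is how the uniform shrinkage factor is usually obtained.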
Per-Method Failure Modes
Preprocessing fit before the split
- Trigger: StandardScaler().fit_transform(X) (or ComBat, PCA, imputation) on all data, then CV.
- Mechanism: The fitted parameters encode the test rows.
- Symptom: Test performance suspiciously close to training performance; sharp drop on external data.
- Fix: Put every transform in the Pipeline so fit only sees training folds.
Threshold or best-of-many chosen on the test set
- Trigger: Reporting the best F1 over thresholds, or the best of several CV runs.
- Mechanism: Each peek leaks; over many tries the test set becomes a training set.
- Symptom: Irreproducible "SOTA"; a fresh test set disappoints.
- Fix: Lock one test set, pre-specify metric and threshold rule, choose thresholds on a separate fold.
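The fix can be sketched as a three-way split: fit on the training part, choose the threshold on a validation part, and score the test part once with that threshold locked. The data, the 19-point threshold grid, and the helper f1_at are illustrative assumptions; the last line computes the test-tuned number only to show what not to report.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=900, n_features=20, weights=[0.8, 0.2],
                           random_state=0)      # synthetic, roughly 20% positives
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
p_val = model.predict_proba(X_val)[:, 1]
p_test = model.predict_proba(X_test)[:, 1]

grid = np.linspace(0.05, 0.95, 19)              # candidate thresholds

def f1_at(y_true, p, t):
    return f1_score(y_true, (p >= t).astype(int), zero_division=0)

# Choose the operating point on the validation fold only, then lock it
locked_t = max(grid, key=lambda t: f1_at(y_val, p_val, t))
locked_f1 = f1_at(y_test, p_test, locked_t)     # reported once

# For contrast only: the optimistic number obtained by tuning on the test set
peeked_f1 = max(f1_at(y_test, p_test, t) for t in grid)
print(f'locked threshold {locked_t:.2f}: F1 {locked_f1:.3f} (test-tuned F1 {peeked_f1:.3f})')
```

By construction the test-tuned value can never be lower than the locked one, which is exactly why it is not an honest estimate.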
SMOTE/resampling to fix imbalance breaks calibration
- Trigger: Oversampling/SMOTE for a risk model.
- Mechanism: Changing training prevalence inflates minority-class probabilities; no AUC gain (van den Goorbergh 2022).
- Symptom: Good AUC, badly miscalibrated risks.
- Fix: Do not resample for probability models; move the threshold on a calibrated model. If resampled, use imblearn.pipeline.Pipeline (train-fold only).
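The mechanism can be seen without SMOTE itself. The sketch below swaps in plain random oversampling with NumPy (an assumption made to avoid an extra dependency; SMOTE changes the training prevalence in the same way) on a synthetic rare-outcome problem, and compares the mean predicted risk on held-out data with the observed prevalence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=(n, 1))
risk = 1 / (1 + np.exp(-(-3 + x[:, 0])))       # synthetic true risk, rare outcome
y = rng.binomial(1, risk)
x_tr, y_tr, x_te, y_te = x[: n // 2], y[: n // 2], x[n // 2:], y[n // 2:]

# Plain fit at the natural prevalence
plain = LogisticRegression().fit(x_tr, y_tr)

# Random oversampling of the minority class to a 50/50 training set
pos, neg = np.flatnonzero(y_tr == 1), np.flatnonzero(y_tr == 0)
pos_up = rng.choice(pos, size=len(neg), replace=True)
idx = np.concatenate([neg, pos_up])
oversampled = LogisticRegression().fit(x_tr[idx], y_tr[idx])

prevalence = y_te.mean()
mean_plain = plain.predict_proba(x_te)[:, 1].mean()
mean_over = oversampled.predict_proba(x_te)[:, 1].mean()
print(f'observed prevalence {prevalence:.3f}')
print(f'mean predicted risk: plain {mean_plain:.3f}, oversampled {mean_over:.3f}')
```

The plain model's average prediction tracks the observed event rate, while the oversampled model's predictions are shifted upward across the board.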
LOO for a ranking metric
- Trigger: Leave-one-out with AUC.
- Mechanism: AUC is undefined within a size-1 fold; pooling OOF predictions then scoring once is not equivalent to averaging fold scores for non-decomposable metrics.
- Symptom: Unstable or misleading AUC.
- Fix: Use repeated stratified k-fold; reserve cross_val_predict for visuals, not the headline metric.
Quantitative Thresholds
| Threshold | Source | Rationale |
|---|---|---|
| Selection/preprocessing inside every fold; nested CV for tuning | Ambroise 2002; Varma-Simon 2006 | Same-data tune-and-grade is optimistic |
| Repeated 5-fold x ~10, report the spread | field standard | A single CV is one high-variance draw at small n |
| Calibration slope ~1; <1 means overfitting | Van Calster 2019 | Basis of shrinkage |
| Sample size from Riley framework (shrinkage >=0.9) | Riley 2019 | "10 EPV" is obsolete |
| Report per TRIPOD+AI | Collins 2024 | 2024+ standard: discrimination + calibration + fairness |
Common Errors
| Error / symptom | Cause | Solution |
|---|---|---|
| cv='prefit' warns or errors | Deprecated in 1.6, removed in 1.8 (now raises) | Wrap the fitted model in FrozenEstimator |
| Calibration looks perfect on the test set | Calibrated on the evaluation data | Calibrate on a disjoint fold |
| Scaler/selector fit outside the Pipeline | Preprocessing leakage | Move into the Pipeline passed to cross_val_score/GridSearchCV |
| groups= ignored | Not threaded to the splitter | Pass groups= to cross_validate/GridSearchCV.fit |
| cross_val_predict used as the headline AUC | Non-decomposable metric over pooled OOF | Average per-fold scores instead |
References
- Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. 1996. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49:1373-1379.
- Efron B, Tibshirani R. 1997. Improvements on cross-validation: the .632+ bootstrap method. J Am Stat Assoc 92:548-560.
- Steyerberg EW, Harrell FE, Borsboom GJ, et al. 2001. Internal validation of predictive models. J Clin Epidemiol 54:774-781.
- Ambroise C, McLachlan GJ. 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99:6562-6566.
- Vickers AJ, Elkin EB. 2006. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 26:565-574.
- Varma S, Simon R. 2006. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7:91.
- Cawley GC, Talbot NLC. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11:2079-2107.
- Saito T, Rehmsmeier M. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10:e0118432.
- Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. 2019. Calibration: the Achilles heel of predictive analytics. BMC Med 17:230.
- Riley RD, Snell KIE, Ensor J, et al. 2019. Minimum sample size for developing a multivariable prediction model: Parts I-II. Stat Med 38:1262-1296.
- Chicco D, Jurman G. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:6.
- Roelofs R, Cain N, Shlens J, Mozer MC. 2022. Mitigating bias in calibration error estimation. Proc AISTATS PMLR 151:4036-4054.
- van den Goorbergh R, van Smeden M, Timmerman D, Van Calster B. 2022. The harm of class imbalance corrections for risk prediction models. J Am Med Inform Assoc 29:1525-1534.
- Whalen S, Schreiber J, Noble WS, Pollard KS. 2022. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 23:169-181.
- Kapoor S, Narayanan A. 2023. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4:100804.
- Collins GS, Moons KGM, Dhiman P, et al. 2024. TRIPOD+AI statement. BMJ 385:e078378.
Related Skills
- machine-learning/biomarker-discovery - Feature selection run inside the CV fold
- machine-learning/omics-classifiers - Model training, calibration directions, and imbalance handling
- machine-learning/survival-analysis - Validation metrics for time-to-event models
- experimental-design/batch-design - Designing out batch-outcome confounding before analysis
- experimental-design/multiple-testing - FDR control for high-dimensional testing
- clinical-biostatistics/trial-reporting - Confirmatory-trial reporting and the prediction-vs-inference boundary