name: tooluniverse-diagnostic-test-evaluation description: Diagnostic test / biomarker accuracy — sensitivity, specificity, PPV, NPV, likelihood ratios, accuracy from a 2x2 table; ROC curve, AUC, and the optimal cutoff (Youden) for a continuous biomarker; and post-test probability via Bayes. Use when you have test results vs a gold standard (binary 2x2, or a continuous score + true labels) and need to judge how good the test is, pick a threshold, or compute the probability of disease given a result. Emphasizes the prevalence-dependence of PPV/NPV. disable-model-invocation: true
Diagnostic Test / Biomarker Accuracy Evaluation
Judge how well a test or biomarker discriminates disease — at a fixed cutoff (2×2) or across all cutoffs (ROC) — and turn a result into a probability of disease.
Which case are you in?
| You have… | Go to |
|---|---|
| A 2×2 table (TP/FP/TN/FN) at a fixed cutoff | Step 1 (Epidemiology_diagnostic) |
| A continuous biomarker score + true labels | Step 2 (ROC / AUC / Youden, Python) |
| A test's sens/spec + a patient's pre-test probability | Step 3 (Epidemiology_bayesian) |
Step 1 — Fixed-cutoff metrics from a 2×2 table
tu run Epidemiology_diagnostic '{"operation":"diagnostic","tp":90,"fp":10,"tn":180,"fn":20}'
Returns sensitivity, specificity, PPV, NPV, accuracy, LR_pos, LR_neg, and the sample prevalence.
| Metric | Question it answers | Depends on prevalence? |
|---|---|---|
| Sensitivity = TP/(TP+FN) | Of those WITH disease, what fraction test positive? | No |
| Specificity = TN/(TN+FP) | Of those WITHOUT disease, what fraction test negative? | No |
| PPV = TP/(TP+FP) | If positive, what's the chance of disease? | Yes — strongly |
| NPV = TN/(TN+FN) | If negative, what's the chance of being disease-free? | Yes |
| LR+ = sens/(1−spec) | How much a positive raises the odds of disease | No |
| LR− = (1−sens)/spec | How much a negative lowers the odds | No |
The PPV/NPV trap. Sensitivity and specificity are properties of the test; PPV and NPV depend on the disease prevalence in the tested population. A test with great sens/spec has poor PPV in a low-prevalence (screening) setting. Never quote PPV/NPV from a case-control design (its 50/50 prevalence is artificial) — compute them for the real-world prevalence with
Epidemiology_bayesian(Step 3). Report sensitivity, specificity, and likelihood ratios as the prevalence-independent summary.
Step 2 — ROC / AUC / optimal cutoff for a continuous biomarker
When the test is a continuous score, evaluate across all thresholds:
python skills/tooluniverse-diagnostic-test-evaluation/scripts/roc_analysis.py --input scores.csv
# scores.csv columns: label (1=disease, 0=healthy), score (continuous biomarker)
It reports AUC (with a bootstrap 95% CI), the Youden-optimal cutoff (max sensitivity+specificity−1) and its sens/spec, and a text ROC curve.
| AUC | Discrimination |
|---|---|
| 0.5 | no better than chance |
| 0.7–0.8 | acceptable |
| 0.8–0.9 | excellent |
| >0.9 | outstanding |
- The Youden cutoff weights sensitivity and specificity equally; if false negatives and false positives have different costs, pick the threshold from the clinical tradeoff, not Youden.
- Once you choose a cutoff, build its 2×2 and run Step 1 for the fixed-cutoff metrics at that operating point.
Step 3 — Post-test probability (Bayes)
Turn a result into the probability of disease for a given pre-test probability/prevalence:
tu run Epidemiology_bayesian '{"operation":"bayesian","prevalence":0.10,
"sensitivity":0.90,"specificity":0.95,"test_result":"positive"}'
Returns pre_test_odds, the LR, and post_test_probability. This is how you get the real-world PPV: plug the true prevalence in. (Example: a 90%/95% test at 10% prevalence gives a post-positive probability of only ~67%, not 95%.)
Gotchas (state these)
- PPV/NPV without a stated prevalence are meaningless — always give the prevalence they assume.
- AUC ignores the operating point. A high AUC doesn't tell you the test is useful at the threshold you'll actually use — report sens/spec at the chosen cutoff too.
- Class imbalance. With very few positives, ROC/AUC can look good while PPV is poor; consider a precision-recall curve and always report PPV at the real prevalence.
- Spectrum bias. Sens/spec measured on clearly-sick vs clearly-healthy subjects overestimate real-world performance on borderline cases.
- Single cutoff chosen on the same data it's evaluated on is optimistic — validate the threshold on a held-out set.
Honest limitations
- These are discrimination/accuracy metrics, not calibration — a well-discriminating model can still output poorly-calibrated probabilities.
- A single AUC compares nothing; to compare two tests on the same patients, use a paired AUC test (DeLong) — beyond the basic script here.
Related skills
tooluniverse-statistical-modeling— logistic regression that produces the score, ORs.tooluniverse-epidemiological-analysis— population-level risk, screening program metrics.tooluniverse-meta-analysis— pool diagnostic accuracy across studies.