tooluniverse-diagnostic-test-evaluation


Diagnostic test / biomarker accuracy — sensitivity, specificity, PPV, NPV, likelihood ratios, accuracy from a 2x2 table; ROC curve, AUC, and the optimal cutoff (Youden) for a continuous biomarker; and post-test probability via Bayes. Use when you have test results vs a gold standard (binary 2x2, or a continuous score + true labels) and need to judge how good the test is, pick a threshold, or compute the probability of disease given a result. Emphasizes the prevalence-dependence of PPV/NPV.

By mims-harvard. Updated 6/4/2026

name: tooluniverse-diagnostic-test-evaluation
description: "Diagnostic test / biomarker accuracy: sensitivity, specificity, PPV, NPV, likelihood ratios, accuracy from a 2x2 table; ROC curve, AUC, and the optimal cutoff (Youden) for a continuous biomarker; and post-test probability via Bayes. Use when you have test results vs a gold standard (binary 2x2, or a continuous score + true labels) and need to judge how good the test is, pick a threshold, or compute the probability of disease given a result. Emphasizes the prevalence-dependence of PPV/NPV."
disable-model-invocation: true

Diagnostic Test / Biomarker Accuracy Evaluation

Judge how well a test or biomarker discriminates disease — at a fixed cutoff (2×2) or across all cutoffs (ROC) — and turn a result into a probability of disease.

Which case are you in?

| You have… | Go to |
|---|---|
| A 2×2 table (TP/FP/TN/FN) at a fixed cutoff | Step 1 (Epidemiology_diagnostic) |
| A continuous biomarker score + true labels | Step 2 (ROC / AUC / Youden, Python) |
| A test's sens/spec + a patient's pre-test probability | Step 3 (Epidemiology_bayesian) |

Step 1 — Fixed-cutoff metrics from a 2×2 table

tu run Epidemiology_diagnostic '{"operation":"diagnostic","tp":90,"fp":10,"tn":180,"fn":20}'

Returns sensitivity, specificity, PPV, NPV, accuracy, LR_pos, LR_neg, and the sample prevalence.

| Metric | Question it answers | Depends on prevalence? |
|---|---|---|
| Sensitivity = TP/(TP+FN) | Of those WITH disease, what fraction test positive? | No |
| Specificity = TN/(TN+FP) | Of those WITHOUT disease, what fraction test negative? | No |
| PPV = TP/(TP+FP) | If positive, what's the chance of disease? | Yes, strongly |
| NPV = TN/(TN+FN) | If negative, what's the chance of being disease-free? | Yes |
| LR+ = sens/(1−spec) | How much a positive raises the odds of disease | No |
| LR− = (1−sens)/spec | How much a negative lowers the odds | No |
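The formulas in the table can be checked by hand. The sketch below applies them to the Step 1 example counts; the function name and dictionary keys are illustrative choices, not the output schema of Epidemiology_diagnostic.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Fixed-cutoff accuracy metrics from 2x2 counts.

    Assumes every row and column of the table is non-empty, so no
    denominator is zero. Key names are illustrative only.
    """
    total = tp + fp + tn + fn
    sens = tp / (tp + fn)   # P(test positive | disease)
    spec = tn / (tn + fp)   # P(test negative | no disease)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
        "prevalence": (tp + fn) / total,   # prevalence in this sample only
    }


m = diagnostic_metrics(tp=90, fp=10, tn=180, fn=20)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

For these counts, sensitivity is about 0.818, specificity about 0.947, PPV and NPV are both 0.900, LR+ is about 15.5, LR− about 0.19, and the sample prevalence about 0.367.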

The PPV/NPV trap. Sensitivity and specificity are properties of the test; PPV and NPV also depend on the disease prevalence in the tested population. A test with excellent sens/spec can still have poor PPV in a low-prevalence (screening) setting. Never quote PPV/NPV from a case-control design: the case-to-control ratio (often 50/50) is fixed by the investigators, so the apparent prevalence is artificial. Compute them instead for the real-world prevalence with Epidemiology_bayesian (Step 3). Report sensitivity, specificity, and likelihood ratios as the prevalence-independent summary.
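To see the size of the effect, the sketch below holds sensitivity and specificity fixed at 90%/95% and varies only the prevalence. The function name and the three prevalence values are illustrative assumptions.

```python
def predictive_values(sens, spec, prevalence):
    """PPV and NPV for a test applied at a given disease prevalence."""
    tp = sens * prevalence                  # expected fraction: true positives
    fp = (1 - spec) * (1 - prevalence)      # false positives
    tn = spec * (1 - prevalence)            # true negatives
    fn = (1 - sens) * prevalence            # false negatives
    return tp / (tp + fp), tn / (tn + fn)


# Same test, three settings: case-control-like, clinic, screening.
for prev in (0.50, 0.10, 0.01):
    ppv, npv = predictive_values(sens=0.90, spec=0.95, prevalence=prev)
    print(f"prevalence {prev:.2f}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

PPV falls from about 0.95 at 50% prevalence to about 0.67 at 10% and about 0.15 at 1%, while NPV rises toward 1.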

Step 2 — ROC / AUC / optimal cutoff for a continuous biomarker

When the test is a continuous score, evaluate across all thresholds:

python skills/tooluniverse-diagnostic-test-evaluation/scripts/roc_analysis.py --input scores.csv
# scores.csv columns: label (1=disease, 0=healthy), score (continuous biomarker)

It reports the AUC with a bootstrap 95% CI, the Youden-optimal cutoff (the threshold that maximizes J = sensitivity + specificity − 1) together with its sens/spec, and a text ROC curve.
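For intuition about what AUC and the Youden cutoff measure, here is a dependency-free sketch on toy data. It is not the code in roc_analysis.py; it assumes higher scores indicate disease and that a score at or above the cutoff counts as positive, and the data are made up.

```python
def auc_rank(labels, scores):
    """AUC as the probability that a random diseased subject scores higher
    than a random healthy one (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(labels, scores):
    """Return (cutoff, sens, spec) maximizing J = sens + spec - 1 over the
    observed scores; ties go to the lowest cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cut)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cut)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cut, sens, spec)
    return best[1:]


# Toy data: label 1 = disease, 0 = healthy (same coding as scores.csv).
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [1.0, 1.5, 2.0, 2.5, 3.5, 2.7, 3.0, 3.8, 4.0, 4.5]

auc = auc_rank(labels, scores)
cutoff, sens, spec = youden_cutoff(labels, scores)
print(f"AUC {auc:.2f}; Youden cutoff {cutoff} (sens {sens:.2f}, spec {spec:.2f})")
```

On this toy set the AUC is 0.92 and the Youden cutoff is 2.7, with sensitivity 1.00 and specificity 0.80.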

| AUC | Discrimination |
|---|---|
| 0.5 | no better than chance |
| 0.7–0.8 | acceptable |
| 0.8–0.9 | excellent |
| >0.9 | outstanding |

  • The Youden cutoff weights sensitivity and specificity equally; if false negatives and false positives have different costs, pick the threshold from the clinical tradeoff, not Youden.
  • Once you choose a cutoff, build its 2×2 and run Step 1 for the fixed-cutoff metrics at that operating point.
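One way to make the clinical tradeoff explicit is to assign a cost to each error type and pick the cutoff with the lowest total cost on the sample. The sketch below does that and then builds the 2×2 counts for Step 1. The cost weights, data, and function names are hypothetical, and the cost is computed at the sample's own case mix rather than at a target prevalence.

```python
import json


def confusion_at(labels, scores, cutoff):
    """2x2 counts when a score at or above the cutoff is called positive."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        if s >= cutoff:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


def min_cost_cutoff(labels, scores, cost_fn, cost_fp):
    """Observed score minimizing cost_fn*FN + cost_fp*FP (ties: lowest cutoff)."""
    def cost(cut):
        c = confusion_at(labels, scores, cut)
        return cost_fn * c["fn"] + cost_fp * c["fp"]
    return min(sorted(set(scores)), key=cost)


labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [1.0, 1.5, 2.0, 2.5, 3.5, 2.2, 3.0, 3.8, 4.0, 4.5]

# Hypothetical weights: a missed case is five times as costly as a false alarm.
screening_cut = min_cost_cutoff(labels, scores, cost_fn=5, cost_fp=1)
# The reverse: a false alarm is five times as costly as a missed case.
confirm_cut = min_cost_cutoff(labels, scores, cost_fn=1, cost_fp=5)
print(screening_cut, confirm_cut)

# Counts at the chosen operating point, arranged like the Step 1 arguments.
payload = {"operation": "diagnostic", **confusion_at(labels, scores, screening_cut)}
print(json.dumps(payload))
```

With these toy numbers, penalizing false negatives selects the lower cutoff (2.2) and penalizing false positives selects the higher one (3.8).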

Step 3 — Post-test probability (Bayes)

Turn a result into the probability of disease for a given pre-test probability/prevalence:

tu run Epidemiology_bayesian '{"operation":"bayesian","prevalence":0.10,
  "sensitivity":0.90,"specificity":0.95,"test_result":"positive"}'

Returns pre_test_odds, the LR, and post_test_probability. This is how you get the real-world PPV: plug the true prevalence in. (Example: a 90%/95% test at 10% prevalence gives a post-positive probability of only ~67%, not 95%.)
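The arithmetic behind the example is the odds form of Bayes' rule: post-test odds = pre-test odds × LR. The sketch below reproduces it by hand; the function name is illustrative and this is not the tool's implementation.

```python
def post_test_probability(prevalence, sens, spec, test_result="positive"):
    """Probability of disease after a test result, via odds and likelihood ratio."""
    if test_result == "positive":
        lr = sens / (1 - spec)        # LR+
    else:
        lr = (1 - sens) / spec        # LR-
    pre_odds = prevalence / (1 - prevalence)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


p_pos = post_test_probability(0.10, 0.90, 0.95, "positive")
p_neg = post_test_probability(0.10, 0.90, 0.95, "negative")
print(f"after a positive: {p_pos:.3f}; after a negative: {p_neg:.3f}")
```

Here the pre-test odds are 1:9 and LR+ is 18, so the post-test odds are 2:1 and the probability is about 0.667. After a negative result the probability of disease drops to about 0.012.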

Gotchas (state these)

  • PPV/NPV without a stated prevalence are meaningless — always give the prevalence they assume.
  • AUC ignores the operating point. A high AUC doesn't tell you the test is useful at the threshold you'll actually use — report sens/spec at the chosen cutoff too.
  • Class imbalance. With very few positives, ROC/AUC can look good while PPV is poor; consider a precision-recall curve and always report PPV at the real prevalence.
  • Spectrum bias. Sens/spec measured on clearly-sick vs clearly-healthy subjects overestimate real-world performance on borderline cases.
  • Optimistic cutoffs. A cutoff chosen on the same data it is evaluated on gives optimistic sens/spec; validate the threshold on a held-out set.
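The held-out check in the last point can be sketched as follows: choose the Youden cutoff on a training half, then report sens/spec on the other half at that fixed cutoff. The data are synthetic (normal scores with an assumed one-SD shift between groups), and the split sizes and helper names are illustrative.

```python
import random


def sens_spec(labels, scores, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    return tp / n_pos, tn / n_neg


def youden_cutoff(labels, scores):
    """Observed score that maximizes J = sens + spec - 1."""
    return max(sorted(set(scores)),
               key=lambda c: sum(sens_spec(labels, scores, c)) - 1)


# Synthetic biomarker: healthy ~ N(0, 1), diseased ~ N(1, 1). Assumption only.
rng = random.Random(0)
neg = [rng.gauss(0.0, 1.0) for _ in range(100)]
pos = [rng.gauss(1.0, 1.0) for _ in range(100)]

# Stratified 50/50 split so both halves contain both classes.
train_y, train_s = [0] * 50 + [1] * 50, neg[:50] + pos[:50]
test_y, test_s = [0] * 50 + [1] * 50, neg[50:] + pos[50:]

cutoff = youden_cutoff(train_y, train_s)          # chosen on training data only
train_sens, train_spec = sens_spec(train_y, train_s, cutoff)
test_sens, test_spec = sens_spec(test_y, test_s, cutoff)

print(f"cutoff {cutoff:.2f}")
print(f"train: sens {train_sens:.2f}, spec {train_spec:.2f}")
print(f"test:  sens {test_sens:.2f}, spec {test_spec:.2f}")
```

The held-out numbers are the ones to report. On any single split they can land above or below the training numbers by chance, but on average the training estimate is the more flattering one.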

Honest limitations

  • These are discrimination/accuracy metrics, not calibration — a well-discriminating model can still output poorly-calibrated probabilities.
  • A single AUC compares nothing; to compare two tests on the same patients, use a paired AUC test (DeLong) — beyond the basic script here.
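As a rough stand-in when a DeLong implementation is not at hand, a paired bootstrap gives a percentile interval for the AUC difference by resampling patients and scoring both tests on each resample. This is a different technique from DeLong, shown here on synthetic data; the function names, effect sizes, and resample count are assumptions.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: P(diseased score > healthy score), ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Observed AUC(A) - AUC(B) and a 95% percentile bootstrap interval.

    Patients are resampled with replacement; each resample keeps both tests'
    scores for the same patient, which preserves the pairing.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip resamples that lack one of the classes
            a = [scores_a[i] for i in idx]
            b = [scores_b[i] for i in idx]
            diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return auc(labels, scores_a) - auc(labels, scores_b), lo, hi


# Synthetic paired data: test A separates the groups more than test B.
rng = random.Random(1)
labels = [1] * 30 + [0] * 30
scores_a = [rng.gauss(1.5 * y, 1.0) for y in labels]
scores_b = [rng.gauss(0.5 * y, 1.0) for y in labels]

diff, lo, hi = paired_bootstrap_auc_diff(labels, scores_a, scores_b)
print(f"AUC difference {diff:.3f}, 95% bootstrap interval ({lo:.3f}, {hi:.3f})")
```

An interval that excludes zero suggests the two tests differ in discrimination on these patients; for a formal test, DeLong remains the standard reference method.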

Related skills

  • tooluniverse-statistical-modeling — logistic regression that produces the score, ORs.
  • tooluniverse-epidemiological-analysis — population-level risk, screening program metrics.
  • tooluniverse-meta-analysis — pool diagnostic accuracy across studies.
Install via CLI
npx skills add https://github.com/mims-harvard/ToolUniverse --skill tooluniverse-diagnostic-test-evaluation