interpreting-biostatistics - SKILL.md Agent Skill

name: interpreting-biostatistics
language: en
description: Structures statistical analysis interpretation with p-value, confidence interval, and effect size reporting. Use when interpreting study statistics, explaining statistical results, or reviewing biostatistical analyses.
tags:
- analysis
- clinical-research
metadata:
  author: casemark
  practice_areas:
  - Clinical Research
  - Biostatistics
  - Regulatory Affairs
  document_types:
  - Interpretation Report
  skill_modes:
  - Analysis
  - Interpretation

Interpreting Biostatistics

Why This Skill Exists

Statistical results from clinical studies drive regulatory decisions, clinical practice changes, and billion-dollar market outcomes — yet misinterpretation of p-values, confidence intervals, and effect sizes is endemic in medical literature. A statistically significant result is not necessarily clinically meaningful; a non-significant result does not prove no effect. This skill provides a structured framework for interpreting biostatistical outputs from clinical trials and observational studies, following ASA (American Statistical Association) guidelines on p-values, ICH E9/E9(R1) statistical principles, and CONSORT/STROBE reporting standards.

Checkpoint A — Intake and Scoping

Required Intake Questions

What type of study produced these results (RCT, observational cohort, case-control, meta-analysis)?
What is the primary endpoint and its data type (continuous, binary, time-to-event, ordinal, count)?
What statistical test was used for the primary analysis?
Was the study designed as superiority, non-inferiority, or equivalence?
What was the pre-specified alpha level and was it adjusted for multiplicity or interim analyses?
What was the sample size and was the study adequately powered?
How was missing data handled (MMRM, multiple imputation, complete case)?
What analysis population was used (ITT, mITT, per-protocol)?
Are there subgroup analyses, and were they pre-specified?
What is the clinical context — what effect size would be considered clinically meaningful?

Required Source Documents

Statistical Analysis Plan (SAP) or published methods section
Results tables (primary endpoint, secondary endpoints, safety)
Forest plots (for subgroup analyses or meta-analyses)
Kaplan-Meier curves (for time-to-event data)
Study protocol (for design parameters and power calculation)
CONSORT, STROBE, or PRISMA checklist (depending on study type)

Step 1 — Interpret P-Values Correctly

Apply the ASA 2016 statement on statistical significance:

What a P-Value Is

The probability, under the null hypothesis and the assumed statistical model, of observing a test statistic as extreme as (or more extreme than) the one actually observed
It is NOT the probability that the null hypothesis is true
It is NOT the probability that the result occurred by chance
It is NOT the probability that the alternative hypothesis is false

Correct Interpretation Framework

P < alpha (typically 0.05): The observed data are unlikely under the null hypothesis. Reject the null in the context of the pre-specified test. This does not prove the treatment works — it means the data are inconsistent with the null at the chosen alpha level.
P ≥ alpha: The data are not sufficiently inconsistent with the null hypothesis to reject it. This does NOT prove no effect exists — it may reflect inadequate power, large variability, or a true null.
P-values near alpha (e.g., 0.04 vs. 0.06): These represent essentially the same level of evidence — do not treat the alpha boundary as a cliff. Report exact p-values; describe the strength of evidence qualitatively.
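
The 0.04 vs. 0.06 point can be made concrete with a minimal sketch. It assumes a two-sided z-test on a standard-normal test statistic; the function name `two_sided_p` and the two z values are illustrative, not taken from any study.

```python
from statistics import NormalDist

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Two hypothetical test statistics that land on either side of alpha = 0.05.
for z in (2.05, 1.88):
    p = two_sided_p(z)
    verdict = "reject" if p < 0.05 else "do not reject"
    print(f"z = {z:.2f}: p = {p:.3f} ({verdict} the null at alpha = 0.05)")
```

The two statistics differ by less than 0.2 standard errors, yet the binary verdicts differ, which is why exact p-values should be reported alongside the pre-specified decision.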

Common Misinterpretations to Flag

"The treatment showed a trend toward significance (p=0.08)" — this is not a valid statistical inference; either the result meets the pre-specified threshold or it does not
"The study proved the drug is safe" — clinical studies are powered for efficacy, rarely for safety; absence of evidence is not evidence of absence
"The groups were similar at baseline (all p>0.05)" — hypothesis tests on baseline characteristics are inappropriate; use standardized mean differences
"The subgroup analysis was significant" — unless pre-specified and adequately powered, this is hypothesis-generating only

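The baseline-comparison item above recommends standardized mean differences. A minimal sketch for a continuous covariate follows; it assumes the common definition that divides the difference in means by the square root of the average of the two sample variances. The function name and the ages are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """SMD for a continuous baseline covariate: difference in means divided
    by the square root of the average of the two sample variances."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2.0)
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical baseline ages in two arms (illustrative data only).
age_treated = [61, 58, 66, 70, 63, 59, 65, 62]
age_control = [60, 57, 64, 69, 61, 62, 66, 58]
smd = standardized_mean_difference(age_treated, age_control)
print(f"SMD for age = {smd:.2f}")
```

Unlike a p-value, the SMD does not shrink or grow with sample size, so it describes the magnitude of imbalance rather than its detectability. An absolute SMD above roughly 0.1 is a commonly cited rule of thumb for a noteworthy imbalance, not a formal threshold.
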
Step 2 — Interpret Confidence Intervals

Confidence intervals provide more information than p-values alone:

Correct Interpretation

A 95% CI means: if the study were repeated many times, 95% of the calculated CIs would contain the true parameter value
The CI width reflects precision — determined by sample size and variability
The CI location relative to the null value (0 for differences, 1 for ratios) determines statistical significance at the corresponding alpha level
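
The repeated-sampling interpretation can be checked by simulation. The sketch below assumes normally distributed data and a normal-approximation interval for a mean; all parameter values and the function name `coverage` are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def coverage(true_mean=1.0, sd=2.0, n=50, n_trials=2000, level=0.95, seed=0):
    """Fraction of simulated studies whose CI contains the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = 0
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        m = mean(sample)
        half_width = z * stdev(sample) / sqrt(n)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / n_trials

print(f"Empirical coverage of nominal 95% intervals: {coverage():.3f}")
```

The coverage statement describes the procedure across repetitions; any single computed interval either contains the true value or does not.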

Clinical Interpretation Framework

CI excludes the null AND lies entirely beyond the MCID on the side of benefit: Statistically significant AND clinically meaningful; strong evidence of clinically meaningful benefit
CI excludes the null BUT includes the MCID threshold: Statistically significant BUT uncertain clinical meaningfulness; the true effect could be smaller than the MCID
CI excludes the null BUT lies entirely between the null and the MCID: Statistically significant BUT clinically trivial
CI includes the null AND excludes the MCID (typically a narrow CI): Not statistically significant, and a clinically meaningful benefit is unlikely
CI includes both the null AND the MCID (typically a wide CI): Study may be underpowered; cannot conclude effect or no effect; inconclusive
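
The framework above can be sketched as a small decision function. This is an illustration under stated assumptions, not a validated rule: the effect is a difference where larger values favor treatment, the MCID lies above the null value, and the labels and the name `classify_ci` are hypothetical.

```python
def classify_ci(lower: float, upper: float, mcid: float, null: float = 0.0) -> str:
    """Classify a CI for a treatment difference against the null and the MCID.
    Assumes larger values favor treatment and mcid > null."""
    if lower > upper or mcid <= null:
        raise ValueError("need lower <= upper and mcid > null")
    if upper < null:
        return "significant harm"
    if lower > mcid:
        return "meaningful benefit"
    if lower > null:
        if upper < mcid:
            return "significant, clinically trivial"
        return "significant, meaningfulness uncertain"
    # From here on the CI includes the null value.
    if upper < mcid:
        return "not significant, meaningful benefit unlikely"
    return "inconclusive"

# Hypothetical intervals on a scale where the MCID is 2.0 points.
for ci in [(2.5, 6.0), (0.5, 4.0), (0.2, 1.5), (-0.5, 1.0), (-3.0, 5.0)]:
    print(ci, "->", classify_ci(*ci, mcid=2.0))
```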

Non-Inferiority Interpretation

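The treatment is non-inferior if the CI bound on the unfavorable side (the lower bound for a difference where higher is better, or the upper bound for a ratio of harmful events, depending on convention) does not cross the pre-specified non-inferiority margin (delta)
A study can demonstrate both non-inferiority AND superiority if the entire CI lies on the favorable side of the null value (0 for differences, 1 for ratios), which necessarily also places it beyond the NI margin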
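
A minimal sketch of the non-inferiority reading, under one explicit convention: the effect is a difference (treatment minus control) where higher is better, and the margin `delta` is a positive number, so the treatment is non-inferior when the lower CI bound is above `-delta`. The function name and labels are hypothetical; other conventions flip the signs.

```python
def ni_conclusion(ci_lower: float, ci_upper: float, delta: float) -> str:
    """Interpret a CI for (treatment - control), higher is better,
    against a non-inferiority margin of -delta."""
    if delta <= 0 or ci_lower > ci_upper:
        raise ValueError("need delta > 0 and ci_lower <= ci_upper")
    if ci_lower > 0:
        return "non-inferior and superior"
    if ci_lower > -delta:
        return "non-inferior"
    if ci_upper < -delta:
        return "inferior by more than the margin"
    return "non-inferiority not shown"

# Hypothetical intervals with a margin of 3 points.
for ci in [(0.4, 2.8), (-2.1, 1.5), (-4.0, 0.5), (-6.0, -3.5)]:
    print(ci, "->", ni_conclusion(*ci, delta=3.0))
```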

Step 3 — Interpret Effect Sizes

Effect sizes quantify the magnitude of the treatment effect:

For Continuous Outcomes

Measure	Calculation	Interpretation Anchors
Mean Difference (MD)	Treatment mean − Control mean	Clinical units; interpret against MCID
Standardized Mean Difference (SMD/Cohen's d)	MD / pooled SD	0.2 small, 0.5 medium, 0.8 large (Cohen's benchmarks — use domain-specific MCIDs when available)
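
The two continuous measures in the table can be computed together. The sketch below assumes two independent samples with a pooled standard deviation and uses a normal-approximation CI for the mean difference; for small samples a t-based interval would be wider. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def continuous_effect(treated, control, level=0.95):
    """Mean difference with an approximate CI, plus Cohen's d (pooled SD)."""
    n1, n2 = len(treated), len(control)
    md = mean(treated) - mean(control)
    pooled_var = ((n1 - 1) * variance(treated)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    pooled_sd = sqrt(pooled_var)
    se = pooled_sd * sqrt(1.0 / n1 + 1.0 / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # normal approximation
    return {"md": md, "ci": (md - z * se, md + z * se),
            "cohens_d": md / pooled_sd}

# Hypothetical change-from-baseline scores (illustrative data only).
result = continuous_effect([12, 15, 9, 14, 11, 13], [8, 10, 7, 12, 9, 11])
print(result)
```

The mean difference is read against the MCID in clinical units, while Cohen's d is unit-free and mainly useful for comparing across scales.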

For Binary Outcomes

Measure	Calculation	Interpretation
Risk Difference (RD)	Risk_treatment − Risk_control	Absolute effect; directly yields NNT = 1/|RD|
Relative Risk (RR)	Risk_treatment / Risk_control	Relative effect; RR < 1 means reduced risk
Odds Ratio (OR)	Odds_treatment / Odds_control	Approximates RR when the event rate is <10%; lies further from 1 than the RR for common outcomes
Number Needed to Treat (NNT)	1 / |RD|	How many patients need treatment for one additional good outcome; reported as NNH when the treatment increases the risk of harm
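
The binary measures in the table can be derived from a 2x2 table as follows. This is a sketch with hypothetical counts; it assumes every cell is non-zero so that all ratios are defined, and the name `binary_effects` is illustrative.

```python
def binary_effects(events_t: int, n_t: int, events_c: int, n_c: int) -> dict:
    """Risk difference, relative risk, odds ratio, and NNT from arm-level counts."""
    if not (0 < events_t < n_t and 0 < events_c < n_c):
        raise ValueError("sketch assumes 0 < events < n in both arms")
    risk_t = events_t / n_t
    risk_c = events_c / n_c
    rd = risk_t - risk_c
    rr = risk_t / risk_c
    odds_ratio = (risk_t / (1 - risk_t)) / (risk_c / (1 - risk_c))
    nnt = float("inf") if rd == 0 else 1.0 / abs(rd)
    return {"rd": rd, "rr": rr, "or": odds_ratio, "nnt": nnt}

# Hypothetical trial: 30/200 events on treatment vs. 50/200 on control.
effects = binary_effects(30, 200, 50, 200)
print({name: round(value, 3) for name, value in effects.items()})
```

With a control event rate of 25%, the odds ratio (about 0.53) is further from 1 than the relative risk (0.60), illustrating why an OR should not be read as an RR for common outcomes.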

For Time-to-Event Outcomes

Measure	Calculation	Interpretation
Hazard Ratio (HR)	Instantaneous rate_treatment / rate_control	HR < 1 means a lower instantaneous event rate in the treatment arm; a single HR summarizes the effect only under proportional hazards
Median Survival Difference	Median_treatment − Median_control	Clinical time units; meaningful when KM curves separate and the median is reached in both arms
Restricted Mean Survival Time (RMST)	Area under KM curve up to a specified time	Useful when PH assumption violated; yields time gained
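
RMST as the area under the Kaplan-Meier curve can be sketched in plain Python. The estimator below is a minimal illustration, assuming right-censored data given as (time, event) pairs with event = 1 for an observed event and 0 for censoring; the function names and data are hypothetical, and a validated survival library should be used for real analyses.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t, deaths, removed = data[i][0], 0, 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]   # count events at time t
            removed += 1           # events and censorings both leave the risk set
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

def rmst(curve, tau):
    """Area under the KM step function from 0 to the truncation time tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Hypothetical follow-up in months; event = 0 marks a censored patient.
km = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(km, rmst(km, tau=4))
```

The difference in RMST between arms at a pre-specified tau reads directly as average event-free time gained over that horizon.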

Always Interpret Effect Size in Context

A statistically significant but clinically trivial effect size may not support treatment adoption
A large effect size with a wide CI (few events, small study) needs confirmation
Relative measures (RR, HR) can exaggerate perceived benefit when baseline risk is low — always report absolute measures alongside
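
The last point is simple arithmetic: the same relative risk implies very different absolute benefit at different baseline risks. A short sketch, with hypothetical baseline risks and an illustrative function name:

```python
def nnt_from_rr(baseline_risk: float, rr: float) -> float:
    """NNT implied by applying a relative risk (< 1) to a baseline risk."""
    if not 0 < baseline_risk <= 1 or not 0 < rr < 1:
        raise ValueError("need 0 < baseline_risk <= 1 and 0 < rr < 1")
    absolute_risk_reduction = baseline_risk * (1.0 - rr)
    return 1.0 / absolute_risk_reduction

# The same 20% relative reduction (RR = 0.8) in two hypothetical populations.
for baseline in (0.20, 0.01):
    print(f"baseline risk {baseline:.0%}: NNT = {nnt_from_rr(baseline, 0.8):.0f}")
```

A 20% relative reduction corresponds to an NNT of 25 at a 20% baseline risk but 500 at a 1% baseline risk.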

Step 4 — Evaluate Multiplicity and Multiple Comparisons

When Multiplicity Adjustment Is Required

Multiple primary endpoints tested at the same alpha
Multiple secondary endpoints with inferential claims
Interim analyses that consume alpha
Multiple subgroup analyses with confirmatory intent

How to Verify Multiplicity Control

Check the SAP: Was a multiplicity strategy pre-specified (hierarchical testing, graphical approach, Hochberg, Holm)?
Verify implementation: Were adjusted p-values or family-wise error rate calculations correctly applied?
Identify claims: Any secondary endpoint presented as "statistically significant" must have been tested within the multiplicity framework; otherwise it is exploratory
Gate-keeping: Under a hierarchical (fixed-sequence or gate-keeping) strategy, if the primary endpoint failed, secondary endpoints cannot be formally tested regardless of their nominal p-values
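
Two of the strategies named above can be sketched briefly: Holm's step-down adjustment and fixed-sequence (hierarchical) testing. These are minimal illustrations with hypothetical p-values and function names; the SAP's actual procedure governs any real analysis.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

def fixed_sequence(p_values_in_order, alpha=0.05):
    """Test endpoints in the pre-specified order at full alpha;
    once one fails, every later endpoint is exploratory."""
    results, gate_open = [], True
    for p in p_values_in_order:
        gate_open = gate_open and p < alpha
        results.append(gate_open)
    return results

print(holm_adjust([0.01, 0.04, 0.03]))
# Primary endpoint misses (p = 0.08), so the secondary cannot be claimed.
print(fixed_sequence([0.08, 0.001]))
```

In the second call the secondary endpoint's nominal p = 0.001 does not support a confirmatory claim, because the gate closed at the primary endpoint.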

Common Red Flags

"Significant" secondary endpoints when the primary endpoint missed
Cherry-picked subgroups with no pre-specification or multiplicity adjustment
Multiple post-hoc analyses presented as "hypothesis-generating" but framed as though confirmatory

Step 5 — Interpret Subgroup Analyses

Apply the Oxman and Guyatt criteria for credible subgroup effects:

Pre-specified: Was the subgroup hypothesis specified before data analysis?
Limited number: Were only a small number of subgroup comparisons made?
Biological plausibility: Is there a credible biological mechanism for differential effect?
Consistent direction: Is the finding consistent across related studies?
Statistical evidence: Is there a significant interaction test (not just within-subgroup p-values)?
Within vs. between: Do both subgroups show internally consistent results, or does one drive the overall effect?

Interpretation Rules

Report interaction p-values, not within-subgroup p-values
Forest plots should show subgroup estimates with CIs and interaction tests
Subgroup analyses in a single trial are hypothesis-generating unless the trial was specifically designed and powered for subgroup detection
Never recommend treatment decisions based on a single subgroup finding from an exploratory analysis
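
The contrast between within-subgroup p-values and an interaction test can be illustrated with a simple z-test on the log scale. The sketch assumes two independent subgroups, ratio estimates (such as hazard ratios) with 95% CIs that are symmetric on the log scale, and standard errors recovered from those CIs; the estimates and function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """Standard error of a log ratio, recovered from its CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (log(upper) - log(lower)) / (2.0 * z)

def interaction_p(est_a, ci_a, est_b, ci_b) -> float:
    """Two-sided p-value for the difference between two log ratios."""
    diff = log(est_a) - log(est_b)
    se = sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    z = diff / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical subgroups: HR 0.70 (0.50 to 0.98) vs. HR 0.95 (0.70 to 1.29).
p_int = interaction_p(0.70, (0.50, 0.98), 0.95, (0.70, 1.29))
print(f"interaction p = {p_int:.2f}")
```

Here one subgroup's CI excludes 1 and the other's does not, yet the interaction p-value is well above 0.05, so the data do not support a claim that the effect differs between subgroups.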

Step 6 — Synthesize the Statistical Interpretation

Produce an integrated interpretation statement:

Primary result: State the treatment effect (estimate + CI), statistical significance relative to pre-specified alpha, and clinical meaningfulness relative to MCID
Robustness: Did sensitivity analyses (different populations, missing-data methods, alternative models) support the primary conclusion?
Safety balance: What is the NNT for benefit vs. NNH (number needed to harm) for key safety events?
Subgroup consistency: Is the treatment effect consistent across pre-specified subgroups, or are there signals of differential effect?
External consistency: Is the result consistent with prior studies in the same indication?
Limitations: State key statistical limitations (power, missing data, multiplicity, post-hoc nature of certain analyses)

Checkpoint B — Interpretation Review

P-values are correctly interpreted per ASA guidance (not as probability of null)
Confidence intervals are reported for all key effect estimates
Effect sizes are interpreted against clinically meaningful thresholds, not just statistical significance
Non-inferiority/equivalence interpretation uses correct CI boundary and margin
Multiplicity adjustment is verified and correctly applied
Subgroup analyses are assessed using interaction tests and credibility criteria
Missing-data impact on conclusions is explicitly addressed
Absolute and relative effect measures are both presented
Statistical limitations are clearly stated
Clinical context is integrated with statistical findings

Quality Audit

No p-values are described as "trends toward significance"
No baseline-characteristic comparisons use hypothesis tests (SMDs used instead)
Non-significant results are not described as "no difference" without power assessment
NNT/NNH calculations are based on absolute risk differences, not relative measures
Hazard-ratio interpretation includes verification of proportional-hazards assumption
Post-hoc analyses are explicitly labeled throughout
CONSORT or STROBE checklist items for statistical reporting are satisfied
All [VERIFY] flags have been resolved or escalated

Guidelines

Always report effect estimates with confidence intervals — p-values alone are insufficient for clinical decision-making
Statistical significance ≠ clinical significance — a p=0.001 result can be clinically meaningless if the effect size is trivial
Absence of evidence is not evidence of absence — non-significant results in underpowered studies are inconclusive, not negative
Use absolute measures (RD, NNT) alongside relative measures (RR, OR, HR) — relative measures alone can mislead when baseline risks differ
The pre-specified SAP is the contract — any deviation from it must be documented and the analysis labeled as post-hoc
Bayesian analyses require prior specification transparency — assess sensitivity to prior choice
For non-inferiority trials, a finding of non-inferiority does not imply equivalence or superiority — state the conclusion precisely
Forest plots for subgroup analyses must include the interaction p-value, not just within-subgroup p-values
Mark any statistical interpretation that requires clinical-domain judgment with [VERIFY] for clinical-team review
This skill produces statistical interpretation frameworks — final clinical and regulatory conclusions require integrated review by biostatistician, clinician, and regulatory strategist