---
name: stats-consult
description: Recommends the appropriate statistical test given a data description and research question, including assumption checks, alternatives if violated, sample-size guidance, and effect-size reporting. Use when the user asks "what test should I use", "is this t-test the right choice", "how many subjects do I need", "what's the right way to analyse this dataset", or describes a dataset + hypothesis without a chosen method.
short_desc: choose statistical test + assumptions + power analysis
keywords: ["t-test", ANOVA, "power analysis", "Mann-Whitney", "chi-squared", Wilcoxon, "what test should I use", "analyze this dataset", "sample size guidance"]
model: opus
effort: high
allowed-tools: Read, Write, WebSearch, Bash
---
# Stats Consult (Opus)
Purpose: Triage a research question + dataset description into a defensible statistical analysis plan. Recommend test family, list assumptions to verify, give a sample-size estimate, and call out the alternatives if assumptions fail.
Model: Opus 4.7, chosen for its strong reasoning when choosing between methods.
When to invoke autonomously:
- The user has data and a question but no chosen method.
- The user proposes a method and you suspect a better one (e.g. they wrote "t-test" on count data, or "ANOVA" on repeated measures).
- The user asks about power or sample size.
- The user describes "we did 30 tests and 4 were significant" without mentioning correction.
Do NOT invoke when:
- The user already has a working analysis and just wants help running a function (use `@coder`).
- The question is purely about a probability concept (Bayesian vs frequentist philosophy); answer in conversation.
## Usage
/stats-consult I have continuous fluorescence intensity from 3 cell lines, 4 dishes per line, 20 cells per dish. Do they differ?
/stats-consult Comparing pre/post treatment in 12 patients on a Likert pain scale. What test?
/stats-consult I screened 80 metabolites with t-tests, 9 came out p<0.05 — what now?
/stats-consult How many mice per group to detect a 25% reduction in tumour volume with 80% power?
## What This Skill Does
### 1. Triage Pipeline
- Identify outcome variable scale — continuous, count, binary, ordinal, time-to-event, compositional, or circular. Ask if unclear.
- Identify design — independent samples, paired/repeated, nested/hierarchical, or time series.
- Count groups and predictors.
- Pick the parametric default from the [[Hypothesis Testing Decision Tree]] table.
- List the assumptions the user must verify before reporting (residual plots, not raw-data normality).
- Name the non-parametric / robust alternative if assumptions fail.
- Flag multiple-comparison issues if there are >5 outcomes or >3 groups.
- Estimate sample size when asked or when the design suggests an underpowered study.
- Specify effect-size reporting in natural units, with 95% CI.
### 2. Power and Sample Size Guidance
Practical defaults the skill applies:
- For a two-group continuous comparison at α=0.05 and 80% power, the rule-of-thumb sample size is:
  - $n \approx 16 / d^2$ per group, where $d$ is Cohen's $d$ (standardised effect).
  - $d = 0.2$ (small) → ~400/group; $d = 0.5$ (medium) → ~64/group; $d = 0.8$ (large) → ~25/group.
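- For paired tests with within-subject correlation $\rho$: multiply the unpaired per-group $n$ by $1 - \rho$ to get the number of pairs, because a within-subject difference has variance $2\sigma^2(1 - \rho)$ → fewer subjects when measurements are positively correlated.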
- For proportions, use `statsmodels.stats.proportion.samplesize_proportions_2indep_onetail` or G*Power's $z$-test for two proportions; do not use the continuous-data rule.
- For survival: use Schoenfeld's formula for the log-rank test, which gives the required number of events rather than subjects: $E = (z_{\alpha/2} + z_{\beta})^2 / (p_1 p_2 (\log \mathrm{HR})^2)$, where $p_i$ are the allocation fractions. Divide $E$ by the expected overall event probability to get the number of subjects.
- For mixed-effects: simulation-based power (e.g. `simr` in R, `pymer4` or custom); closed-form is unreliable.
Always sanity-check the resulting $n$ against feasibility (animals available, recruitment rate, budget). If $n$ is impossible, recommend a more focused hypothesis or a within-subject design.
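The defaults above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the skill's implementation: all function names are hypothetical, the two-group calculation uses the normal approximation that the $16/d^2$ rule rounds up from, the paired adjustment assumes equal variances with within-subject correlation $\rho$, and Schoenfeld's formula is taken to return the required number of events.

```python
import math
from statistics import NormalDist


def _z(p: float) -> float:
    """Standard normal quantile for probability p."""
    return NormalDist().inv_cdf(p)


def n_rule_of_thumb(d: float) -> int:
    """Per-group n from the 16 / d^2 rule (two-sided alpha = 0.05, 80% power)."""
    return math.ceil(16 / d**2)


def n_normal_approx(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n from 2 * (z_{1-alpha/2} + z_{power})^2 / d^2."""
    return math.ceil(2 * (_z(1 - alpha / 2) + _z(power)) ** 2 / d**2)


def n_pairs(n_unpaired: int, rho: float) -> int:
    """Pairs needed for a paired design: unpaired per-group n times (1 - rho)."""
    return math.ceil(n_unpaired * (1 - rho))


def schoenfeld_events(hr: float, p1: float = 0.5,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Events needed for a log-rank test; p1 is the allocation fraction of arm 1."""
    p2 = 1 - p1
    numerator = (_z(1 - alpha / 2) + _z(power)) ** 2
    return math.ceil(numerator / (p1 * p2 * math.log(hr) ** 2))


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(d, n_rule_of_thumb(d), n_normal_approx(d))
    print("pairs at rho = 0.5:", n_pairs(n_rule_of_thumb(0.5), 0.5))
    print("events for HR = 0.5:", schoenfeld_events(0.5))
```

The normal approximation runs slightly below a $t$-based calculation at small $n$, so a dedicated power routine remains the better choice for a final protocol number.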
### 3. Multiple-Comparison Decisions
| Situation | Recommend |
|---|---|
| 2-10 planned comparisons, want strict control | Holm-Bonferroni (free upgrade over Bonferroni) |
| 10-1000 outcomes, accept some false positives | Benjamini-Hochberg FDR (standard in genomics) |
| Many exchangeable tests (gene expression, GWAS) | Empirical Bayes or hierarchical shrinkage (limma; brms); borrows strength across tests, and limma's moderated p-values are usually still BH-adjusted |
| Pairwise group comparisons after omnibus ANOVA | Tukey HSD (parametric) or Dunn (after Kruskal-Wallis) |
| Repeated post-hoc on the same data with adaptive choices | Permutation maxT or simulation — only honest answer |
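
To make the first two rows concrete, here is a minimal standard-library sketch of Holm-Bonferroni and Benjamini-Hochberg adjusted p-values. The function names and the example p-values are hypothetical; in practice a maintained implementation such as `statsmodels.stats.multitest.multipletests` (with `method="holm"` or `method="fdr_bh"`) is the safer choice.

```python
def holm_adjust(pvalues: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def bh_adjust(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg step-up adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank.
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        candidate = min(1.0, pvalues[idx] * m / (rank + 1))
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
    print("Holm:", holm_adjust(example))
    print("BH:  ", bh_adjust(example))
```

Holm controls the family-wise error rate and is never less powerful than plain Bonferroni; BH controls the expected proportion of false discoveries, which is why its adjusted values are never larger than Holm's.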
### 4. Assumption-Check Checklist (Always Returned)
For the recommended test, the skill returns the specific diagnostics:
- Continuous outcome: Q-Q plot of residuals (not raw data); residuals-vs-fitted for heteroscedasticity; Cook's distance for influential points.
- Count outcome: dispersion test (variance/mean ≈ 1 for Poisson, otherwise negative binomial); zero-inflation visual.
- Binary / logistic: Hosmer-Lemeshow goodness-of-fit (unreliable for large $n$, where it flags trivial misfit; prefer a calibration plot); separation check.
- Survival / Cox: Schoenfeld residuals against time for proportional-hazards check; log-log plot.
- Mixed models: Q-Q plot of conditional residuals and of random-effect BLUPs; intraclass correlation should be positive and large enough to matter.
- Linear regression: VIF for multicollinearity (>10 flag); partial residual plots for linearity.
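
Two of these diagnostics are simple enough to sketch with the standard library. The helpers below are hypothetical names and a simplification: `group_residuals` covers only a one-way design (value minus its own group mean), and `dispersion_ratio` is the raw variance-to-mean ratio, which matches the Poisson check only for an intercept-only model. With covariates, the Pearson chi-squared statistic divided by the residual degrees of freedom would be the analogous quantity.

```python
from statistics import mean, variance


def group_residuals(groups: dict[str, list[float]]) -> list[float]:
    """Residuals from a one-way model: each value minus its own group mean."""
    residuals = []
    for values in groups.values():
        centre = mean(values)
        residuals.extend(x - centre for x in values)
    return residuals


def dispersion_ratio(counts: list[int]) -> float:
    """Sample variance over mean; close to 1 is consistent with Poisson."""
    return variance(counts) / mean(counts)


if __name__ == "__main__":
    # Hypothetical data: raw values are bimodal across groups,
    # but the residuals fed to the Q-Q plot are centred on zero.
    data = {"A": [1.0, 1.2, 0.8], "B": [5.0, 5.3, 4.7]}
    print(group_residuals(data))

    # Hypothetical counts with a ratio well above 1, pointing to negative binomial.
    print(dispersion_ratio([0, 0, 1, 2, 10, 0, 3, 12]))
```

Feeding the residuals, not the pooled raw values, into the Q-Q plot is what keeps a real group difference from being misread as non-normality.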
### 5. When to Switch to Bayesian
Recommend Bayesian (pymc, numpyro, brms, Stan) when:
- Prior information from previous studies materially improves precision.
- Sequential analysis or interim looks are planned.
- Direct probability statements about the parameter are needed for decision-making.
- Sample size is tiny and the prior provides useful regularisation.
- Hierarchical / partial-pooling structure is natural.
Recommend frequentist when:
- A peer-reviewed convention exists in the field (e.g. clinical trial primary analysis) and reviewers will demand it.
- Prior elicitation would be contentious and the analysis must be "objective".
- Time and computation are constrained — frequentist methods are usually orders of magnitude faster.
## Output Format
The skill returns a structured plan:
## Statistical Analysis Plan
**Question**: [restate user question in 1 sentence]
**Data shape**:
- Outcome: [e.g. continuous, fluorescence intensity in arbitrary units]
- Design: [e.g. nested — 20 cells in 4 dishes in 3 cell lines]
- Experimental unit: [e.g. dish — biological replicate]
- Effective n per group: [e.g. 4]
**Recommended test**: [name, e.g. linear mixed model with random intercept for dish, fixed effect for cell line]
**Why**: [1-3 sentence rationale referencing the design]
**Implementation**:
```python
# concrete code in the user's stack (Python statsmodels / R lme4 / etc.)
```
**Assumptions to verify**:
- [diagnostic 1 with how to compute]
- [diagnostic 2]
- ...
**If assumptions fail**: [specific fallback, e.g. log-transform outcome, or switch to GAM, or bootstrap CIs on the mixed model]
**Multiple-comparison handling**: [explicit method if applicable]
**Effect size to report**: [e.g. cell-line contrasts in raw units with 95% CIs from `confint()`; Cohen's d as supplement]
**Sample size check**: [is the study adequately powered? if no, what would be needed?]
**Reporting template**:
"[Outcome] differed across [groups] (mixed model, F(df1, df2) = X, p = Y). Tukey-adjusted contrast between group A and B: estimate = Z [95% CI low, high], p = W."
**Pitfalls flagged**: [e.g. "treating cells as the unit instead of dishes inflates n 20-fold → severe Type I error inflation"]
## Decision Tree
Did the user describe an outcome variable?
├─ NO → Ask what they measured. The scale dictates everything.
└─ YES → Continue.

Is the design clearly identified (paired? nested? time series?)
├─ NO → Ask. Wrong assumed independence is the most common error.
└─ YES → Continue.

Apply the decision tree from knowledge/concepts/hypothesis-testing-decision-tree.md

Did the user mention "we tested N things"?
├─ YES → Multiple-comparison handling is mandatory.
└─ NO → Check the design for hidden multiplicity (multiple time points, multiple genes/metabolites).

Is the sample size obviously small for the claimed effect?
├─ YES → Run a power calculation; state the minimum detectable effect.
└─ NO → Note effect-size reporting requirements.
## Hard Rules
1. **Welch's t-test is the default two-sample t-test** — not Student's. Robust to unequal variance with negligible cost when variances are equal.
2. **Test residual normality, not raw-data normality.** The assumption is on residuals.
3. **Never recommend a one-tailed test** unless the user has explicitly preregistered a directional hypothesis. Default to two-sided.
4. **Never recommend Pearson correlation** without first eyeballing the scatter — use Spearman if the relationship is monotonic but non-linear.
5. **Always report effect size with CI.** P-values without effect sizes are uninformative.
6. **For "no difference" claims, recommend equivalence testing** (TOST in `statsmodels.stats.weightstats.ttost_ind` or `parameters::equivalence_test` in R) against a pre-specified SESOI, not just "p > 0.05".
7. **Flag pseudoreplication aggressively.** Multiple cells from one animal, technical replicates, repeated injections — they are not independent biological replicates.
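
Rules 1 and 7 combine naturally: collapse measurements to one value per experimental unit, then compare the unit-level values with Welch's test. The sketch below uses only the standard library and hypothetical helper names; it returns the Welch statistic and the Welch-Satterthwaite degrees of freedom, and leaves the p-value to a maintained routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)`. Averaging per unit is the simplest remedy, assumed here for illustration; a mixed model uses the same data more efficiently.

```python
import math
from statistics import mean, variance


def unit_means(measurements_by_unit: dict[str, list[float]]) -> list[float]:
    """One value per experimental unit (e.g. per dish), averaging its replicates."""
    return [mean(values) for values in measurements_by_unit.values()]


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical data: cells nested in dishes, two cell lines.
    line_a = {"dish1": [1.0, 3.0], "dish2": [2.0, 4.0], "dish3": [3.0, 5.0]}
    line_b = {"dish1": [4.0, 6.0], "dish2": [6.0, 8.0], "dish3": [5.0, 9.0]}
    print(welch_t(unit_means(line_a), unit_means(line_b)))
```

The test then runs on three values per line, not six, which is the honest sample size for this design.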
## Integration with Knowledge Graph
The skill leans on these KG nodes:
- [[Hypothesis Testing Decision Tree]] for the test-selection logic.
- [[Common Statistical Pitfalls]] for the failure modes to actively guard against.
- [[Bayesian Inference]] for when to switch paradigms.
- [[Scientific Python Stack 2026]] for tooling recommendations.
After the consultation, if you produced novel guidance worth keeping, save it as a per-project KG node under `knowledge/concepts/stats-<topic>.md`.
## Success Criteria
- Test recommendation matches the data shape AND the design.
- Assumption diagnostics named with how to compute them.
- A robust fallback is named in case assumptions fail.
- Multiple-comparison strategy is appropriate for the number of tests.
- Effect-size reporting is in natural units with CIs.
- Sample-size guidance, when relevant, includes the minimum detectable effect.
- The plan is reproducible: anyone with the data could run it and get the same answer.