result-interpretation - SKILL.md Agent Skill

name: result-interpretation description: Interpret statistical results and decide what to do next category: workflow

Result Interpretation

When to Use This Skill

After running a statistical test
When deciding if a hypothesis is supported or rejected
When planning next steps based on results

Types of Results

1. Positive Finding (Hypothesis Supported)

Criteria:

p-value < significance threshold (typically 0.05)
Effect size is meaningful (not just statistically significant)
Result makes biological sense

What to Do:

Calculate effect size - Don't rely on p-values alone
- Cohen's d for t-tests
- η² (eta-squared) for ANOVA
- Correlation coefficient for associations

Record the finding

update_knowledge_state(
    title="Clear, descriptive title",
    evidence="Statistical details: p-value, effect size, confidence interval",
    interpretation="Biological meaning"
)

Search literature for validation
- Does this align with known biology?
- Are there papers supporting this mechanism?
- What enzymes/pathways are involved?
Generate follow-up hypotheses
- What explains this finding mechanistically?
- What are the downstream consequences?
- What conditions would reverse this effect?

Example:

Result: CDP-Choline Synthesis Index 35.4% higher in hypothermia (p=0.042, η²=0.29)

Interpretation:
- Statistically significant (p<0.05) ✓
- Large effect size (η²=0.29 is substantial) ✓
- Suggests Pcyt1 enzyme bottleneck

Action: Record finding, search "Pcyt1 regulation hypothermia"

2. Negative Finding (Hypothesis Rejected)

This is NOT a failure! Negative results are scientifically valuable.

Criteria:

p-value > significance threshold
OR effect size is trivial even if p<0.05

What to Do:

Document what was ruled out
- Update hypothesis status to "rejected"
- Note the p-value and confidence interval
- Record why this hypothesis seemed plausible
Extract insights from the failure
- What does this non-result tell us?
- What alternative explanations remain?
- Did we learn about the data structure?
Generate alternative hypotheses
- If flux wasn't increased, maybe it's a bottleneck?
- If it's not upstream, maybe it's downstream?
- If it's not pathway A, maybe pathway B?

Example:

Result: Salvage Flux Proxy Index - no difference (F=2.287, p=0.138)

Interpretation:
- Hypothesis rejected (p>0.05)
- But we learned: salvage flux is NOT the explanation
- This rules out 1 of 3 candidate mechanisms

Action: Generate alternative - test for enzymatic bottleneck instead

3. Borderline Result (p=0.06-0.10)

Don't p-hack! But also don't ignore suggestive trends.

What to Do:

Report honestly - "Suggestive but not significant"
Check effect size - Is it meaningful even if not "significant"?
Consider:
- Sample size - underpowered study?
- Measurement noise - need better proxy?
- Confounders - should we stratify?
Don't chase marginal p-values
- Don't add/remove outliers to get p<0.05
- Don't try 10 different tests until one "works"
- Move on to more promising hypotheses

4. Unexpected Result

When results surprise you:

Example: Expected positive correlation, got negative

Check for errors first
- Code bugs?
- Data quality issues?
- Mislabeled variables?
If result is real, this is interesting!
- Unexpected results are often the best discoveries
- Ask: Why would this happen?
- Search literature for mechanisms
Design targeted follow-up
- Can we replicate in subgroups?
- Is there a confounding variable?
- What would explain the reversal?

Interpreting Effect Sizes

Cohen's d (for t-tests):

d = 0.2: Small effect
d = 0.5: Medium effect
d = 0.8: Large effect
d > 1.2: Very large effect

η² (for ANOVA):

η² = 0.01: Small effect
η² = 0.06: Medium effect
η² = 0.14: Large effect

Correlation (r):

|r| = 0.1-0.3: Weak
|r| = 0.3-0.5: Moderate
|r| > 0.5: Strong

Remember: Large effect size with p=0.06 may be more meaningful than tiny effect with p=0.001!

Common Interpretation Mistakes

❌ "p>0.05 means there's no effect"

Wrong! It means we can't rule out chance
Check effect size and confidence intervals

❌ "p<0.05 means it's biologically important"

Wrong! Tiny effects can be "significant" with large N
Always report effect size

❌ "This correlation proves causation"

Wrong! Could be confounded, reverse causation, or coincidence
Use mechanistic reasoning and literature

❌ "Negative results are failures"

Wrong! They're scientifically valuable
They constrain hypotheses and guide investigation

Guarding Against False Discoveries

Multiple Testing Correction

When testing many hypotheses (e.g., thousands of genes), raw p-values are misleading.

Rule: If you ran more than ~20 statistical tests, apply FDR correction before claiming significance.

from statsmodels.stats.multitest import multipletests

# Collect all p-values from your tests
p_values = [p1, p2, p3, ...]

# Apply Benjamini-Hochberg FDR correction
rejected, p_adjusted, _, _ = multipletests(p_values, method='fdr_bh')

# Use p_adjusted < 0.05, not raw p-values

Report: "Of 5,000 genes tested, 127 were significant after FDR correction (adjusted p < 0.05)"

Skeptical Checkpoints

Before recording any finding, ask:

What would a skeptical reviewer say? Identify the weakest part of your evidence.
Did I look for reasons this could fail? Consider trying to disprove your finding before confirming it.
Did I verify data integrity? Do positive/negative controls behave as expected? Are quality metrics reasonable?

Decision Tree: What to Do Next

Statistical test complete
    ├─> p<0.05 AND meaningful effect size
    │   └─> RECORD FINDING, search literature, generate follow-ups
    │
    ├─> p>0.05 (not significant)
    │   └─> HYPOTHESIS REJECTED, generate alternatives, test next hypothesis
    │
    ├─> p=0.05-0.10 (borderline)
    │   └─> NOTE AS SUGGESTIVE, check effect size, move to next hypothesis
    │
    └─> Unexpected/surprising result
        └─> VERIFY (check code), search literature, design targeted follow-up

Example: Full Interpretation

Test: Compare nucleotide salvage precursors (Adenine, Cytidine) across groups

Result:

# Adenine: t=-2.45, p=0.028, d=1.32
# Cytidine: t=-2.87, p=0.013, d=1.54

Interpretation:

Both significantly depleted (p<0.05) ✓
Large effect sizes (d>0.8) ✓
Consistent pattern (both precursors down) ✓

Biological meaning:

Precursors depleted → suggests active consumption
Could indicate increased salvage flux
Or could indicate depleted substrate pool

Next steps:

Record finding: "Salvage precursors depleted in hypothermia"
Search literature: "nucleotide salvage regulation"
Generate hypothesis: "Is salvage flux increased?" (test with product/precursor ratio)

Key Principle

Every result - positive, negative, or unexpected - should generate insight and inform the next step.

Don't just collect p-values. Think about mechanisms.