---
name: pm-experimentation
description: "Design and run A/B tests and experiments: hypothesis design, sample sizing, statistical significance, and common pitfalls. De-risk product decisions with data."
user-invocable: true
argument-hint: "[feature to test or experimentation question]"
---
Experimentation & A/B Testing
Help the user design rigorous experiments, calculate sample sizes, avoid common pitfalls, and make data-driven product decisions.
When to Use
- Testing whether a product change improves key metrics
- Deciding between two or more design options
- Validating a hypothesis before full investment
- Building an experimentation culture and process
- Interpreting experiment results and making ship decisions
Step 1: Write a Strong Hypothesis
Hypothesis Template
If we [specific change],
then [metric] will [increase/decrease] by [estimated magnitude],
because [rationale based on user insight or framework].
Examples
WEAK: "Changing the button color will improve conversions."
STRONG: "If we change the CTA from 'Sign Up' to 'Start Free Trial',
then signup rate will increase by 10-15%,
because users expressed hesitation about commitment in interviews,
and 'Free Trial' reduces perceived risk (Loss Aversion; Ability in the B=MAT behavior model)."
Every hypothesis needs:
- A specific, measurable change
- A specific metric to measure
- An estimated direction and magnitude
- A rationale grounded in user insight, not just intuition
Step 2: Choose Experiment Type
| Type | What It Is | Best For | Complexity |
|---|---|---|---|
| A/B test | Two variants, random split | Clear single-variable tests | Low |
| A/B/n test | Multiple variants | Comparing 3-4 options | Medium |
| Multivariate | Multiple variables simultaneously | Optimizing combinations | High |
| Bandit | Dynamic allocation to winning variant | Minimizing regret during test | Medium |
| Switchback | Alternating treatments over time | Marketplace/supply-demand tests | High |
| Quasi-experiment | Non-random assignment (geo, cohort) | When randomization isn't possible | Medium |
Default to A/B. Only use more complex designs when you have a specific reason.
Step 3: Calculate Sample Size
Key Variables
- Baseline conversion rate: [current metric value, e.g., 5%]
- Minimum Detectable Effect (MDE): [smallest change worth detecting, e.g., 10% relative, which moves a 5% baseline to 5.5%]
- Significance level (alpha): 0.05, two-sided (standard)
- Statistical power (1-beta): 0.80 (standard)
Quick Reference Table
Approximate values for a two-sided two-proportion z-test at alpha = 0.05 and power = 0.80:
| Baseline Rate | MDE (Relative) | Sample Size per Variant |
|---|---|---|
| 2% | 10% | ~81,000 |
| 5% | 10% | ~31,000 |
| 10% | 10% | ~15,000 |
| 20% | 10% | ~6,500 |
| 5% | 20% | ~8,200 |
| 10% | 20% | ~3,800 |
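Sample sizes for other baselines and effect sizes can be estimated with a short script. The sketch below assumes a two-sided two-proportion z-test with the normal approximation and unpooled variance; `sample_size_per_variant` is a hypothetical helper name, and calculators that use other formulas or sequential methods will report different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate the variant must reach
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # unpooled variance (assumption)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    for baseline, mde in [(0.02, 0.10), (0.05, 0.10), (0.10, 0.20)]:
        n = sample_size_per_variant(baseline, mde)
        print(f"baseline={baseline:.0%} mde={mde:.0%} -> {n:,} per variant")
```

Halving the MDE roughly quadruples the required sample, which is why choosing the smallest change worth detecting matters more than any other input.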
Duration Estimate
Test duration (days) = (Sample size per variant x Number of variants) / Daily traffic to test area, where a standard A/B test has 2 variants
Rule of thumb: Run for at least 1 full business cycle (typically 1-2 weeks) to capture weekday/weekend effects, even if you hit sample size earlier.
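The duration formula and the rule of thumb can be combined in one small function. This is a minimal sketch: `estimate_duration_days` and its parameters are hypothetical names, and the 7-day floor is an assumption standing in for "one full business cycle".

```python
from math import ceil


def estimate_duration_days(sample_per_variant, daily_traffic, variants=2, min_days=7):
    """Days needed to reach the sample size, never shorter than min_days."""
    if daily_traffic <= 0:
        raise ValueError("daily_traffic must be positive")
    days_for_sample = ceil(sample_per_variant * variants / daily_traffic)
    # Enforce the business-cycle floor even if the sample fills up sooner
    return max(days_for_sample, min_days)


if __name__ == "__main__":
    print(estimate_duration_days(30_000, daily_traffic=2_000))
    print(estimate_duration_days(30_000, daily_traffic=10_000))
```

If the estimate runs to several months, consider a larger MDE or a higher-traffic surface before committing to the test.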
Step 4: Set Metrics
Primary Metric
The single metric your hypothesis predicts will change. This is your decision metric.
Guardrail Metrics
Metrics that should NOT degrade. If they do, the change has unintended consequences.
Experiment: Simplify signup flow (remove email verification)
Primary metric: Signup completion rate
Guardrail metrics:
- Spam account rate (should not increase)
- D7 retention (should not decrease — removing verification
might attract low-intent users)
- Support tickets for account issues (should not increase)
Counter-Metric
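A metric that would expose the primary metric being gamed or improved at the expense of real value. For example, pair signup completion rate with activation rate, so that a lift driven by low-intent signups does not count as a win.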
Step 5: Run the Experiment
Pre-Launch Checklist
- [ ] Hypothesis documented
- [ ] Sample size calculated and traffic confirmed
- [ ] Duration set (at least one full business cycle, typically 1-2 weeks)
- [ ] Primary metric, guardrails, and counter-metrics defined
- [ ] Randomization unit chosen (user-level, session-level, device-level)
- [ ] No other experiments running on the same population
- [ ] Tracking and logging verified in staging
- [ ] Rollback plan in place
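One common way to implement the randomization unit from the checklist is deterministic hashing, so the same unit always sees the same variant. The sketch below assumes user-level assignment and an even split; `assign_variant` and the example experiment name are hypothetical, not part of any specific experimentation platform.

```python
import hashlib


def assign_variant(unit_id, experiment, variants=("control", "treatment")):
    """Deterministically map a randomization unit to one variant."""
    # Salting with the experiment name keeps assignments independent across tests
    key = f"{experiment}:{unit_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    for user in ["user-1", "user-2", "user-3"]:
        print(user, assign_variant(user, "signup-cta-copy"))
```

Hashing on a user ID rather than a session ID keeps a returning user in the same variant, which matters for retention-style guardrail metrics.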
During the Experiment
- Do NOT peek at results and stop early. This inflates false positive rates.
- Monitor guardrail metrics for catastrophic regressions
- Log any external events that could confound results (outage, press, competitor launch)
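The cost of peeking can be illustrated with a simulation of A/A tests, where there is no true effect and every "significant" result is a false positive. This sketch assumes equal traffic each day and a normally distributed test statistic; `simulate_false_positive_rates` is a hypothetical name and the exact rates depend on the number of looks.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_false_positive_rates(days=14, simulations=5000, alpha=0.05, seed=0):
    """Compare daily peeking with a single final look on simulated A/A tests."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(simulations):
        cumulative = 0.0
        crossed_any_day = False
        for day in range(1, days + 1):
            # Under the null, each day's data adds one standard-normal increment
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / sqrt(day)  # z-statistic on all data so far
            if abs(z) > z_crit:
                crossed_any_day = True  # a daily peeker would stop and ship here
        peeking_hits += crossed_any_day
        fixed_hits += abs(z) > z_crit  # only the final-day look counts
    return peeking_hits / simulations, fixed_hits / simulations


if __name__ == "__main__":
    peeking, fixed = simulate_false_positive_rates()
    print(f"stop at first p < 0.05: {peeking:.1%} false positives")
    print(f"single look at the end: {fixed:.1%} false positives")
```

If early stopping is a hard requirement, use a sequential testing method designed for repeated looks instead of the fixed-horizon test sketched here.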
Step 6: Analyze Results
Decision Framework
Primary metric is statistically significant AND positive?
├── YES → Check guardrail metrics
│ ├── All guardrails hold → SHIP IT
│ └── Guardrail degraded → Investigate. Can you fix the guardrail
│ issue while keeping the primary gain?
│ ├── Yes → Iterate and re-test
│ └── No → Don't ship
└── NO (not significant or negative)
├── Negative and significant → STOP. Diagnose why.
└── Not significant → Underpowered? Plan a new, larger test rather than
extending this one ad hoc. Or the effect is too small to matter: accept and move on.
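The decision tree above can be encoded as a small function so that ship decisions are applied consistently. This is an illustrative sketch: `ship_decision`, its parameters, and the returned labels are hypothetical, and it assumes a higher primary metric is better.

```python
def ship_decision(primary_significant, primary_delta, guardrails_hold,
                  guardrail_fixable=False):
    """Return the decision-tree outcome for one experiment."""
    if primary_significant and primary_delta > 0:
        if guardrails_hold:
            return "ship"
        # A guardrail degraded: only iterate if the regression looks fixable
        return "iterate and re-test" if guardrail_fixable else "don't ship"
    if primary_significant and primary_delta < 0:
        return "stop and diagnose"
    return "inconclusive"  # underpowered, or the effect is too small to matter


if __name__ == "__main__":
    print(ship_decision(True, 0.12, guardrails_hold=True))
    print(ship_decision(True, 0.12, guardrails_hold=False, guardrail_fixable=True))
    print(ship_decision(False, 0.01, guardrails_hold=True))
```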
Reporting Template
## Experiment Report: [Name]
- **Hypothesis:** [If/then/because]
- **Duration:** [Start - End, N days]
- **Sample:** [N per variant]
- **Result:** [Significant/Not Significant]
| Metric | Control | Variant | Delta | P-value | Significant? |
|--------|---------|---------|-------|---------|-------------|
| Primary: [metric] | [value] | [value] | [+/-X%] | [p] | [Yes/No] |
| Guardrail: [metric] | [value] | [value] | [+/-X%] | [p] | [Yes/No] |
### Decision: [Ship / Don't Ship / Iterate]
### Rationale: [Why]
### Learning: [What we learned regardless of outcome]
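The Delta and P-value columns of the report can be filled in from raw counts. The sketch below assumes a conversion-rate metric and a pooled two-proportion z-test with a two-sided p-value; `two_proportion_z_test` is a hypothetical helper and the counts in the example are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_control, n_control, conv_variant, n_variant):
    """Return rates, relative delta, and two-sided p-value for two variants."""
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return {
        "control": p_c,
        "variant": p_v,
        "relative_delta": (p_v - p_c) / p_c,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Illustrative counts only, not real experiment data
    result = two_proportion_z_test(500, 10_000, 560, 10_000)
    print({name: round(value, 4) for name, value in result.items()})
```

Metrics that are not simple conversion rates, such as revenue per user, need a different test; this helper does not cover them.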
When NOT to A/B Test
| Situation | Why | Alternative |
|---|---|---|
| Very low traffic (<1,000/week) | Can't reach significance in reasonable time | User interviews, usability tests |
| One-way door decisions | Can't easily undo | Qualitative research, prototypes |
| Ethical concerns | Exposes one group to potential harm | Expert review, staged rollout |
| Obvious improvements | Bug fix, broken UI, accessibility | Just ship it |
| Infrastructure changes | No user-facing metric to test | Technical benchmarks |
Common Mistakes
- Peeking and stopping early: Checking results daily and stopping as soon as p < 0.05 can inflate the false positive rate from the nominal 5% to 20-30%. Commit to the full duration.
- Underpowered tests: Running a test with 500 users when you need 50,000 proves nothing. Calculate sample size BEFORE starting.
- Multiple comparisons: Testing 10 metrics and celebrating the one that's significant. With 10 independent metrics at alpha = 0.05, the chance of at least one false positive is about 40%. Pre-register your primary metric, and apply a correction such as Bonferroni when you evaluate several.
- Novelty effect: Users often engage with anything new. Run tests for 2+ weeks and watch for decay in the variant's advantage.
- Simpson's Paradox: A test can be positive overall but negative in every segment (or vice versa), typically when the traffic split differs across segments, for example after changing allocation mid-test. Always segment results by key dimensions.
- No learning repository: If experiment results live in Slack threads, you'll re-run the same tests. Maintain a searchable experiment log.
- Only testing small changes: Button colors and copy tweaks have ceilings. The biggest wins come from testing fundamentally different approaches.
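For the multiple comparisons pitfall, the Bonferroni correction is simple enough to apply by hand or in a few lines. This is a minimal sketch: `bonferroni_significant` is a hypothetical helper and the metric names and p-values are invented for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each metric as significant at the Bonferroni-adjusted threshold."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)  # split alpha evenly across all tests
    return {metric: p <= threshold for metric, p in p_values.items()}


if __name__ == "__main__":
    # Illustrative p-values only
    observed = {"signup_rate": 0.004, "d7_retention": 0.03, "support_tickets": 0.20}
    print(bonferroni_significant(observed))
```

Bonferroni is conservative when metrics are correlated, so it works best as a safeguard for secondary metrics while the pre-registered primary metric drives the decision.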