pm-experimentation

Design and run A/B tests and experiments: hypothesis design, sample sizing, statistical significance, and common pitfalls. De-risk product decisions with data.

By akhil08agrawal. Updated 2/19/2026

name: pm-experimentation
description: "Design and run A/B tests and experiments: hypothesis design, sample sizing, statistical significance, and common pitfalls. De-risk product decisions with data."
user-invocable: true
argument-hint: "[feature to test or experimentation question]"

Experimentation & A/B Testing

Help the user design rigorous experiments, calculate sample sizes, avoid common pitfalls, and make data-driven product decisions.

When to Use

  • Testing whether a product change improves key metrics
  • Deciding between two or more design options
  • Validating a hypothesis before full investment
  • Building an experimentation culture and process
  • Interpreting experiment results and making ship decisions

Step 1: Write a Strong Hypothesis

Hypothesis Template

If we [specific change],
then [metric] will [increase/decrease] by [estimated magnitude],
because [rationale based on user insight or framework].

Examples

WEAK: "Changing the button color will improve conversions."
STRONG: "If we change the CTA from 'Sign Up' to 'Start Free Trial',
then signup rate will increase by 10-15% (relative),
because users expressed hesitation about commitment in interviews,
and 'Free Trial' reduces perceived risk (loss aversion; Ability in the Fogg Behavior Model, B=MAT)."

Every hypothesis needs:

  • A specific, measurable change
  • A specific metric to measure
  • An estimated direction and magnitude
  • A rationale grounded in user insight, not just intuition

Step 2: Choose Experiment Type

| Type | What It Is | Best For | Complexity |
|------|------------|----------|------------|
| A/B test | Two variants, random split | Clear single-variable tests | Low |
| A/B/n test | Multiple variants | Comparing 3-4 options | Medium |
| Multivariate | Multiple variables simultaneously | Optimizing combinations | High |
| Bandit | Dynamic allocation to winning variant | Minimizing regret during test | Medium |
| Switchback | Alternating treatments over time | Marketplace/supply-demand tests | High |
| Quasi-experiment | Non-random assignment (geo, cohort) | When randomization isn't possible | Medium |

Default to A/B. Only use more complex designs when you have a specific reason.

Step 3: Calculate Sample Size

Key Variables

- Baseline conversion rate: [current metric value, e.g., 5%]
- Minimum Detectable Effect (MDE): [smallest change worth detecting, e.g., 10% relative]
- Significance level (alpha): 0.05 (standard, two-sided)
- Statistical power (1-beta): 0.80 (standard)

Quick Reference Table

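| Baseline Rate | MDE (Relative) | Sample Size per Variant |
|---------------|----------------|-------------------------|
| 2% | 10% | ~81,000 |
| 5% | 10% | ~31,000 |
| 10% | 10% | ~15,000 |
| 20% | 10% | ~6,500 |
| 5% | 20% | ~8,200 |
| 10% | 20% | ~3,800 |

Approximate values for a two-sided test at alpha = 0.05 and power = 0.80, using the normal approximation for two proportions. Higher power or a stricter alpha increases these numbers substantially.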
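The per-variant sample size can be approximated directly from the four key variables. The sketch below assumes a two-sided two-proportion z-test with the unpooled normal approximation. The function name `sample_size_per_variant` is illustrative and not part of any library. Other calculators use slightly different formulas, so expect their results to differ by a few percent.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate the test should be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # unpooled variance of the difference
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Example: 5% baseline, 10% relative MDE, default alpha and power.
n = sample_size_per_variant(0.05, 0.10)
print(f"{n:,} users per variant")
```

Halving the MDE roughly quadruples the required sample, which is why choosing the smallest effect worth detecting matters more than any other input.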

Duration Estimate

Test duration (days) = (Sample size per variant x Number of variants) / Daily traffic to test area
For a standard A/B test, the number of variants is 2.

Rule of thumb: Run for at least 1 full business cycle (typically 1-2 weeks) to capture weekday/weekend effects, even if you hit sample size earlier.
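The duration formula and the business-cycle rule of thumb can be combined in a few lines. This is a minimal sketch. Rounding up to whole weeks is an assumption made here so that every weekday is equally represented. The function name and the traffic figure are hypothetical.

```python
from math import ceil


def estimate_duration_days(sample_size_per_variant, daily_traffic, variants=2, min_days=7):
    """Days needed to reach the sample size, rounded up to whole weeks."""
    raw_days = ceil(sample_size_per_variant * variants / daily_traffic)
    # Never run shorter than one full cycle, even if the sample fills up sooner.
    days = max(raw_days, min_days)
    return ceil(days / 7) * 7


# Example: 31,000 users per variant, 5,000 eligible users per day (hypothetical).
print(estimate_duration_days(31_000, 5_000))  # 12.4 raw days -> 14
```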

Step 4: Set Metrics

Primary Metric

The single metric your hypothesis predicts will change. This is your decision metric.

Guardrail Metrics

Metrics that should NOT degrade. If they do, the change has unintended consequences.

Experiment: Simplify signup flow (remove email verification)
Primary metric: Signup completion rate
Guardrail metrics:
  - Spam account rate (should not increase)
  - D7 retention (should not decrease — removing verification
    might attract low-intent users)
  - Support tickets for account issues (should not increase)
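One way to make guardrails explicit is to record, for each metric, which direction counts as a regression. The sketch below is illustrative only. The `Guardrail` class, its fields, and the sample values are hypothetical, and the check compares raw point estimates; a real analysis would also test whether the change is statistically distinguishable from noise.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    bad_direction: str  # "increase" or "decrease"

    def degraded(self, control, variant, tolerance=0.0):
        """True if the variant moved in the bad direction by more than `tolerance`."""
        change = variant - control
        if self.bad_direction == "increase":
            return change > tolerance
        return -change > tolerance


guardrails = [
    Guardrail("Spam account rate", "increase"),
    Guardrail("D7 retention", "decrease"),
    Guardrail("Support tickets for account issues", "increase"),
]

# Hypothetical observed values as (control, variant) pairs.
observed = {
    "Spam account rate": (0.010, 0.015),
    "D7 retention": (0.40, 0.41),
    "Support tickets for account issues": (120, 118),
}

for g in guardrails:
    control, variant = observed[g.name]
    print(g.name, "DEGRADED" if g.degraded(control, variant) else "ok")
```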

Counter-Metric

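A metric that balances the primary metric against potential gaming. For example, if the primary metric is click-through rate, pair it with downstream conversion so that a variant cannot win on clicks alone.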

Step 5: Run the Experiment

Pre-Launch Checklist

- [ ] Hypothesis documented
- [ ] Sample size calculated and traffic confirmed
- [ ] Duration set (minimum 1-2 weeks)
- [ ] Primary metric, guardrails, and counter-metrics defined
- [ ] Randomization unit chosen (user-level, session-level, device-level)
- [ ] No other experiments running on the same population
- [ ] Tracking and logging verified in staging
- [ ] Rollback plan in place
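For the randomization-unit item in the checklist, a common approach to user-level assignment is deterministic hashing, so that the same user always sees the same variant without storing any state. This is a sketch of that general technique, not something this guide prescribes. The function name, experiment key, and user IDs are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for a given experiment."""
    # Including the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# The same user always lands in the same variant.
print(assign_variant("user-42", "signup-cta-copy"))
print(assign_variant("user-42", "signup-cta-copy"))
```

Switching to session-level or device-level randomization only changes which identifier is hashed.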

During the Experiment

  • Do NOT peek at results and stop early. This inflates false positive rates.
  • Monitor guardrail metrics for catastrophic regressions
  • Log any external events that could confound results (outage, press, competitor launch)
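The cost of peeking can be seen in a small A/A simulation, where both arms share the same true conversion rate and every "significant" result is therefore a false positive. The sketch below is illustrative. The number of looks, users per look, and the 10% rate are arbitrary assumptions, and exact rates depend on the random seed.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs=500, looks=10, users_per_look=200, rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at fixed horizon)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            # A peeker would stop and declare a winner at the first such look.
            if p_value(conv_a, n, conv_b, n) < alpha:
                ever_significant = True
        peeking_hits += ever_significant
        fixed_hits += p_value(conv_a, n, conv_b, n) < alpha  # single look at the end
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"peeking: {peeking_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate is typically several times higher.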

Step 6: Analyze Results

Decision Framework

Primary metric is statistically significant AND positive?
├── YES → Check guardrail metrics
│   ├── All guardrails hold → SHIP IT
│   └── Guardrail degraded → Investigate. Can you fix the guardrail
│       issue while keeping the primary gain?
│       ├── Yes → Iterate and re-test
│       └── No → Don't ship
└── NO (not significant or negative)
    ├── Negative and significant → STOP. Diagnose why.
    └── Not significant → Underpowered? Run longer. Or the effect
        is too small to matter — accept and move on.
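The decision tree above can be expressed as a small function once the primary metric has a p-value. The sketch below assumes a conversion-rate metric analyzed with a pooled two-proportion z-test. The function names and the example counts are hypothetical, and the guardrail verdict is passed in as a plain boolean.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_control, n_control, conv_variant, n_variant):
    """Return (absolute lift, two-sided p-value) from a pooled z-test."""
    rate_c = conv_control / n_control
    rate_v = conv_variant / n_variant
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    if se == 0:
        return rate_v - rate_c, 1.0
    z = (rate_v - rate_c) / se
    return rate_v - rate_c, 2 * (1 - NormalDist().cdf(abs(z)))


def decide(lift, p_value, guardrails_hold, alpha=0.05):
    """Map the primary result and guardrail status to a decision."""
    significant = p_value < alpha
    if significant and lift > 0:
        return "Ship" if guardrails_hold else "Investigate guardrail"
    if significant and lift < 0:
        return "Stop and diagnose"
    return "Not significant: check power or move on"


# Hypothetical counts: 5.0% vs 5.6% conversion on 30,000 users per variant.
lift, p = two_proportion_test(1500, 30_000, 1680, 30_000)
print(f"lift={lift:+.3%}, p={p:.4f}, decision={decide(lift, p, guardrails_hold=True)}")
```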

Reporting Template

## Experiment Report: [Name]
- **Hypothesis:** [If/then/because]
- **Duration:** [Start - End, N days]
- **Sample:** [N per variant]
- **Result:** [Significant/Not Significant]

| Metric | Control | Variant | Delta | P-value | Significant? |
|--------|---------|---------|-------|---------|-------------|
| Primary: [metric] | [value] | [value] | [+/-X%] | [p] | [Yes/No] |
| Guardrail: [metric] | [value] | [value] | [+/-X%] | [p] | [Yes/No] |

### Decision: [Ship / Don't Ship / Iterate]
### Rationale: [Why]
### Learning: [What we learned regardless of outcome]

When NOT to A/B Test

| Situation | Why | Alternative |
|-----------|-----|-------------|
| Very low traffic (<1,000/week) | Can't reach significance in reasonable time | User interviews, usability tests |
| One-way door decisions | Can't easily undo | Qualitative research, prototypes |
| Ethical concerns | Testing harm on one group | Expert review, staged rollout |
| Obvious improvements | Bug fix, broken UI, accessibility | Just ship it |
| Infrastructure changes | No user-facing metric to test | Technical benchmarks |

Common Mistakes

  1. Peeking and stopping early: Checking results daily and stopping the first time p < 0.05 can inflate the false positive rate from the nominal 5% to roughly 20-30%. Commit to the full duration.
  2. Underpowered tests: Running a test with 500 users when you need 50,000 tells you almost nothing. Calculate sample size BEFORE starting.
  3. Multiple comparisons: Testing 10 metrics and celebrating the one that's significant. With 10 independent metrics at alpha = 0.05, the chance of at least one false positive is about 40%. Apply a Bonferroni correction or pre-register your primary metric.
  4. Novelty effect: Users often engage with anything new. Run tests for 2+ weeks and watch for decay in the variant's advantage.
  5. Simpson's Paradox: A test can be positive overall but negative in every segment (or vice versa) when the segment mix differs between variants. Always segment results by key dimensions.
  6. No learning repository: If experiment results live in Slack threads, you'll re-run the same tests. Maintain a searchable experiment log.
  7. Only testing small changes: Button colors and copy tweaks have ceilings. The biggest wins tend to come from testing fundamentally different approaches.
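For the multiple-comparisons mistake, the Bonferroni correction divides alpha by the number of metrics tested. The sketch below shows both the correction and why it is needed. The function names, metric names, and p-values are hypothetical, and the error-rate formula assumes the metrics are independent.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each metric as significant only if p < alpha / number of metrics."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


print(f"{family_wise_error_rate(10):.1%}")  # about 40% for 10 metrics

# Hypothetical p-values for two secondary metrics (threshold becomes 0.025).
print(bonferroni_significant({"click_rate": 0.004, "time_on_page": 0.03}))
```

Bonferroni is conservative, so pre-registering a single primary metric is often the simpler way to avoid the problem.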

Install via CLI

npx skills add https://github.com/akhil08agrawal/product-management-skills --skill pm-experimentation