pm-experimentation - SKILL.md Agent Skill

name: pm-experimentation
description: "Design and run A/B tests and experiments: hypothesis design, sample sizing, statistical significance, and common pitfalls. De-risk product decisions with data."
user-invocable: true
argument-hint: "[feature to test or experimentation question]"

Experimentation & A/B Testing

Help the user design rigorous experiments, calculate sample sizes, avoid common pitfalls, and make data-driven product decisions.

When to Use

Testing whether a product change improves key metrics
Deciding between two or more design options
Validating a hypothesis before full investment
Building an experimentation culture and process
Interpreting experiment results and making ship decisions

Step 1: Write a Strong Hypothesis

Hypothesis Template

If we [specific change],
then [metric] will [increase/decrease] by [estimated magnitude],
because [rationale based on user insight or framework].

Examples

WEAK: "Changing the button color will improve conversions."
STRONG: "If we change the CTA from 'Sign Up' to 'Start Free Trial',
then signup rate will increase by 10-15%,
because users expressed hesitation about commitment in interviews,
and 'Free Trial' reduces perceived risk (Loss Aversion, B=MAT Ability)."

Every hypothesis needs:

A specific, measurable change
A specific metric to measure
An estimated direction and magnitude
A rationale grounded in user insight, not just intuition

Step 2: Choose Experiment Type

Type	What It Is	Best For	Complexity
A/B test	Two variants, random split	Clear single-variable tests	Low
A/B/n test	Multiple variants	Comparing 3-4 options	Medium
Multivariate	Multiple variables simultaneously	Optimizing combinations	High
Bandit	Dynamic allocation to winning variant	Minimizing regret during test	Medium
Switchback	Alternating treatments over time	Marketplace/supply-demand tests	High
Quasi-experiment	Non-random assignment (geo, cohort)	When randomization isn't possible	Medium

Default to A/B. Only use more complex designs when you have a specific reason.

Step 3: Calculate Sample Size

Key Variables

- Baseline conversion rate: [current metric value, e.g., 5%]
- Minimum Detectable Effect (MDE): [smallest change worth detecting, e.g., 10% relative]
- Statistical significance (alpha): 0.05 (standard)
- Statistical power (1-beta): 0.80 (standard)

Quick Reference Table

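Baseline Rate	MDE (Relative)	Sample Size per Variant
2%	10%	~81,000
5%	10%	~31,000
10%	10%	~15,000
20%	10%	~6,500
5%	20%	~8,200
10%	20%	~3,800

Figures assume a two-sided two-proportion z-test (normal approximation) at the alpha and power listed above, rounded.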
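The key variables above map onto the standard normal-approximation formula for comparing two proportions. The following is a minimal sketch, assuming a two-sided z-test with equal-sized variants. The function name `sample_size_per_variant` is illustrative, and dedicated calculators may use slightly different approximations.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    for baseline, mde in [(0.05, 0.10), (0.05, 0.20), (0.10, 0.20)]:
        n = sample_size_per_variant(baseline, mde)
        print(f"baseline={baseline:.0%} MDE={mde:.0%} -> {n:,} per variant")
```

Halving the MDE roughly quadruples the required sample, which is why the smallest change worth detecting should be chosen deliberately rather than optimistically.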

Duration Estimate

Test duration (days) = (Sample size per variant x number of variants) / Daily traffic to test area

For a standard A/B test, the number of variants is 2.

Rule of thumb: Run for at least 1 full business cycle (typically 1-2 weeks) to capture weekday/weekend effects, even if you hit sample size earlier.
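The duration formula and the rule of thumb can be combined in one helper. This is a sketch only: the name `estimate_duration_days` is hypothetical, and rounding up to whole weeks is an added assumption meant to keep weekdays and weekends equally represented.

```python
import math


def estimate_duration_days(sample_per_variant, daily_traffic, variants=2, min_days=7):
    """Days needed to reach the sample size, never shorter than min_days."""
    if daily_traffic <= 0:
        raise ValueError("daily_traffic must be positive")
    days_for_sample = math.ceil(sample_per_variant * variants / daily_traffic)
    # Assumption: round up to whole weeks so each weekday appears equally often.
    full_weeks = math.ceil(max(days_for_sample, min_days) / 7)
    return full_weeks * 7


if __name__ == "__main__":
    # Hypothetical inputs: 30,000 users per variant, 5,000 eligible users per day.
    print(estimate_duration_days(30_000, 5_000))
```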

Step 4: Set Metrics

Primary Metric

The single metric your hypothesis predicts will change. This is your decision metric.

Guardrail Metrics

Metrics that should NOT degrade. If they do, the change has unintended consequences.

Experiment: Simplify signup flow (remove email verification)
Primary metric: Signup completion rate
Guardrail metrics:
  - Spam account rate (should not increase)
  - D7 retention (should not decrease — removing verification
    might attract low-intent users)
  - Support tickets for account issues (should not increase)
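One way to make guardrails checkable is to record, for each metric, which direction is harmful and how much movement is tolerated. The sketch below uses the signup example. The class, the tolerances, and the metric values are all illustrative assumptions, and a fuller analysis would also test each change for statistical significance.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    bad_direction: str          # "up" if an increase is harmful, "down" if a decrease is
    max_relative_change: float  # tolerated relative movement in the harmful direction


def degraded_guardrails(guardrails, control, variant):
    """Return names of guardrails that moved too far in their harmful direction."""
    failed = []
    for g in guardrails:
        change = (variant[g.name] - control[g.name]) / control[g.name]
        harmful = change if g.bad_direction == "up" else -change
        if harmful > g.max_relative_change:
            failed.append(g.name)
    return failed


# Hypothetical guardrails and observed values for the signup-flow experiment.
GUARDRAILS = [
    Guardrail("spam_account_rate", "up", 0.05),
    Guardrail("d7_retention", "down", 0.02),
    Guardrail("account_support_tickets", "up", 0.05),
]
CONTROL = {"spam_account_rate": 0.010, "d7_retention": 0.400, "account_support_tickets": 120}
VARIANT = {"spam_account_rate": 0.013, "d7_retention": 0.398, "account_support_tickets": 118}

if __name__ == "__main__":
    print(degraded_guardrails(GUARDRAILS, CONTROL, VARIANT))
```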

Counter-Metric

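A metric that balances the primary metric against potential gaming. For example, if the primary metric is signup completion rate, a counter-metric could be the share of new signups that go on to activate, so that a rise in low-quality signups does not pass as a win.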

Step 5: Run the Experiment

Pre-Launch Checklist

- [ ] Hypothesis documented
- [ ] Sample size calculated and traffic confirmed
- [ ] Duration set (minimum 1-2 weeks)
- [ ] Primary metric, guardrails, and counter-metrics defined
- [ ] Randomization unit chosen (user-level, session-level, device-level)
- [ ] No other experiments running on the same population
- [ ] Tracking and logging verified in staging
- [ ] Rollback plan in place
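Once a randomization unit is chosen, assignment should be stable: the same unit must always see the same variant. A common approach is to hash the unit ID together with the experiment name. The sketch below assumes user-level randomization; `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(unit_id, experiment, variants=("control", "treatment")):
    """Deterministically map a randomization unit to a variant."""
    # Including the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    for user_id in ("user-1", "user-2", "user-3"):
        print(user_id, assign_variant(user_id, "signup-cta-copy"))
```

Because assignment is a pure function of the ID and the experiment name, it needs no stored state and can be recomputed during analysis to verify the split.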

During the Experiment

Do NOT stop early because interim results look significant. Checking repeatedly and stopping at the first p < 0.05 inflates the false positive rate.
Monitor guardrail metrics for catastrophic regressions
Log any external events that could confound results (outage, press, competitor launch)
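The cost of peeking can be illustrated with a small simulation of A/A tests, where there is no true effect. This sketch assumes equal daily traffic and models the cumulative z-statistic as a Gaussian random walk, which is a simplification rather than a replay of real experiment data.

```python
import math
import random


def false_positive_rates(n_sims=2000, n_days=14, z_crit=1.96, seed=0):
    """Simulate A/A tests; compare daily peeking with a single final look."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        crossed = False
        z = 0.0
        for day in range(1, n_days + 1):
            cumulative += rng.gauss(0.0, 1.0)  # one day's standardized evidence, no true effect
            z = cumulative / math.sqrt(day)    # z-statistic on all data collected so far
            if abs(z) > z_crit:
                crossed = True                 # a daily peeker would have stopped here
        peeking_hits += crossed
        final_hits += abs(z) > z_crit          # only the end-of-test look counts
    return peeking_hits / n_sims, final_hits / n_sims


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"stop at first significant peek: {peeking:.1%}")
    print(f"single look at the end:         {fixed:.1%}")
```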

Step 6: Analyze Results

Decision Framework

Primary metric is statistically significant AND positive?
├── YES → Check guardrail metrics
│   ├── All guardrails hold → SHIP IT
│   └── Guardrail degraded → Investigate. Can you fix the guardrail
│       issue while keeping the primary gain?
│       ├── Yes → Iterate and re-test
│       └── No → Don't ship
└── NO (not significant or negative)
    ├── Negative and significant → STOP. Diagnose why.
    └── Not significant → Underpowered? Rerun with a larger,
        pre-calculated sample. Otherwise the effect is too small
        to matter: accept and move on.
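The decision tree can also be written as a small function, which is useful when results are reviewed in a notebook or pipeline. This is a sketch: `ship_decision` and its return strings are hypothetical, and it assumes an increase in the primary metric is the desired direction.

```python
def ship_decision(p_value, delta, guardrails_hold, fixable=False, alpha=0.05):
    """Return a recommendation following the decision tree above.

    delta is variant minus control on the primary metric (positive is good).
    fixable says whether a degraded guardrail can be fixed while keeping the gain.
    """
    significant = p_value < alpha
    if significant and delta > 0:
        if guardrails_hold:
            return "ship"
        return "iterate and re-test" if fixable else "don't ship"
    if significant and delta < 0:
        return "stop and diagnose"
    return "not significant: check power, or accept and move on"


if __name__ == "__main__":
    print(ship_decision(p_value=0.01, delta=0.012, guardrails_hold=True))
    print(ship_decision(p_value=0.01, delta=0.012, guardrails_hold=False, fixable=True))
    print(ship_decision(p_value=0.30, delta=0.004, guardrails_hold=True))
```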

Reporting Template

## Experiment Report: [Name]
- **Hypothesis:** [If/then/because]
- **Duration:** [Start - End, N days]
- **Sample:** [N per variant]
- **Result:** [Significant/Not Significant]

| Metric | Control | Variant | Delta | P-value | Significant? |
|--------|---------|---------|-------|---------|-------------|
| Primary: [metric] | [value] | [value] | [+/-X%] | [p] | [Yes/No] |
| Guardrail: [metric] | [value] | [value] | [+/-X%] | [p] | [Yes/No] |

### Decision: [Ship / Don't Ship / Iterate]
### Rationale: [Why]
### Learning: [What we learned regardless of outcome]
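The Delta and P-value columns for a conversion metric can be filled in with a two-proportion z-test. The sketch below assumes a two-sided test with a pooled standard error. The function name and the counts in the usage example are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_control, n_control, conv_variant, n_variant):
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "control": p_c,
        "variant": p_v,
        "relative_delta": (p_v - p_c) / p_c,
        "z": z,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Hypothetical counts: 500/10,000 conversions in control, 570/10,000 in variant.
    result = two_proportion_test(500, 10_000, 570, 10_000)
    print(f"delta={result['relative_delta']:+.1%}  p={result['p_value']:.3f}")
```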

When NOT to A/B Test

Situation	Why	Alternative
Very low traffic (<1,000/week)	Can't reach significance in reasonable time	User interviews, usability tests
One-way door decisions	Can't easily undo	Qualitative research, prototypes
Ethical concerns	Testing harm on one group	Expert review, staged rollout
Obvious improvements	Bug fix, broken UI, accessibility	Just ship it
Infrastructure changes	No user-facing metric to test	Technical benchmarks

Common Mistakes

Peeking and stopping early: Checking results daily and stopping at the first p < 0.05 can inflate the false positive rate from the nominal 5% to roughly 20-30%. Commit to the full duration.
Underpowered tests: Running a test with 500 users when you need 50,000 proves nothing. Calculate sample size BEFORE starting.
Multiple comparisons: Testing 10 metrics and celebrating the one that's significant. At alpha 0.05, 10 independent metrics give roughly a 40% chance of at least one false positive. Apply a Bonferroni correction or pre-register your primary metric.
Novelty effect: Users engage with anything new. Run tests for 2+ weeks and watch for decay in the variant's advantage.
Simpson's Paradox: A test can be positive overall but negative in every segment (or vice versa), typically when the segment mix differs between variants. Always segment results by key dimensions.
No learning repository: If experiment results live in Slack threads, you'll re-run the same tests. Maintain a searchable experiment log.
Only testing small changes: Button colors and copy tweaks have ceilings. The biggest wins come from testing fundamentally different approaches.
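For the multiple-comparisons mistake, the Bonferroni correction is simple to apply: divide alpha by the number of metrics tested and compare each p-value against that stricter threshold. A minimal sketch, with hypothetical metric names and p-values:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which metrics remain significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # each test must clear alpha / number of tests
    return {name: p < threshold for name, p in p_values.items()}


if __name__ == "__main__":
    # Hypothetical p-values from one experiment that tracked four metrics.
    p_values = {
        "signup_rate": 0.004,
        "d7_retention": 0.030,
        "support_tickets": 0.410,
        "spam_rate": 0.650,
    }
    for metric, keep in bonferroni_significant(p_values).items():
        print(f"{metric}: {'significant' if keep else 'not significant'}")
```

Bonferroni is conservative, so it pairs well with the other remedy named above: pre-registering one primary metric and treating the rest as exploratory.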