name: ab-test-setup
description: Design A/B tests, split tests, multivariate experiments with statistical rigor. Triggers on "A/B test", "split test", "experiment", "test variant", "thử nghiệm", "test trang".
A/B Test Setup
Design statistically valid experiments that produce actionable results. Covers hypothesis formulation, sample size calculation, test duration, variant design, and result interpretation. Addresses low-traffic realities common in Vietnam and emerging markets.
Initial Assessment
Before designing any test, determine:
- What is being tested? Headline, CTA, layout, pricing, email subject, ad creative, or full page?
- Baseline metric? Current conversion rate, CTR, or engagement rate for the control.
- Traffic volume? Monthly unique visitors or impressions on the element being tested. Critical for feasibility.
- Minimum detectable effect (MDE)? What's the smallest improvement worth detecting? 5%? 10%? 20%?
- Testing tool? Google Optimize (sunset — use alternatives), VWO, Optimizely, PostHog, or manual split via UTM?
- Risk tolerance? Can the business afford a losing variant running for 2-4 weeks?
Core Methodology
Hypothesis Framework
Every test starts with a structured hypothesis. Never test without one.
Template: "Changing [element] from [current] to [variant] will [increase/decrease] [metric] by [expected %] because [rationale]."
Good hypothesis: "Changing the CTA from 'Đăng ký ngay' (Sign up now) to 'Dùng thử miễn phí 7 ngày' (7-day free trial) will increase signup rate by 15% because reducing perceived commitment lowers friction."
Bad hypothesis: "Let's test a new button color to see if it helps." (No metric, no rationale, no expected effect.)
Sample Size Calculation
Use these parameters to determine required sample size per variant:
| Parameter | Definition | Typical Value |
|---|---|---|
| Baseline rate | Current conversion rate | Varies (e.g., 3%) |
| MDE | Minimum detectable effect | 10-20% relative lift |
| Significance level (α) | False positive tolerance | 0.05 (95% confidence) |
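| Power (1-β) | Probability of detecting a true effect (β is the false negative rate) | 0.80 (80% power) |
Quick reference table (α=0.05, power=0.80; approximate visitors per variant, MDE expressed as relative lift):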
| Baseline Rate | MDE 10% | MDE 20% | MDE 30% |
|---|---|---|---|
| 1% | 159,000/var | 40,000/var | 18,000/var |
| 3% | 52,000/var | 13,000/var | 6,000/var |
| 5% | 31,000/var | 7,800/var | 3,500/var |
| 10% | 14,700/var | 3,700/var | 1,700/var |
| 20% | 6,400/var | 1,600/var | 730/var |
Read as: "With a 3% baseline and wanting to detect a 20% relative lift (3% → 3.6%), need ~13,000 visitors per variant."
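For baselines or MDE values not in the table, the figures can be approximated in code. The sketch below assumes a two-sided two-proportion z-test with the normal approximation; the function name `sample_size_per_variant` is illustrative, and the results land within a few percent of the rounded table values rather than matching them exactly.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant (two-sided two-proportion z-test)."""
    if not 0 < baseline < 1 or relative_mde <= 0:
        raise ValueError("baseline must be in (0, 1) and relative_mde must be positive")
    p1 = baseline
    p2 = baseline * (1 + relative_mde)              # rate the variant must reach
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for significance
    z_beta = NormalDist().inv_cdf(power)            # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 3% baseline, 20% relative lift (3% -> 3.6%)
print(sample_size_per_variant(0.03, 0.20))
```

Different calculators use slightly different formulas, so treat the output as a planning estimate, not an exact requirement.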
Test Duration Rules
- Minimum: 1 full business cycle (7 days minimum — captures weekday + weekend behavior)
- Maximum: 4 weeks. Beyond this, external factors contaminate results.
- Calculate: (required sample size per variant × number of variants) ÷ daily traffic to the tested page = minimum days
- Never stop early because a variant "looks like it's winning" — let it reach full sample size
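The duration rules above can be sketched as a small planning helper. The names `plan_duration`, `MIN_DAYS`, and `MAX_DAYS` are illustrative, and the helper assumes all daily traffic to the page enters the test.

```python
from math import ceil

MIN_DAYS = 7    # one full weekly cycle
MAX_DAYS = 28   # 4-week ceiling


def plan_duration(sample_per_variant: int, daily_visitors: int, variants: int = 2) -> dict:
    """Estimate test length in days and flag tests that exceed the 4-week ceiling."""
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    total_needed = sample_per_variant * variants
    raw_days = ceil(total_needed / daily_visitors)
    days = max(MIN_DAYS, raw_days)   # never shorter than one weekly cycle
    return {"total_needed": total_needed, "days": days, "feasible": days <= MAX_DAYS}


print(plan_duration(13_000, 1_000))   # roughly 1,000 visitors per day
print(plan_duration(13_000, 167))     # roughly 5,000 visitors per month
```

When `feasible` is false, the test needs a larger MDE, a higher-traffic page, or one of the low-traffic approaches instead of a longer run.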
Result Interpretation
| Outcome | Action |
|---|---|
| Variant wins with p < 0.05 and practical significance | Implement variant |
| Variant wins with p < 0.05 but tiny absolute lift (<0.5%) | Likely not worth implementing — too small to matter |
| No significant difference (p ≥ 0.05) | Inconclusive, not proof that the variants are equal. Consider testing a bolder change (larger MDE) or collecting more traffic. |
| Control wins | Keep control. Document learning. Test a different hypothesis. |
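The inputs to this table (lift, confidence interval, p-value) can be computed with a two-proportion z-test. The following is a minimal sketch using only the standard library; `two_proportion_test` is an illustrative name, the counts in the usage line are hypothetical, and the normal approximation assumes each variant has a reasonable number of conversions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int,
                        alpha: float = 0.05) -> dict:
    """Compare control (A) and variant (B) conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the hypothesis test (H0: both rates are equal).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval on the difference.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "absolute_lift": diff,
        "relative_lift": diff / p_a,
        "ci": (diff - z_crit * se, diff + z_crit * se),
        "p_value": p_value,
    }


# Hypothetical counts: 390/13,000 for control, 468/13,000 for the variant.
print(two_proportion_test(390, 13_000, 468, 13_000))
```

Read the p-value together with the absolute lift and its interval: a significant result whose interval sits close to zero falls into the second row of the table, not the first.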
Workflow
Scenario 1: Design a Full A/B Test
- Load context: which page/element, current metrics, traffic volume
- Formulate hypothesis using the template
- Calculate required sample size using the quick reference table
- Check feasibility: daily traffic × desired test duration ≥ required sample per variant × 2 (for 2 variants)
- If infeasible, apply low-traffic strategies (see Vietnam section below)
- Design variants:
  - Control (A): current version, unchanged
  - Variant (B): one change only. Never change multiple elements simultaneously.
  - If multivariate: explicitly declare interaction effects being tested
- Define primary metric (one only) and secondary metrics (2-3 supporting)
- Specify traffic split: 50/50 for standard, 90/10 for high-risk changes
- Document test plan
- Save to assets/reports/ab-test-[name]-YYYY-MM-DD.md
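When the split is implemented manually rather than through a testing tool, one common approach is deterministic hashing, so a returning visitor always sees the same version. The sketch below is an assumption about how that could look; `assign_variant`, the experiment name, and the user ID format are all hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variant_share: float = 0.5) -> str:
    """Deterministically bucket a user into control (A) or variant (B)."""
    # Salting with the experiment name keeps buckets independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "B" if bucket < variant_share else "A"


# Standard 50/50 split, then a 90/10 split for a high-risk change.
print(assign_variant("user-123", "cta-copy-test"))
print(assign_variant("user-123", "pricing-test", variant_share=0.10))
```

Because the assignment depends only on the ID and the experiment name, no assignment table needs to be stored, but the ID must be stable across visits (a login ID or first-party cookie, for example).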
Scenario 2: Generate Test Ideas
- Load the page or flow being optimized
- Identify high-impact elements to test (prioritize by traffic × conversion potential):
  - Headlines and value propositions
  - CTA text, color, placement
  - Form length and field order
  - Social proof placement and format
  - Pricing display and framing
  - Image vs no image, image type
- Score each idea on ICE: Impact (1-10), Confidence (1-10), Ease (1-10)
- Rank by ICE score. Top 3-5 become the test backlog.
- Save to assets/reports/test-backlog-[page]-YYYY-MM-DD.md
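The ICE ranking step can be sketched as follows. The text above does not say how the three scores combine, so this example assumes the mean of the three (some teams multiply them instead); the `Idea` class, `build_backlog`, and the scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int       # 1-10
    confidence: int   # 1-10
    ease: int         # 1-10

    @property
    def ice(self) -> float:
        # Assumption: ICE score is the mean of the three components.
        return (self.impact + self.confidence + self.ease) / 3


def build_backlog(ideas: list, top_n: int = 5) -> list:
    """Return the highest-scoring ideas, best first."""
    return sorted(ideas, key=lambda idea: idea.ice, reverse=True)[:top_n]


ideas = [
    Idea("Headline rewrite", impact=8, confidence=6, ease=9),
    Idea("Pricing framing", impact=9, confidence=5, ease=3),
    Idea("CTA text", impact=7, confidence=7, ease=8),
]
for idea in build_backlog(ideas, top_n=3):
    print(f"{idea.name}: {idea.ice:.1f}")
```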
Scenario 3: Analyze Test Results
- Collect final data: visitors, conversions, rate per variant
- Calculate: observed lift, confidence interval, p-value
- Check for:
  - Sample ratio mismatch (SRM): are variants receiving traffic in the planned ratio (e.g., 50/50)?
  - Novelty effect: did variant performance decay over time?
  - Segment differences: did the effect vary by device, geography, or source?
- Determine outcome using the Result Interpretation table
- Document: hypothesis, setup, results, decision, learnings
- Save to assets/reports/ab-test-results-[name]-YYYY-MM-DD.md
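The sample ratio mismatch check in the steps above is usually done with a chi-square goodness-of-fit test on visitor counts. A minimal sketch for two variants, using only the standard library, is shown here; `srm_p_value` is an illustrative name, and the alert threshold mentioned in the comment is an assumption, not something the workflow prescribes.

```python
from math import erfc, sqrt


def srm_p_value(n_a: int, n_b: int, expected_share_b: float = 0.5) -> float:
    """P-value that the observed split matches the planned split (two variants)."""
    total = n_a + n_b
    exp_a = total * (1 - expected_share_b)
    exp_b = total * expected_share_b
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for a planned 50/50 split.
p = srm_p_value(5_000, 5_400)
# Assumption: a strict threshold such as 0.01 is used to flag a mismatch.
print("SRM suspected" if p < 0.01 else "Split looks fine", p)
```

A very small p-value suggests the assignment or tracking is broken, in which case the conversion comparison should not be trusted until the cause is found.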
Vietnam-Specific Testing Realities
Low-Traffic Challenge
Most Vietnamese SMB websites receive 1,000-10,000 monthly visitors. This makes traditional A/B testing difficult:
Problem: At 5,000 monthly visitors and 3% baseline conversion, detecting a 20% lift requires 13,000 visitors per variant = 26,000 total = 5+ months. Impractical.
Solutions for low traffic:
- Increase MDE tolerance. Accept detecting only 30-50% relative lifts. Test bold changes, not subtle tweaks: "Đăng ký ngay" (Sign up now) vs "Dùng thử miễn phí 7 ngày" (7-day free trial) is a big change; button color red vs blue is a tiny one.
- Test on higher-traffic pages first. Homepage, pricing page, or high-traffic blog posts.
- Use before/after testing instead of a simultaneous split. Run Control for 2 weeks, then Variant for 2 weeks, and compare. Less rigorous (seasonality and campaigns can confound the comparison) but actionable with low traffic.
- Test in paid traffic. Run identical ad sets pointing to different landing pages. Traffic is controllable and can be scaled.
- Aggregate micro-conversions. Instead of testing purchase rate (a rare event), test add-to-cart, scroll depth, or time-on-page (more frequent events).
- Bandit testing. Use multi-armed bandit algorithms that shift traffic toward the winning variant faster. Less statistical rigor but better for low-traffic optimization.
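To illustrate the bandit option, here is a small simulation of Thompson sampling, one common multi-armed bandit algorithm (the text above does not name a specific one). The function name and the "true" conversion rates are hypothetical; in a live test the conversion outcome would come from real visitors, not from a random draw.

```python
import random


def thompson_sampling(true_rates: list, rounds: int, seed: int = 0) -> list:
    """Simulate Beta-Bernoulli Thompson sampling; return visitors sent to each arm."""
    rng = random.Random(seed)
    arms = range(len(true_rates))
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible conversion rate for each arm from its Beta posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in arms]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulated visitor: converts with the chosen arm's hypothetical true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Control converts at 3%, variant at 8% (made-up rates for the simulation).
print(thompson_sampling([0.03, 0.08], rounds=20_000))
```

Over time most traffic flows to the better arm, which limits the cost of a weak variant. The trade-off is that the losing arm ends up with little data, so the size of the difference is estimated less precisely than in a fixed split.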
Free and Low-Cost Testing Tools for Vietnam
| Tool | Cost | Best For |
|---|---|---|
| Google Analytics 4 | Free | Measuring tests run in another tool (GA4 has no built-in A/B testing; compare variants via audiences or custom dimensions) |
| PostHog | Free tier (1M events) | Full-featured, self-hosted option |
| Microsoft Clarity | Free | Heatmaps + session recordings (not A/B, but insight tool) |
| Splitbee / Vercel Analytics | Free tier | Simple split testing for developers |
| Manual UTM split | Free | Two landing pages, split traffic via UTM parameters |
| Facebook Ads A/B | Free (within ad spend) | Creative and audience testing |
Vietnamese User Behavior in Tests
- Mobile dominance: 85%+ of Vietnamese web traffic is mobile. Always test mobile-first. A desktop-only test is meaningless.
- Scroll behavior: Vietnamese users scroll aggressively — below-fold content gets seen. Don't assume "above the fold" is the only battleground.
- Trust signals move needles: Adding "Cam kết hoàn tiền" (money-back guarantee), a hotline phone number (SĐT), or a Zalo button often produces larger lifts than copy changes.
- Price sensitivity: Any test involving pricing or discounts will produce outsized effects. Be careful — discount tests often win but destroy margins.
Output Specification
- Format: Markdown with hypothesis, parameters, timeline, and variant descriptions
- Location: assets/reports/
- Naming: ab-test-[name]-YYYY-MM-DD.md, test-backlog-[page]-YYYY-MM-DD.md, ab-test-results-[name]-YYYY-MM-DD.md
Quality Checklist
- Hypothesis written in structured template format
- Sample size calculated with stated parameters (baseline, MDE, α, power)
- Feasibility checked against actual traffic volume
- Low-traffic strategy applied if standard test is infeasible
- Only one variable changed per variant (unless explicit MVT)
- Primary metric defined (one only)
- Test duration includes full weekly cycle (min 7 days)
- Mobile-first consideration for Vietnamese audience
- Output saved to assets/reports/ with correct naming
Related Skills
- form-cro: Test ideas for form optimization
- marketing-psychology: Behavioral triggers to test as variants
- analytics: Tracking setup for experiment measurement
- copywriting: Writing variant copy for headlines and CTAs