ab-test-setup

Design A/B tests, split tests, and multivariate experiments with statistical rigor. Triggers on "A/B test", "split test", "experiment", "test variant", "thử nghiệm" (Vietnamese for "experiment"), "test trang" ("page test").

By HoangXuyen-STEM. Updated 2/27/2026.

A/B Test Setup

Design statistically valid experiments that produce actionable results. Covers hypothesis formulation, sample size calculation, test duration, variant design, and result interpretation. Addresses low-traffic realities common in Vietnam and emerging markets.

Initial Assessment

Before designing any test, determine:

  1. What is being tested? Headline, CTA, layout, pricing, email subject, ad creative, or full page?
  2. Baseline metric? Current conversion rate, CTR, or engagement rate for the control.
  3. Traffic volume? Monthly unique visitors or impressions on the element being tested. Critical for feasibility.
  4. Minimum detectable effect (MDE)? What is the smallest relative improvement worth detecting: 5%, 10%, or 20%?
  5. Testing tool? VWO, Optimizely, PostHog, or a manual split via UTM? (Google Optimize was sunset in September 2023, so use an alternative.)
  6. Risk tolerance? Can the business afford a losing variant running for 2-4 weeks?

Core Methodology

Hypothesis Framework

Every test starts with a structured hypothesis. Never test without one.

Template: "Changing [element] from [current] to [variant] will [increase/decrease] [metric] by [expected %] because [rationale]."

Good hypothesis: "Changing the CTA from 'Đăng ký ngay' ('Sign up now') to 'Dùng thử miễn phí 7 ngày' ('Free 7-day trial') will increase signup rate by 15% because reducing perceived commitment lowers friction."

Bad hypothesis: "Let's test a new button color to see if it helps." (No metric, no rationale, no expected effect.)

Sample Size Calculation

Use these parameters to determine required sample size per variant:

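| Parameter | Definition | Typical Value |
| --- | --- | --- |
| Baseline rate | Current conversion rate of the control | Varies (e.g., 3%) |
| MDE | Minimum detectable effect, as a relative lift over the baseline | 10-20% relative lift |
| Significance level (α) | Accepted false positive rate | 0.05 (95% confidence) |
| Power (1-β) | Probability of detecting a true effect of at least the MDE; β is the accepted false negative rate | 0.80 (80% power, β = 0.20) |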

Quick reference table (α=0.05, power=0.80):

| Baseline Rate | MDE 10% | MDE 20% | MDE 30% |
| --- | --- | --- | --- |
| 1% | 159,000/var | 40,000/var | 18,000/var |
| 3% | 52,000/var | 13,000/var | 6,000/var |
| 5% | 31,000/var | 7,800/var | 3,500/var |
| 10% | 14,700/var | 3,700/var | 1,700/var |
| 20% | 6,400/var | 1,600/var | 730/var |

Read as: "With a 3% baseline and a target of detecting a 20% relative lift (3% → 3.6%), you need ~13,000 visitors per variant."
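For baselines or MDEs that fall between the table's rows, the sample size can be computed directly. The sketch below assumes the standard normal-approximation formula for a two-sided, two-proportion z-test; the function name `sample_size_per_variant` is illustrative and not part of this skill. Its results are close to, but not identical with, the rounded table values (for a 3% baseline and 20% relative MDE it gives roughly 13,900 instead of ~13,000), which suggests the table was produced with a slightly different approximation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-sided two-proportion z-test).

    baseline     -- control conversion rate, e.g. 0.03 for 3%
    relative_mde -- relative lift to detect, e.g. 0.20 for +20%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate the variant must reach
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the two binomial variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    for mde in (0.10, 0.20, 0.30):
        n = sample_size_per_variant(0.03, mde)
        print(f"3% baseline, {mde:.0%} relative MDE: {n:,} visitors per variant")
```

Halving the MDE roughly quadruples the required sample, which is why small expected lifts are so expensive to test.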

Test Duration Rules

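  • Minimum: 1 full business cycle (at least 7 days, to capture both weekday and weekend behavior)
  • Maximum: 4 weeks. Beyond this, external factors such as seasonality and overlapping campaigns contaminate results.
  • Calculate: (required sample size per variant × number of variants) ÷ daily traffic to the tested page = minimum days
  • Never stop early because a variant "looks like it's winning". Repeatedly peeking and stopping at the first significant result inflates the false positive rate, so let the test reach its full sample size.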

Result Interpretation

| Outcome | Action |
| --- | --- |
| Variant wins with p < 0.05 and practical significance | Implement variant |
| Variant wins with p < 0.05 but tiny absolute lift (under 0.5 percentage points) | Likely not worth implementing: too small to matter |
| No significant difference (p ≥ 0.05) | Inconclusive, which is not proof that the variants are equal. Consider a larger MDE or more traffic. |
| Control wins (variant significantly worse, p < 0.05) | Keep control. Document the learning. Test a different hypothesis. |
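One way to turn the outcomes above into a repeatable check is a pooled two-proportion z-test followed by a decision function. The sketch below is an assumption about how the p-value and lift might be computed; the names `two_proportion_test` and `interpret` are hypothetical, and it presumes both arms have at least one conversion. The 0.5-percentage-point threshold comes from the table.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Compare control (A) and variant (B) with a two-sided z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error is used for the hypothesis test
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))
    # Unpooled standard error is used for the confidence interval of the difference
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_diff
    return {"abs_lift": diff, "rel_lift": diff / p_a,
            "p_value": p_value, "ci": (diff - margin, diff + margin)}


def interpret(result, alpha=0.05, min_abs_lift=0.005):
    """Map a test result onto the outcome table."""
    if result["p_value"] >= alpha:
        return "inconclusive"
    if result["abs_lift"] < 0:
        return "keep control"
    if result["abs_lift"] < min_abs_lift:
        return "significant but too small to matter"
    return "implement variant"


if __name__ == "__main__":
    result = two_proportion_test(conv_a=390, n_a=13_000, conv_b=468, n_b=13_000)
    print(result, interpret(result))
```

The confidence interval is worth reporting alongside the decision, since it shows how small the true lift could plausibly be.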

Workflow

Scenario 1: Design a Full A/B Test

  1. Load context: which page/element, current metrics, traffic volume
  2. Formulate hypothesis using the template
  3. Calculate required sample size using the quick reference table
  4. Check feasibility: daily traffic × desired test duration ≥ required sample × 2 (for 2 variants)
  5. If infeasible, apply low-traffic strategies (see Vietnam section below)
  6. Design variants:
    • Control (A): current version, unchanged
    • Variant (B): one change only. Never change multiple elements simultaneously.
    • If multivariate: explicitly declare interaction effects being tested
  7. Define primary metric (one only) and secondary metrics (2-3 supporting)
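  8. Specify traffic split: 50/50 for standard tests, 90/10 (control/variant) for high-risk changes. With an uneven split, the smaller arm still has to reach the required sample size, so the test runs longer.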
  9. Document test plan
  10. Save to assets/reports/ab-test-[name]-YYYY-MM-DD.md
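For a manual split, the traffic allocation in step 8 can be implemented with deterministic hashing, so a returning visitor always sees the same variant. This is a minimal sketch, not tied to any of the tools named in this document; `assign_variant`, the `user_id` values, and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, control_share=0.5):
    """Deterministically assign a visitor to 'A' (control) or 'B' (variant).

    Hashing the experiment name together with the user ID keeps assignments
    stable across visits and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map first 32 bits to [0, 1)
    return "A" if bucket < control_share else "B"


if __name__ == "__main__":
    # Standard 50/50 split
    print(assign_variant("visitor-123", "cta-copy-test"))
    # High-risk change: 90% control, 10% variant
    print(assign_variant("visitor-123", "pricing-test", control_share=0.9))
```

Assigning by a stable identifier (a first-party cookie or account ID) rather than per page view avoids showing one visitor both versions.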

Scenario 2: Generate Test Ideas

  1. Load the page or flow being optimized
  2. Identify high-impact elements to test (prioritize by traffic × conversion potential):
    • Headlines and value propositions
    • CTA text, color, placement
    • Form length and field order
    • Social proof placement and format
    • Pricing display and framing
    • Image vs no image, image type
  3. Score each idea on ICE: Impact (1-10), Confidence (1-10), Ease (1-10)
  4. Rank by ICE score. Top 3-5 become the test backlog.
  5. Save to assets/reports/test-backlog-[page]-YYYY-MM-DD.md
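The ICE ranking in steps 3 and 4 can be sketched as follows. The workflow does not say how the three scores are combined, so this example assumes the average (multiplying them is another common convention). The ideas and their scores are made-up illustrations, not recommendations.

```python
def ice_score(impact, confidence, ease):
    """Combine the three 1-10 ratings; the average is an assumed convention."""
    return (impact + confidence + ease) / 3


# Hypothetical ideas with illustrative scores
ideas = [
    {"name": "Rewrite headline around main benefit", "impact": 8, "confidence": 6, "ease": 9},
    {"name": "Shorten signup form to 3 fields", "impact": 7, "confidence": 7, "ease": 5},
    {"name": "Add testimonials above the CTA", "impact": 6, "confidence": 5, "ease": 8},
    {"name": "Redesign pricing page layout", "impact": 9, "confidence": 4, "ease": 2},
]

for idea in ideas:
    idea["ice"] = ice_score(idea["impact"], idea["confidence"], idea["ease"])

# Highest score first; the top 3-5 form the test backlog
backlog = sorted(ideas, key=lambda idea: idea["ice"], reverse=True)[:5]

if __name__ == "__main__":
    for rank, idea in enumerate(backlog, start=1):
        print(f"{rank}. {idea['name']} (ICE {idea['ice']:.1f})")
```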

Scenario 3: Analyze Test Results

  1. Collect final data: visitors, conversions, rate per variant
  2. Calculate: observed lift, confidence interval, p-value
  3. Check for:
    • Sample ratio mismatch (SRM): did each variant receive traffic in the planned ratio (e.g., 50/50)?
    • Novelty effect — did variant performance decay over time?
    • Segment differences — did the effect vary by device, geography, or source?
  4. Determine outcome using the Result Interpretation table
  5. Document: hypothesis, setup, results, decision, learnings
  6. Save to assets/reports/ab-test-results-[name]-YYYY-MM-DD.md
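The sample ratio mismatch check in step 3 can be done with a chi-square goodness-of-fit test before looking at conversion numbers. The sketch below handles two variants using only the standard library; `srm_p_value` is a hypothetical name, and the 0.001 alert threshold is a commonly used convention assumed here, not a value from this document.

```python
from math import erfc, sqrt


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square test (1 degree of freedom) of observed vs planned traffic split."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((n_a - expected_a) ** 2 / expected_a
            + (n_b - expected_b) ** 2 / expected_b)
    # Survival function of chi-square with 1 dof, written via erfc
    return erfc(sqrt(chi2 / 2))


def has_srm(n_a, n_b, expected_share_a=0.5, threshold=0.001):
    """Flag a mismatch when the split is very unlikely under the planned ratio."""
    return srm_p_value(n_a, n_b, expected_share_a) < threshold


if __name__ == "__main__":
    print(has_srm(5_020, 4_980))   # small imbalance
    print(has_srm(5_300, 4_700))   # large imbalance
```

If a mismatch is flagged, the assignment or tracking setup is suspect and the conversion comparison should not be trusted until the cause is found.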

Vietnam-Specific Testing Realities

Low-Traffic Challenge

Most Vietnamese SMB websites receive 1,000-10,000 monthly visitors. This makes traditional A/B testing difficult:

Problem: At 5,000 monthly visitors and 3% baseline conversion, detecting a 20% lift requires 13,000 visitors per variant = 26,000 total = 5+ months. Impractical.
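The arithmetic behind this problem can be repeated for each MDE column of the quick reference table. The short sketch below uses the table's per-variant figures for a 3% baseline and assumes two variants and that all 5,000 monthly visitors reach the tested page.

```python
MONTHLY_VISITORS = 5_000
VARIANTS = 2

# Visitors per variant at a 3% baseline, taken from the quick reference table
per_variant_by_mde = {"10%": 52_000, "20%": 13_000, "30%": 6_000}

# Months needed = total sample across variants / monthly traffic
months_needed = {
    mde: VARIANTS * per_variant / MONTHLY_VISITORS
    for mde, per_variant in per_variant_by_mde.items()
}

if __name__ == "__main__":
    for mde, months in months_needed.items():
        print(f"MDE {mde}: {months:.1f} months")
```

Even the 30% column takes more than two months at this traffic level, well beyond the 4-week maximum.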

Solutions for low traffic:

  1. Increase MDE tolerance. Accept detecting only 30-50% relative lifts. Test bold changes, not subtle tweaks: "Đăng ký ngay" ("Sign up now") vs "Dùng thử miễn phí 7 ngày" ("Free 7-day trial") is a big change, while a red vs blue button color is a tiny one.

  2. Test on higher-traffic pages first. Homepage, pricing page, or high-traffic blog posts.

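  3. Use before/after testing instead of simultaneous split. Run Control for 2 weeks, then Variant for 2 weeks, and compare. Less rigorous, because anything else that changes between the two periods (seasonality, campaigns) is confounded with the variant, but actionable with low traffic.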

  4. Test in paid traffic. Run identical ad sets pointing to different landing pages. Traffic is controllable and can be scaled.

  5. Aggregate micro-conversions. Instead of testing purchase rate (rare event), test add-to-cart, scroll depth, or time-on-page (more frequent events).

  6. Bandit testing. Use multi-armed bandit algorithms that shift traffic toward the winning variant faster. Less statistical rigor but better for low-traffic optimization.
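As an illustration of the bandit approach in item 6, the sketch below uses Thompson sampling with Beta posteriors, one common multi-armed bandit algorithm. The document does not name a specific algorithm, so this choice, the uniform Beta(1, 1) prior, and the function name `thompson_pick` are all assumptions.

```python
import random


def thompson_pick(stats, rng=random):
    """Choose which variant to show the next visitor.

    stats -- {variant_name: (conversions, visitors)} observed so far
    Each variant's conversion rate gets a Beta posterior (uniform prior);
    one sample is drawn per variant and the highest draw wins.
    """
    draws = {
        name: rng.betavariate(1 + conversions, 1 + (visitors - conversions))
        for name, (conversions, visitors) in stats.items()
    }
    return max(draws, key=draws.get)


if __name__ == "__main__":
    observed = {"A": (12, 400), "B": (21, 400)}
    picks = [thompson_pick(observed) for _ in range(1_000)]
    print({name: picks.count(name) for name in observed})
```

Variants with little data still get shown occasionally because their posteriors are wide, while a clearly better variant receives most of the traffic. The trade-off is the one noted above: conversions improve sooner, but the final comparison carries less statistical rigor than a fixed-sample test.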

Free and Low-Cost Testing Tools for Vietnam

| Tool | Cost | Best For |
| --- | --- | --- |
| Google Analytics 4 | Free | Measuring variants via GA4 audiences; GA4 has no built-in experiment runner, so pair it with a separate splitting method |
| PostHog | Free tier (1M events) | Full-featured, self-hosted option |
| Microsoft Clarity | Free | Heatmaps + session recordings (not A/B, but insight tool) |
| Splitbee / Vercel Analytics | Free tier | Simple split testing for developers |
| Manual UTM split | Free | Two landing pages, split traffic via UTM parameters |
| Facebook Ads A/B | Free (within ad spend) | Creative and audience testing |

Vietnamese User Behavior in Tests

  • Mobile dominance: 85%+ of Vietnamese web traffic is mobile. Always test mobile-first. A desktop-only test is meaningless.
  • Scroll behavior: Vietnamese users scroll aggressively — below-fold content gets seen. Don't assume "above the fold" is the only battleground.
  • Trust signals move needles: Adding "Cam kết hoàn tiền" (a money-back guarantee), a hotline phone number (SĐT), or a Zalo button often produces larger lifts than copy changes.
  • Price sensitivity: Any test involving pricing or discounts will produce outsized effects. Be careful — discount tests often win but destroy margins.

Output Specification

  • Format: Markdown with hypothesis, parameters, timeline, and variant descriptions
  • Location: assets/reports/
  • Naming: ab-test-[name]-YYYY-MM-DD.md, test-backlog-[page]-YYYY-MM-DD.md, ab-test-results-[name]-YYYY-MM-DD.md

Quality Checklist

  • Hypothesis written in structured template format
  • Sample size calculated with stated parameters (baseline, MDE, α, power)
  • Feasibility checked against actual traffic volume
  • Low-traffic strategy applied if standard test is infeasible
  • Only one variable changed per variant (unless explicit MVT)
  • Primary metric defined (one only)
  • Test duration includes full weekly cycle (min 7 days)
  • Mobile-first consideration for Vietnamese audience
  • Output saved to assets/reports/ with correct naming

Related Skills

  • form-cro — Test ideas for form optimization
  • marketing-psychology — Behavioral triggers to test as variants
  • analytics — Tracking setup for experiment measurement
  • copywriting — Writing variant copy for headlines and CTAs
Install via CLI
npx skills add https://github.com/HoangXuyen-STEM/antigravity-marketing-kit --skill ab-test-setup
Repository Details
Branch: main
Path: SKILL.md