ab-test-setup - SKILL.md Agent Skill

name: ab-test-setup
description: Design A/B tests, split tests, multivariate experiments with statistical rigor. Triggers on "A/B test", "split test", "experiment", "test variant", "thử nghiệm", "test trang".

A/B Test Setup

Design statistically valid experiments that produce actionable results. Covers hypothesis formulation, sample size calculation, test duration, variant design, and result interpretation. Addresses low-traffic realities common in Vietnam and emerging markets.

Initial Assessment

Before designing any test, determine:

What is being tested? Headline, CTA, layout, pricing, email subject, ad creative, or full page?
Baseline metric? Current conversion rate, CTR, or engagement rate for the control.
Traffic volume? Monthly unique visitors or impressions on the element being tested. Critical for feasibility.
Minimum detectable effect (MDE)? What's the smallest improvement worth detecting? 5%? 10%? 20%?
Testing tool? VWO, Optimizely, PostHog, or a manual split via UTM? (Google Optimize was sunset in September 2023, so pick an alternative.)
Risk tolerance? Can the business afford a losing variant running for 2-4 weeks?

Core Methodology

Hypothesis Framework

Every test starts with a structured hypothesis. Never test without one.

Template: "Changing [element] from [current] to [variant] will [increase/decrease] [metric] by [expected %] because [rationale]."

Good hypothesis: "Changing the CTA from 'Đăng ký ngay' (Sign up now) to 'Dùng thử miễn phí 7 ngày' (Try free for 7 days) will increase signup rate by 15% because reducing perceived commitment lowers friction."

Bad hypothesis: "Let's test a new button color to see if it helps." (No metric, no rationale, no expected effect.)

Sample Size Calculation

Use these parameters to determine required sample size per variant:

Parameter	Definition	Typical Value
Baseline rate	Current conversion rate	Varies (e.g., 3%)
MDE	Smallest relative lift worth detecting	10-20% relative lift
Significance level (α)	False positive rate you accept	0.05 (95% confidence)
Power (1-β)	Probability of detecting a true effect of at least the MDE (β is the false negative rate)	0.80 (80% power)

Quick reference table (α=0.05, power=0.80; approximate visitors per variant, MDE as relative lift):

Baseline Rate	MDE 10%	MDE 20%	MDE 30%
1%	159,000/var	40,000/var	18,000/var
3%	52,000/var	13,000/var	6,000/var
5%	31,000/var	7,800/var	3,500/var
10%	14,700/var	3,700/var	1,700/var
20%	6,400/var	1,600/var	730/var

Read as: "With a 3% baseline and wanting to detect a 20% relative lift (3% → 3.6%), need ~13,000 visitors per variant."
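
For baselines or MDEs that fall between the table rows, the figure can be computed directly. The following is a minimal sketch, assuming the standard normal-approximation formula for a two-sided test on two proportions; the function name `sample_size_per_variant` is illustrative and not part of this skill. It returns values a few percent above the table, which appears to be rounded from a similar formula.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-sided two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate the variant must reach
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# 3% baseline, 20% relative lift (3% -> 3.6%)
n = sample_size_per_variant(baseline=0.03, relative_mde=0.20)
print(n)  # just under 14,000, the same order as the table's ~13,000
```

Treat the result as a planning estimate; dedicated calculators may differ slightly depending on the approximation they use.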

Test Duration Rules

Minimum: 1 full business cycle (7 days minimum — captures weekday + weekend behavior)
Maximum: 4 weeks. Beyond this, external factors contaminate results.
Calculate: (required sample size per variant × number of variants) ÷ daily traffic to the tested page = minimum days
Never stop early because a variant "looks like it's winning" — let it reach full sample size

Result Interpretation

Outcome	Action
Variant wins with p < 0.05 and practical significance	Implement variant
Variant wins with p < 0.05 but tiny absolute lift (<0.5 percentage points)	Likely not worth implementing; the gain is too small to matter
No significant difference (p ≥ 0.05)	Inconclusive, not proof that the variants are equal. Consider testing a bolder change (larger MDE) or collecting more traffic.
Control wins	Keep control. Document learning. Test a different hypothesis.
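
The table above can be expressed as a simple decision function. This is a sketch only: the name `interpret_result` and the returned labels are illustrative, and the 0.005 default encodes the 0.5 percentage point threshold from the table.

```python
def interpret_result(p_value, control_rate, variant_rate,
                     alpha=0.05, min_abs_lift=0.005):
    """Map a finished test to one of the actions in the interpretation table."""
    abs_lift = variant_rate - control_rate
    if p_value >= alpha:
        return "inconclusive"
    if abs_lift < 0:
        return "keep control"
    if abs_lift < min_abs_lift:
        return "significant but too small to implement"
    return "implement variant"


print(interpret_result(0.01, control_rate=0.030, variant_rate=0.040))
# implement variant
print(interpret_result(0.20, control_rate=0.030, variant_rate=0.040))
# inconclusive
```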

Workflow

Scenario 1: Design a Full A/B Test

Load context: which page/element, current metrics, traffic volume
Formulate hypothesis using the template
Calculate required sample size using the quick reference table
Check feasibility: daily traffic × desired test duration ≥ required sample per variant × 2 (for 2 variants)
If infeasible, apply low-traffic strategies (see Vietnam section below)
Design variants:
- Control (A): current version, unchanged
- Variant (B): one change only. Never change multiple elements simultaneously.
- If multivariate: explicitly declare interaction effects being tested
Define primary metric (one only) and secondary metrics (2-3 supporting)
Specify traffic split: 50/50 for standard, 90/10 for high-risk changes (the 10% arm fills slowly, so expect a longer test)
Document test plan
Save to assets/reports/ab-test-[name]-YYYY-MM-DD.md
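
For a manual split without a testing tool, the traffic split can be implemented by hashing a stable user identifier. The sketch below is one possible approach, not something this skill prescribes; `assign_variant`, the experiment name, and the user ID format are all hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variant_share=0.5):
    """Deterministically bucket a user so repeat visits see the same version."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "B" if bucket < variant_share else "A"


# Standard 50/50 split
print(assign_variant("user-42", "cta-free-trial"))
# High-risk change: only 10% of users see the variant
print(assign_variant("user-42", "pricing-layout", variant_share=0.10))
```

Including the experiment name in the hash key keeps assignments independent across tests, so a user who lands in B for one experiment is not systematically in B for the next.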

Scenario 2: Generate Test Ideas

Load the page or flow being optimized
Identify high-impact elements to test (prioritize by traffic × conversion potential):
- Headlines and value propositions
- CTA text, color, placement
- Form length and field order
- Social proof placement and format
- Pricing display and framing
- Image vs no image, image type
Score each idea on ICE: Impact (1-10), Confidence (1-10), Ease (1-10)
Rank by ICE score. Top 3-5 become the test backlog.
Save to assets/reports/test-backlog-[page]-YYYY-MM-DD.md
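
The scoring and ranking steps can be sketched as follows. The steps above do not say how the three ICE scores are combined; this example assumes a plain average (some teams multiply instead). The `Idea` class, the idea names, and the scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    @property
    def ice(self):
        # Assumption: ICE is the plain average of the three scores.
        return (self.impact + self.confidence + self.ease) / 3


def build_backlog(ideas, top_n=5):
    """Return the highest-scoring ideas, best first."""
    return sorted(ideas, key=lambda idea: idea.ice, reverse=True)[:top_n]


ideas = [
    Idea("Shorter signup form", impact=7, confidence=6, ease=5),
    Idea("Free-trial CTA text", impact=8, confidence=7, ease=9),
    Idea("New hero image", impact=4, confidence=5, ease=8),
]
for idea in build_backlog(ideas, top_n=3):
    print(f"{idea.ice:.1f}  {idea.name}")
```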

Scenario 3: Analyze Test Results

Collect final data: visitors, conversions, rate per variant
Calculate: observed lift, confidence interval, p-value
Check for:
- Sample ratio mismatch (SRM): does the observed traffic split match the planned split (e.g., 50/50)?
- Novelty effect — did variant performance decay over time?
- Segment differences — did the effect vary by device, geography, or source?
Determine outcome using the Result Interpretation table
Document: hypothesis, setup, results, decision, learnings
Save to assets/reports/ab-test-results-[name]-YYYY-MM-DD.md
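
The calculation and SRM steps can be sketched with the standard library alone. This assumes a two-sided two-proportion z-test for the p-value and a chi-square test with one degree of freedom for SRM; the function names and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_test(visitors_a, conv_a, visitors_b, conv_b, confidence=0.95):
    """Observed lift, two-sided p-value, and CI for the absolute rate difference."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    diff = rate_b - rate_a
    # Pooled standard error for the test (null hypothesis: no true difference).
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))
    # Unpooled standard error for the confidence interval.
    se_diff = math.sqrt(rate_a * (1 - rate_a) / visitors_a
                        + rate_b * (1 - rate_b) / visitors_b)
    margin = NormalDist().inv_cdf(1 - (1 - confidence) / 2) * se_diff
    return {
        "relative_lift": diff / rate_a,
        "p_value": p_value,
        "ci_abs_diff": (diff - margin, diff + margin),
    }


def srm_p_value(visitors_a, visitors_b, expected_share_b=0.5):
    """Chi-square test (1 degree of freedom) of observed split vs planned split."""
    total = visitors_a + visitors_b
    exp_a = total * (1 - expected_share_b)
    exp_b = total * expected_share_b
    chi2 = (visitors_a - exp_a) ** 2 / exp_a + (visitors_b - exp_b) ** 2 / exp_b
    return math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df


result = two_proportion_test(visitors_a=13_000, conv_a=390,
                             visitors_b=13_000, conv_b=468)
print(result)
print(srm_p_value(13_000, 12_000))  # very small p-value: the split looks broken
```

A very small SRM p-value (a cutoff such as 0.01 is an assumption here, not a rule from this skill) suggests the assignment mechanism is faulty, in which case the conversion results should not be trusted until the cause is found.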

Vietnam-Specific Testing Realities

Low-Traffic Challenge

Most Vietnamese SMB websites receive 1,000-10,000 monthly visitors. This makes traditional A/B testing difficult:

Problem: At 5,000 monthly visitors and a 3% baseline conversion rate, detecting a 20% relative lift requires about 13,000 visitors per variant, or 26,000 in total. That is more than 5 months of traffic, which is impractical.

Solutions for low traffic:

Increase MDE tolerance. Accept detecting only 30-50% relative lifts. Test bold changes, not subtle tweaks: "Đăng ký ngay" (Sign up now) vs "Dùng thử miễn phí 7 ngày" (Try free for 7 days) is a big change, while a red vs blue button is a tiny one.
Test on higher-traffic pages first. Homepage, pricing page, or high-traffic blog posts.
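Use before/after testing instead of a simultaneous split. Run Control for 2 weeks, then Variant for 2 weeks, and compare. Less rigorous, because anything else that changes between the two periods (campaigns, holidays, seasonality) is mixed into the result, but actionable with low traffic.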
Test in paid traffic. Run identical ad sets pointing to different landing pages. Traffic is controllable and can be scaled.
Aggregate micro-conversions. Instead of testing purchase rate (rare event), test add-to-cart, scroll depth, or time-on-page (more frequent events).
Bandit testing. Use multi-armed bandit algorithms that shift traffic toward the winning variant faster. Less statistical rigor but better for low-traffic optimization.
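
To illustrate the bandit option, here is a minimal simulation of Thompson sampling, one common multi-armed bandit algorithm; this skill does not name a specific algorithm, so treat the choice as an assumption. The conversion rates are invented for the simulation, and `thompson_sampling` is a hypothetical helper.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; returns visitors sent to each arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible conversion rate for each arm from its Beta posterior.
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:  # simulated visitor converts
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Simulated arms: control converts at 3%, variant at 10% (made-up rates)
pulls = thompson_sampling(true_rates=[0.03, 0.10], rounds=5_000)
print(pulls)  # most of the 5,000 visitors end up on the second arm
```

Because traffic shifts while the test runs, the final counts are unbalanced, which is why this approach trades statistical rigor for faster optimization.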

Free and Low-Cost Testing Tools for Vietnam

Tool	Cost	Best For
Google Analytics 4	Free	Measuring tests run by another tool (GA4 has no built-in experiment runner); compare variants via GA4 audiences
PostHog	Free tier (1M events)	Full-featured, self-hosted option
Microsoft Clarity	Free	Heatmaps + session recordings (not A/B, but insight tool)
Splitbee / Vercel Analytics	Free tier	Simple split testing for developers
Manual UTM split	Free	Two landing pages, split traffic via UTM parameters
Facebook Ads A/B	Free (within ad spend)	Creative and audience testing

Vietnamese User Behavior in Tests

Mobile dominance: 85%+ of Vietnamese web traffic is mobile. Always test mobile-first. A desktop-only test misses most of the audience.
Scroll behavior: Vietnamese users scroll aggressively — below-fold content gets seen. Don't assume "above the fold" is the only battleground.
Trust signals move needles: Adding "Cam kết hoàn tiền" (money-back guarantee), a hotline phone number (SĐT hotline), or a Zalo button often produces larger lifts than copy changes.
Price sensitivity: Any test involving pricing or discounts will produce outsized effects. Be careful: discount tests often win on conversion rate but destroy margins, so track revenue or margin per visitor as a secondary metric.

Output Specification

Format: Markdown with hypothesis, parameters, timeline, and variant descriptions
Location: assets/reports/
Naming: ab-test-[name]-YYYY-MM-DD.md, test-backlog-[page]-YYYY-MM-DD.md, ab-test-results-[name]-YYYY-MM-DD.md
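
The naming convention can be generated rather than typed by hand. A small sketch follows, assuming the bracketed name is lowercased and hyphenated (the convention above does not say how names are normalized); `report_path` is a hypothetical helper.

```python
import re
from datetime import date
from pathlib import PurePosixPath

PREFIXES = ("ab-test", "test-backlog", "ab-test-results")


def report_path(prefix, name, on=None):
    """Build assets/reports/<prefix>-<name>-YYYY-MM-DD.md for one of the three report types."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown report prefix: {prefix}")
    # Assumption: names are slugified to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    day = (on or date.today()).isoformat()  # YYYY-MM-DD
    return PurePosixPath("assets/reports") / f"{prefix}-{slug}-{day}.md"


print(report_path("ab-test", "Pricing Page CTA", on=date(2025, 1, 15)))
# assets/reports/ab-test-pricing-page-cta-2025-01-15.md
```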

Quality Checklist

Hypothesis written in structured template format
Sample size calculated with stated parameters (baseline, MDE, α, power)
Feasibility checked against actual traffic volume
Low-traffic strategy applied if standard test is infeasible
Only one variable changed per variant (unless explicit MVT)
Primary metric defined (one only)
Test duration includes full weekly cycle (min 7 days)
Mobile-first consideration for Vietnamese audience
Output saved to assets/reports/ with correct naming

Related Skills

form-cro — Test ideas for form optimization
marketing-psychology — Behavioral triggers to test as variants
analytics — Tracking setup for experiment measurement
copywriting — Writing variant copy for headlines and CTAs