A/B testing generative output
How to run valid experiments when every response is different.
A/B testing generative AI is harder than testing UI changes. The output distribution is high-variance; the "correct answer" is fuzzy. Here's how to do it honestly.
What you can and can't measure
You can measure:
- User behavior differences (click-through, completion, satisfaction).
- Cost and latency between variants.
- Refusal / error rates.
You can't measure easily:
- Whether variant A is "smarter" than variant B without large samples.
- Tail behavior (the 1% of cases where one variant is much better).
- Long-term effects on user trust or retention from a short test.
Design tests around what's measurable. Infer the rest with appropriate caveats.
The setup
Classic A/B:
- Split traffic 50/50 (or 10/90 for cautious rollouts).
- Make assignment sticky per user: the same user gets the same variant across sessions.
- Log variant + outcome per request.
- Monitor guardrails: error rate, cost per request, latency.
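One common way to get sticky assignment without storing anything per user is to hash the user ID into a bucket. The sketch below is a minimal illustration, not a prescribed implementation; the function name `assign_variant`, the experiment name, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment)."""
    # Salt with the experiment name so separate experiments split users independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Read the first 8 hex characters as an integer in [0, 2**32) and scale to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "B" if bucket < treatment_share else "A"

# The same user always lands in the same variant for a given experiment.
variant = assign_variant("user-123", "prompt-v2-test")
# A cautious 10/90 rollout only changes the share.
cautious = assign_variant("user-123", "prompt-v2-test", treatment_share=0.10)
```

Because the result depends only on the inputs, any server can compute the assignment and log it alongside the outcome.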
Sample size for generative outputs
Because variance is high, you need more data than you'd think.
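Rough rule: plan for about 10x the sample size of a typical UI test to reach the same confidence. If a button-color test needed 2,000 users per variant, an AI output change might need 20,000+. Treat this as a starting heuristic; the real number depends on your metric's variance and the size of the effect you want to detect.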
If you don't have that traffic: extend the test duration or accept wider confidence intervals.
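To see how quickly the required sample grows as the expected effect shrinks, a standard two-proportion power calculation is a useful sanity check. This is a sketch under stated assumptions: the metric is binary (for example, task completion), the test is two-sided, and the rates passed in are hypothetical planning numbers rather than figures from this lesson.

```python
import math
from statistics import NormalDist

def users_per_variant(p_base: float, p_variant: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect p_base vs p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2)

# Hypothetical planning inputs: 30% baseline completion, hoping for 33% or 31.5%.
n_large_effect = users_per_variant(0.30, 0.33)
n_small_effect = users_per_variant(0.30, 0.315)
```

Halving the effect you want to detect roughly quadruples the required sample, which is why small generative-quality gains need so much traffic.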
The metrics that work
Ordered by how well they work for generative AI:
- Completion / conversion rate. Did the user finish the task? Strong signal; insensitive to output variation.
- Explicit feedback (thumbs, ratings). Good if you capture enough.
- Follow-up behavior. Did the user ask again (implying the first answer failed)?
- Time to outcome. Lower is usually better; high time can indicate confusion.
- Edit distance. For drafting tools, how much did the user edit the AI draft?
- Judge-LLM ratings. Useful when user signals are sparse, but prone to bias, so use them carefully.
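For the edit-based metric, one possible proxy is the share of the AI draft that the user changed before accepting it. The sketch below uses Python's `difflib.SequenceMatcher` on word lists; it is a similarity-based approximation rather than a true Levenshtein distance, and the function name `edit_fraction` is made up for this example.

```python
from difflib import SequenceMatcher

def edit_fraction(ai_draft: str, final_text: str) -> float:
    """Return a value in [0, 1]: 0 means the draft was kept as-is, 1 means nothing survived."""
    if not ai_draft and not final_text:
        return 0.0
    # ratio() is a similarity score in [0, 1]; comparing words is less noisy than characters.
    matcher = SequenceMatcher(None, ai_draft.split(), final_text.split(), autojunk=False)
    return 1.0 - matcher.ratio()

unchanged = edit_fraction("Thanks for reaching out.", "Thanks for reaching out.")
rewritten = edit_fraction("Thanks for reaching out.", "Hello there, friend!")
```

Averaging this per variant gives a single number to compare, though heavy editing can also mean an engaged user, so read it next to completion rate.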
Guardrails: the must-not-regress list
Before ramping a variant to 50/50, set limits on:
- Error rate: variant can't error more than baseline + 1pp.
- Cost per request: can't be more than baseline × 1.5.
- Latency p95: can't exceed baseline × 1.3.
- Toxicity or refusal rates: can't degrade materially.
If the variant trips a guardrail, roll back automatically. Don't wait for the scheduled review.
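The numeric guardrails above can be expressed as a small check that runs on every metrics refresh. This is a minimal sketch: the `Metrics` dataclass and `tripped_guardrails` function are hypothetical names, and toxicity and refusal rates are left out because the text gives no numeric threshold for them.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float        # fraction of requests, e.g. 0.02 for 2%
    cost_per_request: float  # any currency unit, as long as both sides match
    latency_p95_ms: float

def tripped_guardrails(baseline: Metrics, variant: Metrics) -> list[str]:
    """Return the names of guardrails the variant has violated."""
    tripped = []
    if variant.error_rate > baseline.error_rate + 0.01:            # baseline + 1pp
        tripped.append("error_rate")
    if variant.cost_per_request > baseline.cost_per_request * 1.5:
        tripped.append("cost_per_request")
    if variant.latency_p95_ms > baseline.latency_p95_ms * 1.3:
        tripped.append("latency_p95")
    return tripped

# Hypothetical numbers: the variant is within every limit, so the list is empty.
baseline = Metrics(error_rate=0.02, cost_per_request=0.010, latency_p95_ms=1000)
variant = Metrics(error_rate=0.025, cost_per_request=0.012, latency_p95_ms=1200)
violations = tripped_guardrails(baseline, variant)
```

A non-empty result would be the trigger for the automatic rollback, wired into whatever deployment tooling you already use.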
The "small sample, big bet" anti-pattern
You roll out variant B to 2% of users for a week, see a +3% improvement in some metric, and ship it to 100%. Then it silently underperforms by 5%.
What happened: 2% × 1 week is a tiny sample. The +3% was noise. Production showed the real picture.
Don't make permanent changes from small-sample tests. Use longer windows or larger samples when the decision is long-lasting.
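A quick way to check whether an observed lift could be noise is to put a confidence interval around the difference. The sketch below uses the normal approximation for two proportions; the counts in the usage example are hypothetical, and the lift is treated as 3 percentage points on a binary metric, which is an assumption since the text does not say how the +3% was measured.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Interval for (rate_B - rate_A) using the normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    std_err = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * std_err, diff + z * std_err

# Hypothetical 2% rollout: 400 users on B at 43%, 19,600 users on A at 40%.
low, high = diff_confidence_interval(conv_a=7840, n_a=19600, conv_b=172, n_b=400)
looks_like_noise = low < 0 < high  # the interval spans zero
```

When the interval includes zero, the data cannot distinguish a real improvement from no change or a regression.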
Qualitative eval alongside quantitative
Numbers tell you whether a variant wins; qualitative review tells you why. During every A/B test, look at:
- 50 random outputs from each variant.
- All outputs where users gave thumbs-down.
- Outlier outputs (unusually long, unusually short, error cases).
Often the numbers say "variant B is better" but qualitative shows variant B is subtly more generic. Or vice versa.
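Pulling the review set from logs can be automated so it happens on every test. This is a rough sketch that assumes each log record is a dict with the hypothetical keys `variant`, `output`, `thumbs_down`, and `error`; the outlier rule (more than two standard deviations from the variant's mean length) is one arbitrary choice among many.

```python
import random
from statistics import mean, pstdev

def build_review_set(records: list[dict], sample_size: int = 50, seed: int = 0) -> dict:
    """Group outputs to read by hand: random sample, thumbs-down, and outliers per variant."""
    rng = random.Random(seed)  # fixed seed so reviewers see a reproducible sample
    review = {}
    for variant in sorted({r["variant"] for r in records}):
        rows = [r for r in records if r["variant"] == variant]
        lengths = [len(r["output"]) for r in rows]
        mu, sigma = mean(lengths), pstdev(lengths)
        review[variant] = {
            "random": rng.sample(rows, min(sample_size, len(rows))),
            "thumbs_down": [r for r in rows if r["thumbs_down"]],
            # Outliers: error cases, or unusually long or short outputs.
            "outliers": [r for r in rows
                         if r["error"] or (sigma > 0 and abs(len(r["output"]) - mu) > 2 * sigma)],
        }
    return review
```

Reading the three groups side by side for each variant is usually where "more generic" or "more verbose" patterns show up.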
Multi-armed bandits
For ongoing optimization of high-traffic endpoints, bandits beat A/B:
- Allocate more traffic to variants that are winning.
- Explore new variants periodically.
- Automatically retire underperformers.
Tools (Uplift, Optimizely, in-house) handle the math. Worth adopting when you have 5+ variants to evaluate continuously.
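To make the allocation idea concrete, here is a minimal Thompson sampling sketch for a binary success metric. It is one well-known bandit strategy, not necessarily what any of the tools above implement; the class name and the uniform Beta(1, 1) prior are assumptions for the example. It shifts traffic toward winners on its own, but explicit retirement and periodic introduction of new variants are not shown.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of variants."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        # [successes + 1, failures + 1]: a uniform Beta(1, 1) prior per variant.
        self.params = {v: [1, 1] for v in variants}

    def choose(self):
        # Draw a plausible success rate for each variant and serve the best draw.
        draws = {v: self.rng.betavariate(a, b) for v, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, variant, success):
        # Index 0 counts successes, index 1 counts failures.
        self.params[variant][0 if success else 1] += 1

sampler = ThompsonSampler(["prompt_a", "prompt_b", "prompt_c"], seed=42)
variant = sampler.choose()             # serve this variant to the next request
sampler.update(variant, success=True)  # record whether the user completed the task
```

Variants with little data produce wide draws and still get occasional traffic, which is how exploration happens without a separate schedule.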
What to avoid
- Testing everything at once. If you change prompt + model + retrieval simultaneously, you can't attribute wins to any one change.
- Measuring for a day and calling it done. Day-of-week effects, caching effects, and novelty effects all distort short tests. Run for at least one full week, usually two.
- Running tests without pre-registered metrics. Post-hoc metric hunting finds false positives.