A/B testing generative output

A/B testing generative AI is harder than testing UI changes. The output distribution is high-variance; the "correct answer" is fuzzy. Here's how to do it honestly.

What you can and can't measure

You can measure:

User behavior differences (click-through, completion, satisfaction).
Cost and latency between variants.
Refusal / error rates.

You can't easily measure:

Whether variant A is "smarter" than variant B without large samples.
Tail behavior (the 1% of cases where one variant is much better).
Long-term effects on user trust or retention from a short test.

Design tests around what's measurable. Infer the rest with appropriate caveats.

The setup

Classic A/B:

Split traffic 50/50 (or 10/90 for cautious rollouts).
Assignment sticky per user — same user gets same variant across sessions.
Log variant + outcome per request.
Monitor guardrails: error rate, cost per request, latency.
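
One common way to get sticky per-user assignment is to hash the user ID into a bucket rather than store the assignment. A minimal sketch, assuming string user IDs; the function name, the experiment-name salt, and the "A"/"B" labels are illustrative choices, not part of any particular framework:

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to "A" (control) or "B" (treatment)."""
    # Salt with the experiment name so separate experiments bucket independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < treatment_share else "A"
```

Calling assign_variant("user-123", "prompt-v2", treatment_share=0.1) gives the cautious 10/90 split, and the same user always lands in the same variant because the hash is deterministic.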

Sample size for generative outputs

Because variance is high, you need more data than you'd think.

Rough rule: 10x the sample size of a typical UI test for the same confidence. If a button-color test needed 2,000 users per variant, an AI output change might need 20,000+.

If you don't have that traffic: extend the test duration or accept wider confidence intervals.
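
For a binary metric such as completion rate, the required sample can be estimated with the standard two-proportion normal approximation. The sketch below assumes a two-sided test at 5% significance and 80% power; the function name and defaults are illustrative, and the formula says nothing about noisier continuous metrics, where the rough 10x rule is the safer starting point.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(p_base: float, p_variant: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect p_base -> p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    # Sum of the two Bernoulli variances.
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2)
```

Under these assumptions, detecting a move from a 10% to a 12% completion rate needs a little under 4,000 users per variant, and halving the effect size roughly quadruples that.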

The metrics that work

Ordered by how well they work for generative AI:

Completion / conversion rate. Did the user finish the task? Strong signal, and robust to surface-level variation in the output.
Explicit feedback (thumbs, ratings). Good if you capture enough.
Follow-up behavior. Did the user ask again (implying the first answer failed)?
Time to outcome. Lower is usually better; high time can indicate confusion.
Edit distance. For drafting tools, how much did the user change the AI draft before using it?
Judge-LLM ratings. Useful when user signals are sparse, but prone to bias, so spot-check them against human judgment.
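
The edit-distance metric can be approximated with the standard library. This sketch uses difflib's similarity ratio rather than a true Levenshtein distance, which is an assumption made to avoid extra dependencies; the function name is illustrative.

```python
from difflib import SequenceMatcher


def edit_fraction(draft: str, final: str) -> float:
    """Return 0.0 if the user kept the draft verbatim, 1.0 if nothing was kept."""
    if not draft and not final:
        return 0.0
    # ratio() is a similarity in [0, 1]; invert it to get "how much changed".
    return 1.0 - SequenceMatcher(None, draft, final, autojunk=False).ratio()
```

Averaging this value per variant gives a rough measure of how much rework each variant's drafts needed.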

Guardrails: the must-not-regress list

Before ramping a variant to 50/50, set limits on:

Error rate: variant can't error more than baseline + 1pp.
Cost per request: can't be more than baseline × 1.5.
Latency p95: can't exceed baseline × 1.3.
Toxicity or refusal rates: can't rise materially above baseline.

If the variant trips a guardrail, roll it back automatically. Don't wait for the scheduled review.
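
The numeric limits above translate directly into a check that a monitoring job could run. A minimal sketch: the Stats fields and function name are hypothetical, and the toxicity/refusal limit is left out because the text gives no number for it.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    error_rate: float        # fraction of requests that errored, e.g. 0.02
    cost_per_request: float  # any currency unit, same for both variants
    latency_p95_ms: float    # 95th-percentile latency in milliseconds


def tripped_guardrails(baseline: Stats, variant: Stats) -> list[str]:
    """Return the names of the guardrails the variant violates."""
    tripped = []
    if variant.error_rate > baseline.error_rate + 0.01:  # baseline + 1pp
        tripped.append("error_rate")
    if variant.cost_per_request > baseline.cost_per_request * 1.5:
        tripped.append("cost_per_request")
    if variant.latency_p95_ms > baseline.latency_p95_ms * 1.3:
        tripped.append("latency_p95")
    return tripped
```

A non-empty result would be the trigger for the automatic rollback; how the rollback itself happens depends on your deployment tooling.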

The "small sample, big bet" anti-pattern

You roll out variant B to 2% of users for a week, see a +3% improvement in some metric, and ship it to 100%. Then it silently underperforms by 5%.

What happened: 2% × 1 week is a tiny sample. The +3% was noise. Production showed the real picture.

Don't make permanent changes from small-sample tests. Use longer windows or larger samples when the decision is long-lasting.
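
A confidence interval on the difference makes the noise visible before you ship. The sketch below uses the normal approximation for two proportions; the function name is illustrative and the traffic numbers in the usage note are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def lift_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                  confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for (rate_B - rate_A), in absolute terms."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se
```

With a hypothetical 49,000 control users converting at 40.0% and 1,000 variant users converting at 41.2% (a +3% relative lift), lift_interval(19600, 49000, 412, 1000) spans roughly -1.9 to +4.3 percentage points. The interval includes zero and a sizable regression, so the data cannot distinguish a win from a loss.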

Qualitative eval alongside quantitative

Numbers tell you whether; qualitative review tells you why. During every A/B test, look at:

50 random outputs from each variant.
All outputs where users gave thumbs-down.
Outlier outputs (unusually long, unusually short, error cases).

Often the numbers say "variant B is better" but qualitative shows variant B is subtly more generic. Or vice versa.
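
Pulling the review set can be automated so it happens on every test. A minimal sketch, assuming each logged record is a dict with hypothetical "variant" and "feedback" keys; outlier selection is left out because it depends on what counts as unusual for your outputs.

```python
import random


def review_sample(records: list[dict], per_variant: int = 50, seed: int = 0) -> list[dict]:
    """Random outputs from each variant, plus every thumbs-down output."""
    rng = random.Random(seed)  # fixed seed so reviewers see a reproducible set
    picked = []
    for variant in sorted({r["variant"] for r in records}):
        pool = [r for r in records if r["variant"] == variant]
        picked.extend(rng.sample(pool, min(per_variant, len(pool))))
    already = {id(r) for r in picked}
    # Add all thumbs-down records that the random draw did not already include.
    picked.extend(r for r in records
                  if r.get("feedback") == "down" and id(r) not in already)
    return picked
```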

Multi-armed bandits

For ongoing optimization of high-traffic endpoints, bandits beat A/B:

Allocate more traffic to variants that are winning.
Explore new variants periodically.
Automatically retire underperformers.

Tools (Uplift, Optimizely, in-house) handle the math. Worth adopting when you have 5+ variants to evaluate continuously.
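
To show what "the math" looks like, here is a sketch of Thompson sampling, one common bandit algorithm, for a binary success metric. It is an illustration, not a description of how any of the tools above work internally, and the class and method names are made up for the example.

```python
import random


class ThompsonBandit:
    """Beta-Bernoulli Thompson sampling over a fixed set of variants."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        self.wins = {v: 0 for v in variants}
        self.losses = {v: 0 for v in variants}

    def choose(self) -> str:
        # Draw a plausible success rate for each variant from its Beta
        # posterior (uniform prior), then serve the variant with the best draw.
        draws = {v: self.rng.betavariate(self.wins[v] + 1, self.losses[v] + 1)
                 for v in self.wins}
        return max(draws, key=draws.get)

    def record(self, variant: str, success: bool) -> None:
        if success:
            self.wins[variant] += 1
        else:
            self.losses[variant] += 1
```

Variants with little data have wide posteriors and still get picked occasionally, which provides the exploration; variants that keep losing are chosen less and less often, which is a soft form of retirement.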

What to avoid

Testing everything at once. If you change prompt + model + retrieval simultaneously, you can't attribute wins to any one change.
Measuring for a day and calling it. Day-of-week, caching, and novelty effects all distort short windows. Run for at least one full week, usually two.
Running tests without pre-registered metrics. Post-hoc metric hunting finds false positives.
