Evaluating AI output: the trust gap

"Eval" is the most neglected word in AI engineering. Most teams judge model quality by vibes. Here's how to do it properly — and why it matters more than picking the right model.

The three layers of eval

Offline eval. A curated set of inputs with known-good outputs, run on every change. Catches regressions.
Online eval. Live traffic signals — user thumbs, conversion rates, follow-up prompts — attributed to the model version.
Adversarial eval. Deliberate attempts to break the system — prompt injection, edge cases, known failure patterns.

You need all three. Most teams have zero or one and wonder why production feels unreliable.
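As a minimal sketch of the online layer, the snippet below attributes thumbs feedback to the model version that produced each response. The event shape (a version string paired with a boolean) and the function name are assumptions made for illustration, not part of any particular analytics tool.

```python
from collections import defaultdict


def thumbs_up_rate_by_version(events):
    """Return {model_version: share of thumbs-up} for an iterable of
    (model_version, thumbs_up) pairs. The event shape is hypothetical."""
    counts = defaultdict(lambda: [0, 0])  # version -> [thumbs_up, total]
    for version, thumbs_up in events:
        counts[version][0] += int(thumbs_up)
        counts[version][1] += 1
    return {version: ups / total for version, (ups, total) in counts.items()}


# Hypothetical feedback events logged alongside the serving model version.
events = [("v1", True), ("v1", False), ("v2", True), ("v2", True)]
rates = thumbs_up_rate_by_version(events)
```

The point of keying on the version is that a drop in the rate can be traced to a specific prompt or model change rather than to traffic as a whole.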

Deterministic vs. judge-based eval

Deterministic means a plain function returns true or false, with the same verdict on every run. "Did the output contain a valid JSON blob with these fields?" is deterministic. "Did the summary include the key point?" is not: answering it takes judgment, so no simple code check will do.
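A deterministic check of the JSON kind can be sketched in a few lines. This version assumes the whole output is the JSON object (rather than JSON embedded in surrounding text), and the field names are hypothetical placeholders.

```python
import json

# Hypothetical required fields; substitute whatever your schema expects.
REQUIRED_FIELDS = {"title", "summary"}


def has_valid_json_with_fields(output: str, required=REQUIRED_FIELDS) -> bool:
    """True only if `output` parses as a JSON object containing every required field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # A bare number or list is valid JSON but not the object we asked for.
    return isinstance(data, dict) and set(required).issubset(data)
```

Because the function has no randomness and no model call, the same output always gets the same verdict, which is what makes it safe to run on every change.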

Judge-based eval uses another LLM (often a stronger one) to grade the output against a rubric. It is cheaper than human eval but more subjective than a deterministic check. Most modern eval harnesses combine both.
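A judge-based check might look roughly like the sketch below. The judge is passed in as any callable that takes a prompt and returns text, so no specific provider API is assumed; the prompt wording, function names, and the stub judge are all illustrative.

```python
from typing import Callable


def build_judge_prompt(summary: str, key_point: str) -> str:
    """Assemble a one-item rubric prompt (the wording is an illustrative assumption)."""
    return (
        "You are grading a summary against one rubric item.\n"
        f"Rubric: the summary must state this key point: {key_point}\n"
        f"Summary: {summary}\n"
        "Reply with exactly PASS or FAIL."
    )


def judge_passes(judge: Callable[[str], str], summary: str, key_point: str) -> bool:
    """Ask the judge and treat anything other than a clean PASS as a failure."""
    reply = judge(build_judge_prompt(summary, key_point))
    return reply.strip().upper() == "PASS"


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "PASS"


verdict = judge_passes(stub_judge, "Refunds take five days.", "refund timing")
```

Treating an unparseable reply as a failure is a deliberately conservative choice: a judge that rambles should not silently pass an output.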

The regression set that saves you

The single most valuable thing you can build: 20-40 labeled examples of exactly the cases where your system has failed in production. Never delete them. Run them on every change. Call them "the ringers."

If a new prompt or new model scores worse on the ringers, it doesn't ship. This one practice has prevented more production incidents than any other we've seen.
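The gate described above can be sketched as a comparison between the candidate and whatever is currently shipping. The `Ringer` class, the two example cases, and the toy systems are hypothetical; a real set would hold your own recorded production failures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Ringer:
    """One labeled production failure: an input plus a check for the right behavior."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def ringer_pass_count(system: Callable[[str], str], ringers) -> int:
    """Count how many ringers the system handles correctly."""
    return sum(1 for r in ringers if r.check(system(r.prompt)))


def may_ship(candidate, baseline, ringers) -> bool:
    """Hard gate: the candidate must not score worse than the baseline."""
    return ringer_pass_count(candidate, ringers) >= ringer_pass_count(baseline, ringers)


# Hypothetical ringers and toy systems standing in for real model calls.
RINGERS = [
    Ringer("handles-empty-input", "", lambda out: out != ""),
    Ringer("keeps-order-id", "Order 1234 is late", lambda out: "1234" in out),
]


def baseline(prompt: str) -> str:
    return prompt or "Please provide a message."


def candidate(prompt: str) -> str:
    return prompt.replace("1234", "####")  # regression: drops the order id


ship = may_ship(candidate, baseline, RINGERS)
```

Here the candidate fails both ringers while the baseline passes both, so the gate blocks the change.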

LLM-as-judge caveats

Using an LLM as a judge, such as GPT-4o grading GPT-4o's output, has known biases:

Same-family bias: models rate outputs from their own family higher.
Length bias: longer answers rated higher, quality aside.
Authority bias: confident-sounding answers rated higher, correctness aside.

Mitigations: use a different model family as judge. Give the judge a tight rubric, not "rate 1-10." Show it examples of "good" and "bad" before it grades.
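Those three mitigations can be sketched as two small helpers: one that picks a judge from a different family than the generator, and one that builds a rubric prompt with a good and a bad example. The family labels, model names, and prompt wording are placeholders, not real model identifiers.

```python
def pick_judge(generator_family: str, available_judges: dict) -> str:
    """Return a judge model from a different family than the generator.

    `available_judges` maps family label -> model name (both hypothetical).
    """
    for family, model in available_judges.items():
        if family != generator_family:
            return model
    raise ValueError("no judge available outside the generator's family")


def build_rubric_prompt(criteria, good_example: str, bad_example: str, answer: str) -> str:
    """Build a pass/fail rubric prompt with one good and one bad example."""
    lines = ["Grade the ANSWER against every criterion below. Ignore length and tone."]
    lines += [f"- {criterion}" for criterion in criteria]
    lines += [
        f"Example of a good answer (PASS): {good_example}",
        f"Example of a bad answer (FAIL): {bad_example}",
        f"ANSWER: {answer}",
        "Reply with exactly PASS or FAIL.",
    ]
    return "\n".join(lines)


judge_model = pick_judge("family-a", {"family-a": "judge-a", "family-b": "judge-b"})
prompt = build_rubric_prompt(
    ["States the refund window in days."],
    good_example="Refunds are available for 30 days.",
    bad_example="We have a generous refund policy.",
    answer="You can get a refund within 30 days.",
)
```

A binary verdict per criterion is one way to tighten the rubric; it leaves the judge less room to reward length or confidence than an open-ended score does.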

What good eval looks like

Signal	Source	Frequency
Pass rate on labeled eval set	Offline	Every PR
Pass rate on regression "ringers"	Offline	Every PR — hard gate
User thumbs up/down	Online	Live dashboard
Prompt injection red-team	Adversarial	Quarterly
Free-form user feedback	Online	Weekly triage

The honest starting point

If you don't have eval today, build the regression set. That's it. Thirty labeled examples where the right behavior is clear. Run them before every prompt change. Everything else is a later optimization.
