Evaluating AI output: the trust gap
How to know if a model's answer is actually good, beyond vibes.
"Eval" is the most neglected word in AI engineering. Most teams judge model quality by vibes. Here's how to do it properly — and why it matters more than picking the right model.
The three layers of eval
- Offline eval. A curated set of inputs with known-good outputs, run on every change. Catches regressions.
- Online eval. Live traffic signals — user thumbs, conversion rates, follow-up prompts — attributed to the model version.
- Adversarial eval. Deliberate attempts to break the system — prompt injection, edge cases, known failure patterns.
You need all three. Most teams have zero or one and wonder why production feels unreliable.
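As an illustration of the online layer, here is a minimal sketch of attributing thumbs feedback to a model version. The event shape (dicts with `model_version` and `thumbs_up` keys) and the sample data are assumptions made for this example; real events would come from whatever telemetry you already collect.

```python
from collections import defaultdict


def thumbs_up_rate_by_version(events):
    """Return {model_version: fraction of thumbs-up} from feedback events."""
    counts = defaultdict(lambda: [0, 0])  # version -> [thumbs_up, total]
    for event in events:
        tally = counts[event["model_version"]]
        if event["thumbs_up"]:
            tally[0] += 1
        tally[1] += 1
    return {version: up / total for version, (up, total) in counts.items()}


# Hypothetical feedback log, invented for illustration only.
events = [
    {"model_version": "v1", "thumbs_up": True},
    {"model_version": "v1", "thumbs_up": False},
    {"model_version": "v2", "thumbs_up": True},
    {"model_version": "v2", "thumbs_up": True},
]
print(thumbs_up_rate_by_version(events))  # {'v1': 0.5, 'v2': 1.0}
```

The point is the grouping key: a thumbs-down only becomes an eval signal once it is tied to the version that produced the answer.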
Deterministic vs. judge-based eval
Deterministic means a plain function, with no model in the loop, returns true or false. "Did the output contain a valid JSON blob with these fields?" is deterministic. "Did the summary include the key point?" is not, because answering it takes judgment.
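The JSON question above can be sketched as a single function. This version assumes the whole output is the JSON object (it does not search for JSON embedded in surrounding text), and the function name and field names are illustrative.

```python
import json


def has_valid_json_with_fields(output: str, required_fields: set) -> bool:
    """Deterministic check: output parses as a JSON object with every required field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    # A bare list, number, or null is valid JSON but not the object we expect.
    if not isinstance(parsed, dict):
        return False
    return all(field in parsed for field in required_fields)


print(has_valid_json_with_fields('{"title": "Q3", "score": 3}', {"title", "score"}))  # True
print(has_valid_json_with_fields('{"title": "Q3"}', {"title", "score"}))  # False
print(has_valid_json_with_fields("Sure! Here is your JSON", {"title"}))  # False
```

The same input always gives the same verdict, which is what makes this kind of check cheap to run on every change.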
Judge-based eval uses another LLM (often a stronger one) to grade the output against a rubric. It is cheaper than human eval but more subjective than a deterministic check. Most modern eval harnesses combine both.
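One way to combine the two, sketched under assumptions: run the cheap deterministic check first and call the judge only if it passes. `call_judge` is a hypothetical callable standing in for your model client, the rubric text is illustrative, and the stub below exists only so the sketch runs offline.

```python
from typing import Callable

# Illustrative rubric; a real one would be specific to your task.
RUBRIC = "PASS if the summary states the key point; otherwise FAIL."


def judge_passes(output: str, key_point: str, call_judge: Callable[[str], str]) -> bool:
    """Ask a judge model for a PASS/FAIL verdict against the rubric."""
    prompt = (
        f"Rubric: {RUBRIC}\n"
        f"Key point: {key_point}\n"
        f"Summary to grade: {output}\n"
        "Answer with exactly PASS or FAIL."
    )
    return call_judge(prompt).strip().upper() == "PASS"


def evaluate(output, key_point, call_judge, deterministic_check) -> bool:
    """Deterministic gate first; the judge is only consulted if the gate passes."""
    if not deterministic_check(output):
        return False
    return judge_passes(output, key_point, call_judge)


def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call (assumption, not a real API).
    return " pass\n"


def not_empty(text: str) -> bool:
    return bool(text.strip())


print(evaluate("Revenue grew 12%.", "Revenue grew", stub_judge, not_empty))  # True
print(evaluate("   ", "Revenue grew", stub_judge, not_empty))  # False
```

Ordering the checks this way keeps judge calls, which cost money and add noise, off outputs that already failed a hard rule.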
The regression set that saves you
The single most valuable thing you can build: 20-40 labeled examples of exactly the cases where your system has failed in production. Never delete them. Run them on every change. Call them "the ringers."
If a new prompt or new model scores worse on the ringers, it doesn't ship. This one practice has prevented more production incidents than any other we've seen.
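The gate itself can be sketched in a few lines. Here each ringer is assumed to be a dict with an `input` and a `check` function, and `candidate` and `baseline` are callables that map an input to an output; these names and the toy data are made up for illustration.

```python
def pass_rate(system, cases) -> float:
    """Fraction of cases whose check accepts the system's output."""
    passed = sum(1 for case in cases if case["check"](system(case["input"])))
    return passed / len(cases)


def may_ship(candidate, baseline, ringers) -> bool:
    """Hard gate: the candidate must not score worse than the baseline on the ringers."""
    return pass_rate(candidate, ringers) >= pass_rate(baseline, ringers)


# Toy ringers and toy systems, invented for this example.
ringers = [
    {"input": "What is the refund policy?", "check": lambda out: "30 days" in out},
    {"input": "Cancel my order", "check": lambda out: "order" in out.lower()},
]


def baseline(text: str) -> str:
    return "Refunds are accepted within 30 days. Order help is available."


def candidate(text: str) -> str:
    return "Please contact support."


print(may_ship(candidate, baseline, ringers))  # False
print(may_ship(baseline, baseline, ringers))  # True
```

In CI, a `False` result would fail the build rather than print.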
LLM-as-judge caveats
Using an LLM as a judge, for example GPT-4o grading GPT-4o's output, has known biases:
- Same-family bias: models rate outputs from their own family higher.
- Length bias: longer answers rated higher, quality aside.
- Authority bias: confident-sounding answers rated higher, correctness aside.
Mitigations: use a different model family as judge. Give the judge a tight rubric, not "rate 1-10." Show it examples of "good" and "bad" before it grades.
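The rubric and example mitigations can be sketched as a prompt builder. The prompt wording, function name, and sample rubric are assumptions for illustration; choosing a judge from a different model family happens where you call the model and is not shown here.

```python
def build_judge_prompt(rubric_items, graded_examples, candidate_answer) -> str:
    """Assemble a judge prompt from a tight rubric plus pre-graded examples."""
    lines = ["Grade the answer against each rubric item. Reply PASS or FAIL per item."]
    lines.append("Rubric:")
    lines += [f"{i}. {item}" for i, item in enumerate(rubric_items, start=1)]
    lines.append("Graded examples:")
    for example in graded_examples:
        lines.append(f"Answer: {example['answer']}")
        lines.append(f"Verdict: {example['verdict']}")
    lines.append("Now grade this answer:")
    lines.append(f"Answer: {candidate_answer}")
    return "\n".join(lines)


# Illustrative rubric and examples; the length rule targets length bias directly.
prompt = build_judge_prompt(
    rubric_items=["States the refund window.", "Does not reward length by itself."],
    graded_examples=[
        {"answer": "Refunds are accepted within 30 days.", "verdict": "PASS"},
        {"answer": "We value every customer and their journey.", "verdict": "FAIL"},
    ],
    candidate_answer="You can return items within 30 days.",
)
print(prompt)
```

Per-item PASS or FAIL verdicts give the judge less room to drift than a single 1-10 score.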
What good eval looks like
| Signal | Source | Frequency |
|---|---|---|
| Pass rate on labeled eval set | Offline | Every PR |
| Pass rate on regression "ringers" | Offline | Every PR — hard gate |
| User thumbs up/down | Online | Live dashboard |
| Prompt injection red-team | Adversarial | Quarterly |
| Free-form user feedback | Online | Weekly triage |
The honest starting point
If you don't have eval today, build the regression set. That's it. Thirty labeled examples where the right behavior is clear. Run them before every prompt change. Everything else is a later optimization.