Evaluating agents (this is hard)
Why agent eval is different from LLM eval, and the harness patterns that work.
Agent eval is genuinely harder than LLM eval. The output isn't a string — it's a trajectory. Here's how teams actually measure whether their agents work.
What changes vs. single-LLM eval
For a single-call LLM, you have:
- Input → output → eval.
For an agent:
- Input → (many tool calls, reasoning, branches) → final output → eval.
Two things to evaluate now: the trajectory (was the path reasonable?) and the outcome (did it produce the right answer?).
The two-lane approach
Production teams split eval into:
- Deterministic assertions on the trajectory. Did the agent call the right tool with the right arguments? Did it avoid calling tools it shouldn't? These checks can run against recorded trajectories without re-running the full agent: just assert on trajectory patterns.
- LLM-judge on the final output. Did the answer solve the task? Graded against a rubric by a judge model.
A weighted score combines both lanes, but hard constraints override it: a run that got the right answer via a disallowed tool still fails.
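The two lanes can be sketched in a few lines. This is a minimal illustration, not a reference harness: `ToolCall`, `trajectory_checks`, `score_run`, the weights, and the pass threshold are all hypothetical names and values chosen for the example, and `judge` stands in for whatever judge-model call you use (here, any function returning a score between 0 and 1).

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One recorded step of the trajectory (hypothetical shape)."""
    name: str
    args: dict


def trajectory_checks(trajectory, required, disallowed):
    """Lane 1: deterministic assertions on a recorded trajectory.

    Returns (hard_violations, fraction of required calls satisfied).
    """
    hard = [call.name for call in trajectory if call.name in disallowed]
    satisfied = sum(
        1
        for name, args in required.items()
        if any(call.name == name and call.args == args for call in trajectory)
    )
    fraction = satisfied / len(required) if required else 1.0
    return hard, fraction


def score_run(trajectory, final_output, judge, required, disallowed,
              w_trajectory=0.4, w_judge=0.6, pass_threshold=0.7):
    """Combine both lanes; a disallowed tool call fails the run outright."""
    hard, fraction = trajectory_checks(trajectory, required, disallowed)
    if hard:
        # Hard gate: skip the (expensive) judge call entirely.
        return {"passed": False, "score": 0.0, "hard_violations": hard}
    judge_score = judge(final_output)  # Lane 2: assumed to return a float in [0, 1]
    combined = w_trajectory * fraction + w_judge * judge_score
    return {"passed": combined >= pass_threshold, "score": combined,
            "hard_violations": []}


# Usage with a stubbed judge (assumption: a real judge would call a model).
trajectory = [ToolCall("search_orders", {"customer_id": "c-42"})]
result = score_run(
    trajectory,
    "Order 17 shipped on Monday.",
    judge=lambda output: 0.9,
    required={"search_orders": {"customer_id": "c-42"}},
    disallowed={"delete_order"},
)
```

Treating disallowed tools as a gate rather than a penalty keeps a strong judge score from averaging away a trajectory violation.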
The "ringers" pattern
Ringers are a frozen set of 20-50 canonical tasks that previously broke production. The set never changes, and any agent change must pass all of it. For production agent teams, this is often the single most valuable practice.
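One way to enforce "never change" and "must pass" is a gate that checks a fingerprint of the ringer set before running it. The sketch below assumes ringers are stored as JSON-serializable dicts with an `id` field, and that `run_task` is your own function returning True when the agent passes a task; both are assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(ringers):
    """Stable hash of the ringer set, so accidental edits are detectable."""
    canonical = json.dumps(ringers, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def ringer_gate(ringers, expected_fingerprint, run_task):
    """Fail loudly if the set was edited; otherwise require every ringer to pass."""
    if fingerprint(ringers) != expected_fingerprint:
        raise ValueError("ringer set changed; it is supposed to be frozen")
    failures = [task["id"] for task in ringers if not run_task(task)]
    return {"passed": not failures, "failures": failures}
```

The expected fingerprint would live somewhere that needs a deliberate, reviewed change to update, such as a constant in the CI configuration.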
Eval at multiple levels
- Unit: specific tool calls and their outputs.
- Sub-trajectory: a single loop iteration — did the model's reasoning + next action make sense?
- Full run: input → final output.
- Aggregate: pass rate, cost, latency over a test set.
Different changes affect different levels. A prompt tweak might improve sub-trajectory quality without changing aggregate pass rate (because other issues dominate). Measure at all levels.
The cost/quality tension
Agent runs cost 5-50x what single LLM calls cost, so eval runs are expensive.
Tactics:
- Sample smartly. Don't run every test on every change. Run critical paths always, full eval nightly.
- Cache partial runs. If a change only affects the planner, reuse cached results for tests where the new planner produces the same plan as a run that already passed.
- Use a smaller model for exploratory runs. Validate the framework on a cheaper model before running full eval on the frontier model.
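The first two tactics can be sketched as follows. This is one possible reading, not a prescribed design: the `critical` flag on each test, the `planner` and `executor` callables, and the dict used as a cache are all hypothetical. The cache is keyed on the test ID plus the plan, so the expensive execution is skipped only when the planner's output for that test is unchanged.

```python
import hashlib
import json


def select_tests(tests, nightly):
    """Critical-path tests always run; the full set runs only on the nightly job."""
    return [t for t in tests if nightly or t["critical"]]


def plan_key(test_id, plan):
    """Cache key from the test ID and the plan (plan must be JSON-serializable)."""
    payload = json.dumps({"test": test_id, "plan": plan}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(test, planner, executor, cache):
    """Run the planner, but reuse the downstream result if the plan is unchanged.

    Returns (result, reused), where reused is True on a cache hit.
    """
    plan = planner(test["input"])
    key = plan_key(test["id"], plan)
    if key in cache:
        return cache[key], True
    result = executor(plan)  # the expensive part: tool calls, more model calls
    cache[key] = result
    return result, False
```

This only saves money when execution is deterministic enough that replaying the same plan would give the same result; with nondeterministic tools, a cached pass can hide a flaky failure.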
Human eval, still
Judge models are improving, but on agent trajectories they still miss things a human would catch ("agent did the right thing but in a creepy way"). Keep a human in the loop for:
- Major changes (new tool, new architecture).
- Monthly spot-checks on random trajectories.
- Post-incident review.
What agent-specific signals look like
Track:
- Task success rate (did the agent solve it).
- Steps-to-solve (more steps = more fragility).
- Tool-call error rate (failed tool calls that the agent had to recover from).
- Replanning rate (how often the plan had to change mid-execution).
- Time to solve / cost to solve.
An agent that solves tasks correctly but takes 40 steps each time is fragile — one wrong tool call and it derails.
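Given per-run records, the signals above reduce to a handful of aggregates. The sketch below assumes a hypothetical `RunRecord` shape, and it makes two interpretive choices that are assumptions rather than standard definitions: steps-to-solve is averaged over successful runs only, and replanning rate is the fraction of runs with at least one replan.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One agent run (hypothetical fields; adapt to your own trace format)."""
    success: bool
    steps: int
    tool_calls: int
    tool_errors: int
    replans: int
    seconds: float
    cost_usd: float


def summarize(runs):
    """Aggregate agent-specific signals over a non-empty list of runs."""
    if not runs:
        raise ValueError("need at least one run")
    solved = [r for r in runs if r.success]
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "task_success_rate": len(solved) / len(runs),
        # Averaged over solved runs only; None if nothing was solved.
        "mean_steps_to_solve": mean(r.steps for r in solved) if solved else None,
        "tool_call_error_rate": (
            sum(r.tool_errors for r in runs) / total_calls if total_calls else 0.0
        ),
        # Fraction of runs that replanned at least once.
        "replanning_rate": sum(1 for r in runs if r.replans > 0) / len(runs),
        "mean_seconds": mean(r.seconds for r in runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
    }
```

Tracking the step count alongside the success rate makes the fragile case visible: the pass rate stays flat while steps-to-solve creeps upward.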
When your eval set isn't catching regressions
The classic sign: the agent is getting worse in production but tests still pass. Likely causes and fixes:
- Your test set doesn't mirror production distribution. Pull real (anonymized) traces and build eval cases from them weekly.
- Your tests only cover the happy path. Add edge cases deliberately.
- You've overfit to the test set. Hold out 20% of the cases and don't use them during development.
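A hold-out only helps if the split stays stable as new cases arrive each week. One way to get that, sketched here under the assumption that every eval case has a stable string ID, is to assign cases by hashing the ID rather than by random shuffling. The function names are hypothetical.

```python
import hashlib


def is_holdout(case_id, holdout_fraction=0.2):
    """Deterministically map a case ID to a bucket in [0, 1) and compare."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction


def split_cases(case_ids, holdout_fraction=0.2):
    """Return (development, holdout) lists; a case never switches sides."""
    dev = [c for c in case_ids if not is_holdout(c, holdout_fraction)]
    holdout = [c for c in case_ids if is_holdout(c, holdout_fraction)]
    return dev, holdout
```

Because assignment depends only on the ID, adding new production-derived cases never moves an existing case between the development and hold-out sets. The resulting hold-out share is approximately, not exactly, 20%.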