Evaluating agents (this is hard)

Agent eval is genuinely harder than LLM eval. The output isn't a string — it's a trajectory. Here's how teams actually measure whether their agents work.

What changes vs. single-LLM eval

For a single-call LLM, you have:

Input → output → eval.

For an agent:

Input → (many tool calls, reasoning, branches) → final output → eval.

Two things to evaluate now: the trajectory (was the path reasonable?) and the outcome (did it produce the right answer?).

The two-lane approach

Production teams split eval into:

Deterministic assertions on the trajectory. Did the agent call the right tool with the right arguments? Did it avoid calling tools it shouldn't? These checks run against a recorded trajectory, so you can test them without re-running the full agent: just assert on trajectory patterns.
LLM-judge on the final output. Did the answer solve the task? Graded against a rubric by a judge model.

A weighted score combines both lanes, but hard constraints act as gates rather than as terms in the average: a run that got the right answer via a disallowed tool still fails.
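The two lanes can be sketched as follows. This is a minimal illustration, not a reference implementation: the `ToolCall` record, the `check_trajectory` and `score_run` helpers, the 0.7 judge weight, and the idea that the judge score arrives as a float between 0 and 1 are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded step of a trajectory (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def check_trajectory(calls, required, disallowed):
    """Deterministic lane: return a list of violations (empty means pass)."""
    violations = []
    for name, expected_args in required.items():
        matches = [c for c in calls if c.name == name]
        if not matches:
            violations.append(f"missing required tool: {name}")
        elif not any(
            all(c.args.get(k) == v for k, v in expected_args.items())
            for c in matches
        ):
            violations.append(f"wrong arguments for tool: {name}")
    for call in calls:
        if call.name in disallowed:
            violations.append(f"disallowed tool called: {call.name}")
    return violations


def score_run(calls, judge_score, required, disallowed, judge_weight=0.7):
    """Combine both lanes; a disallowed tool is a hard fail, not a penalty."""
    if any(c.name in disallowed for c in calls):
        return 0.0
    trajectory_score = 0.0 if check_trajectory(calls, required, disallowed) else 1.0
    return judge_weight * judge_score + (1 - judge_weight) * trajectory_score


# Example: the judge liked the answer, but the run touched a forbidden tool.
required = {"search": {"q": "refund policy"}}
disallowed = {"delete_record"}
run = [ToolCall("search", {"q": "refund policy"}), ToolCall("delete_record", {"id": 7})]
print(check_trajectory(run, required, disallowed))
print(score_run(run, judge_score=1.0, required=required, disallowed=disallowed))
```

Because `check_trajectory` only reads a recorded list of calls, it can run in ordinary unit tests against stored traces, with no model in the loop.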

The "ringers" pattern

A frozen set of 20-50 canonical tasks that previously broke production. These never change, and any agent change must pass all of them. For production agent teams, this is arguably the single most valuable practice.
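One way to enforce "these never change" is to pin a hash of the ringer set and refuse to run if it drifts. The sketch below assumes tasks are plain dicts with an `id` key and that `results` maps task id to pass/fail; `fingerprint` and `ringer_gate` are hypothetical names for illustration.

```python
import hashlib
import json


def fingerprint(tasks):
    """Stable hash of the ringer set, used to detect silent edits."""
    canonical = json.dumps(tasks, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def ringer_gate(tasks, pinned_fingerprint, results):
    """Return (passed, failed_ids). Every ringer must pass; there is no threshold."""
    if fingerprint(tasks) != pinned_fingerprint:
        raise ValueError("ringer set changed; it is supposed to be frozen")
    # A ringer with no recorded result counts as a failure.
    failed = [t["id"] for t in tasks if not results.get(t["id"], False)]
    return (not failed, failed)


# Example: pin once, then check on every agent change.
tasks = [
    {"id": "r1", "input": "cancel an order that already shipped"},
    {"id": "r2", "input": "refund request with missing order id"},
]
pinned = fingerprint(tasks)  # in practice, stored in version control
print(ringer_gate(tasks, pinned, {"r1": True, "r2": False}))
```

The gate is deliberately all-or-nothing: a ringer exists because it once broke production, so a partial pass rate is not meaningful here.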

Eval at multiple levels

Unit: specific tool calls and their outputs.
Sub-trajectory: a single loop iteration — did the model's reasoning + next action make sense?
Full run: input → final output.
Aggregate: pass rate, cost, latency over a test set.

Different changes affect different levels. A prompt tweak might improve sub-trajectory quality without changing aggregate pass rate (because other issues dominate). Measure at all levels.
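A small illustration of why the levels can disagree. The record shape here (`step_ok` as per-iteration verdicts, `final_ok` as the full-run verdict) and the `level_report` helper are assumptions for the example, and the data is made up to show the effect.

```python
def level_report(runs):
    """Report pass rates at the sub-trajectory level and the full-run level."""
    steps = [ok for run in runs for ok in run["step_ok"]]
    return {
        "step_pass_rate": sum(steps) / len(steps),
        "run_pass_rate": sum(run["final_ok"] for run in runs) / len(runs),
    }


# Illustrative data: after a prompt tweak, one more step is judged sensible,
# but the first run still fails because a later step is still wrong.
before = [
    {"step_ok": [True, False, False], "final_ok": False},
    {"step_ok": [True, True], "final_ok": True},
]
after = [
    {"step_ok": [True, True, False], "final_ok": False},
    {"step_ok": [True, True], "final_ok": True},
]
print(level_report(before))
print(level_report(after))
```

Looking only at `run_pass_rate`, the tweak appears to do nothing; the step-level number is what shows it helped.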

The cost/quality tension

An agent run can cost 5-50x what a single LLM call costs, so eval runs get expensive quickly.

Tactics:

Sample smartly. Don't run every test on every change. Run critical paths always, full eval nightly.
Cache partial runs — if a change only affects the planner, skip tests where the planner was already correct.
Use a smaller model for exploratory runs. Validate the eval setup itself on a cheaper model before running the full eval on the frontier model.
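The first tactic can be sketched as a test selector. Everything here is an assumption for illustration: the `critical` flag on each test, the `mode` names, the 20% sample rate, and the `select_tests` helper itself. Sampling is keyed on a hash of the change id and test id so that re-running the same change picks the same tests.

```python
import hashlib


def select_tests(tests, change_id, mode="per_change", sample_rate=0.2):
    """Critical tests always run; the rest are sampled per change, all run nightly."""
    if mode == "nightly":
        return list(tests)
    selected = []
    for test in tests:
        if test["critical"]:
            selected.append(test)
            continue
        # Map (change, test) to a stable number in [0, 1).
        digest = hashlib.sha256(f"{change_id}:{test['id']}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        if bucket < sample_rate:
            selected.append(test)
    return selected


tests = [{"id": f"t{i}", "critical": i < 3} for i in range(20)]
print([t["id"] for t in select_tests(tests, change_id="change-123")])
print(len(select_tests(tests, change_id="change-123", mode="nightly")))
```

Different changes sample different non-critical tests, so coverage accumulates across changes even before the nightly run.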

Human eval, still

Judge models are improving, but on agent trajectories they still miss things a human would catch ("agent did the right thing but in a creepy way"). Keep a human in the loop for:

Major changes (new tool, new architecture).
Monthly spot-checks on random trajectories.
Post-incident review.

What agent-specific signals look like

Track:

Task success rate (did the agent solve it).
Steps-to-solve (more steps = more fragility).
Tool-call error rate (failed tool calls that the agent had to recover from).
Replanning rate (how often the plan had to change mid-execution).
Time to solve / cost to solve.

An agent that solves tasks correctly but takes 40 steps each time is fragile — one wrong tool call and it derails.
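The signals above can be computed from per-run records along these lines. The record fields and the `summarize` helper are hypothetical, and the exact definitions are assumptions: here the tool-call error rate is failed calls over all calls, and the replanning rate is the mean number of replans per run.

```python
from statistics import mean


def summarize(runs):
    """Aggregate agent-specific signals over a set of run records."""
    solved = [r for r in runs if r["success"]]
    total_calls = sum(r["tool_calls"] for r in runs)
    return {
        "success_rate": len(solved) / len(runs),
        # Only solved runs count toward steps-to-solve.
        "mean_steps_to_solve": mean(r["steps"] for r in solved) if solved else None,
        "tool_error_rate": (
            sum(r["tool_errors"] for r in runs) / total_calls if total_calls else 0.0
        ),
        "replans_per_run": mean(r["replans"] for r in runs),
        "mean_seconds": mean(r["seconds"] for r in runs),
        "mean_cost_usd": mean(r["cost_usd"] for r in runs),
    }


# Illustrative records, not real measurements.
runs = [
    {"success": True, "steps": 4, "tool_calls": 4, "tool_errors": 1,
     "replans": 0, "seconds": 12.0, "cost_usd": 0.10},
    {"success": False, "steps": 10, "tool_calls": 6, "tool_errors": 1,
     "replans": 2, "seconds": 30.0, "cost_usd": 0.30},
]
for name, value in summarize(runs).items():
    print(name, value)
```

Tracking steps-to-solve next to success rate is what surfaces the 40-step case: the success rate looks healthy while the step count shows how exposed each run is.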

When your eval set isn't catching regressions

Classic sign: the agent is getting worse in production but the tests still pass. Likely causes and fixes:

Your test set doesn't mirror production distribution. Pull real (anonymized) traces and build eval cases from them weekly.
Your tests only cover the happy path. Add edge cases deliberately.
You've overfit to the test set. Hold out 20% of the cases and don't use them during development.
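If new eval cases arrive weekly from production traces, a random re-split each time would leak held-out cases into development. One option, sketched here under the assumption that every case has a stable string id, is to assign cases by hashing the id. The `is_holdout` and `split` helpers are hypothetical names.

```python
import hashlib


def is_holdout(case_id, holdout_fraction=0.2):
    """Assign a case to the holdout set based only on its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < holdout_fraction


def split(case_ids, holdout_fraction=0.2):
    """Return (dev, holdout); the holdout share is approximately holdout_fraction."""
    dev = [c for c in case_ids if not is_holdout(c, holdout_fraction)]
    holdout = [c for c in case_ids if is_holdout(c, holdout_fraction)]
    return dev, holdout


case_ids = [f"case-{i}" for i in range(1000)]
dev, holdout = split(case_ids)
print(len(dev), len(holdout))
```

Since membership depends only on the id, adding next week's cases never moves an existing case from holdout into development or back.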

