Evaluating agent platforms: benchmarks vs. real work

Every agent platform demos well. Evaluating them for production use means asking the right questions — most of which the marketing won't answer.

The evaluation framework

Five dimensions:

Capability fit — does it do what you need?
Reliability — how often does it succeed on your tasks?
Observability — can you debug when things go wrong?
Safety — how does it handle risky actions?
Integration — how does it fit your existing systems?

Score each 1-5. Don't adopt anything scoring below 3 in any dimension for production use.

Capability fit

Demo tests:

Build the actual tasks you want to use the platform for. Not "a test task" — your real task.
Run each task 10 times.
Record success rate.

Most platforms claim 90%+ success rates. Your tasks, on your data, often land at 60-80%. Measure the real number.

Reliability metrics

Success rate per task type. Classification of successes, partial successes, failures.
Variance. Consistent 70% success is more useful than 90% success with a 40% crash rate.
Failure modes. What does failure look like? Silent wrong answer? Loud error? No result?

Silent wrong answers are the most dangerous — the platform produces plausible output that's incorrect.

Observability

Questions to answer:

Can you see the agent's reasoning at each step?
Can you see tool calls with inputs and outputs?
Can you replay a run after the fact?
Can you diff runs that succeeded vs. failed?

Platforms that score well here let you debug. Ones that don't leave you guessing.

Safety

Irreversible actions — what guardrails?
Authentication flows — how are credentials handled?
Tool limits — what's the agent allowed to do?
Cost controls — what prevents runaway spend?

A platform without clear answers here is not production-ready.

Integration

API access. Can you call the platform programmatically?
Connectors to your tools. Are the things you need already integrated?
Export/import. Can you move data in and out?
Webhooks / events. Can other systems react to agent completion?

The pilot design

For each serious candidate:

Pick 3-5 representative tasks.
Run each 20 times across varied inputs.
Score: success rate, time per task, cost per task.
Test failure recovery: break something on purpose, see how the agent handles it.
Test edge cases: unusual inputs, adversarial ones.
Involve 2-3 real users for usability feedback.

Duration: 2-4 weeks. Output: a data-backed go/no-go.

Ignoring marketing

Ignore:

"Up to 95% accuracy" — up-to claims are the lowest bar anyone ever hits.
Cherry-picked demos. Tasks that show best case.
Customer logos without specifics on the use case.

Pay attention to:

Published failure modes. Platforms that document when they don't work are more trustworthy.
Named reference customers willing to share their actual experience.
Clear data handling terms.

Cost at scale

Platform pricing structures to examine:

Per-task pricing. Unpredictable; watch for runaway costs.
Per-seat. Predictable but can under/over-provision.
Compute-based. You pay for runtime; long tasks cost more.
Hybrid. Complex; read the fine print.

Estimate monthly cost at 2x expected volume. Platforms with cost that scales superlinearly with usage are risky.

Vendor stability

Company age and funding.
Customer base and retention signals.
API stability commitments.
Acquisition risk — would you be comfortable if a larger company bought them?

The three-way bake-off

If you're deciding between platforms:

Pick 3 finalists.
Run the same tasks on each.
Same inputs, same evaluation criteria.
Score side by side.

No single platform wins every dimension. You pick based on the tradeoffs that matter most for your use case.

The decision

After evaluation:

Clear winner: adopt.
Close tie: the integration / support / vendor fit usually decides.
None meets bar: wait 6 months; the category moves fast.

Every agent platform demos well. Evaluating them for production use means asking the right questions — most of which the marketing won't answer.

The evaluation framework

Five dimensions:

Capability fit — does it do what you need?
Reliability — how often does it succeed on your tasks?
Observability — can you debug when things go wrong?
Safety — how does it handle risky actions?
Integration — how does it fit your existing systems?

Score each 1-5. Don't adopt anything scoring below 3 in any dimension for production use.

Capability fit

Demo tests:

Build the actual tasks you want to use the platform for. Not "a test task" — your real task.
Run each task 10 times.
Record success rate.

Most platforms claim 90%+ success rates. Your tasks, on your data, often land at 60-80%. Measure the real number.

Reliability metrics

Success rate per task type. Classification of successes, partial successes, failures.
Variance. Consistent 70% success is more useful than 90% success with a 40% crash rate.
Failure modes. What does failure look like? Silent wrong answer? Loud error? No result?

Silent wrong answers are the most dangerous — the platform produces plausible output that's incorrect.

Observability

Questions to answer:

Can you see the agent's reasoning at each step?
Can you see tool calls with inputs and outputs?
Can you replay a run after the fact?
Can you diff runs that succeeded vs. failed?

Platforms that score well here let you debug. Ones that don't leave you guessing.

Safety

Irreversible actions — what guardrails?
Authentication flows — how are credentials handled?
Tool limits — what's the agent allowed to do?
Cost controls — what prevents runaway spend?

A platform without clear answers here is not production-ready.

Integration

API access. Can you call the platform programmatically?
Connectors to your tools. Are the things you need already integrated?
Export/import. Can you move data in and out?
Webhooks / events. Can other systems react to agent completion?

The pilot design

For each serious candidate:

Pick 3-5 representative tasks.
Run each 20 times across varied inputs.
Score: success rate, time per task, cost per task.
Test failure recovery: break something on purpose, see how the agent handles it.
Test edge cases: unusual inputs, adversarial ones.
Involve 2-3 real users for usability feedback.

Duration: 2-4 weeks. Output: a data-backed go/no-go.

Ignoring marketing

Ignore:

"Up to 95% accuracy" — up-to claims are the lowest bar anyone ever hits.
Cherry-picked demos. Tasks that show best case.
Customer logos without specifics on the use case.

Pay attention to:

Published failure modes. Platforms that document when they don't work are more trustworthy.
Named reference customers willing to share their actual experience.
Clear data handling terms.

Cost at scale

Platform pricing structures to examine:

Per-task pricing. Unpredictable; watch for runaway costs.
Per-seat. Predictable but can under/over-provision.
Compute-based. You pay for runtime; long tasks cost more.
Hybrid. Complex; read the fine print.

Estimate monthly cost at 2x expected volume. Platforms with cost that scales superlinearly with usage are risky.

Vendor stability

Company age and funding.
Customer base and retention signals.
API stability commitments.
Acquisition risk — would you be comfortable if a larger company bought them?

The three-way bake-off

If you're deciding between platforms:

Pick 3 finalists.
Run the same tasks on each.
Same inputs, same evaluation criteria.
Score side by side.

No single platform wins every dimension. You pick based on the tradeoffs that matter most for your use case.

The decision

After evaluation:

Clear winner: adopt.
Close tie: the integration / support / vendor fit usually decides.
None meets bar: wait 6 months; the category moves fast.

Evaluating agent platforms: benchmarks vs. real work

The evaluation framework

Capability fit

Reliability metrics

Observability

Safety

Integration

The pilot design

Ignoring marketing

Cost at scale

Vendor stability

The three-way bake-off

The decision

2-question self-check

Continue in this track

Evaluating agent platforms: benchmarks vs. real work

The evaluation framework

Capability fit

Reliability metrics

Observability

Safety

Integration

The pilot design

Ignoring marketing

Cost at scale

Vendor stability

The three-way bake-off

The decision

2-question self-check

Continue in this track