evaluationcraft

Why your eval set is probably the problem

Scholarus AI

Nov 22, 2025

If the model keeps getting better but your metric doesn't move, the metric is lying to one of you. Usually it's lying to you.

The most common failure we see in eval sets isn't sloppy labeling. It's that the set stops reflecting the product. The distribution drifts, the users change, the edge cases the set was written around get solved, and nobody updates it.

Three signs your eval set is stale

  1. It hasn't gained cases in a month. Real usage produces new failure modes weekly.
  2. It's made of solved cases. Your top-of-the-line model passes 98% of it. All you can measure is noise.
  3. Nobody argues about labels anymore. Disagreement about what "good" looks like is the eval set doing its job. If everyone's just checking boxes, the set has lost edge.
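The three signs above are all detectable from data you probably already have. Here's a minimal sketch of a staleness check, assuming a hypothetical `EvalCase` record; the field names and the specific thresholds (30 days, 98% pass rate, unanimity) are illustrative assumptions, not tooling from this post:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class EvalCase:
    added: date          # when the case entered the set
    passed: bool         # did your current best model pass it?
    label_votes: int     # labelers who agreed with the gold label
    label_total: int     # labelers who voted on it

def staleness_signs(cases: list[EvalCase], today: date) -> list[str]:
    """Flag the three staleness signs. Thresholds are illustrative."""
    signs = []
    # Sign 1: the set hasn't gained a case in the last 30 days.
    if all(today - c.added > timedelta(days=30) for c in cases):
        signs.append("no new cases in a month")
    # Sign 2: the set is mostly solved (pass rate at or above 98%).
    if sum(c.passed for c in cases) / len(cases) >= 0.98:
        signs.append("mostly solved cases")
    # Sign 3: labelers are unanimous on every case -- no live disagreement.
    if all(c.label_votes == c.label_total for c in cases):
        signs.append("no label disagreement")
    return signs
```

A check like this is cheap enough to run weekly in CI; the point isn't the exact thresholds, it's that staleness becomes an alert instead of a vibe.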

The fix is slow

Rotating an eval set isn't glamorous — it's reading production traces, labeling the interesting ones, and retiring the boring ones. We budget an hour a week per person on a working eval. It's the single highest-leverage hour in the rotation.
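The weekly pass described above can be sketched as two mechanical halves: retire cases every recent model run has passed, and sample fresh production traces into a human review queue. This is a sketch under stated assumptions — `rotate_eval_set`, the `pass_streak` counter, and both thresholds are hypothetical, not this team's actual tooling — and the labeling step stays human:

```python
import random

def rotate_eval_set(eval_cases, production_traces, solved_streak=5, sample_n=20):
    """One weekly rotation pass (illustrative sketch).

    eval_cases: dicts carrying a 'pass_streak' count of consecutive model
    passes. production_traces: raw traces to sample for human review.
    """
    # Retire cases the model has passed several runs in a row -- they no
    # longer measure anything but noise.
    kept = [c for c in eval_cases if c["pass_streak"] < solved_streak]
    # Sample fresh traces for the labeling hour. Interesting ones get
    # labeled and promoted into the set by a reviewer, not automatically.
    review_queue = random.sample(production_traces,
                                 min(sample_n, len(production_traces)))
    return kept, review_queue
```

Keeping promotion manual is the design choice that matters here: the retire step can be automated safely, but deciding which new traces are "interesting" is exactly the judgment the weekly hour buys.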
