evaluationcraft
Why your eval set is probably the problem
Scholarus AI
Nov 22, 2025
If the model keeps getting better but your metric doesn't move, the metric is lying to one of you. Usually it's lying to you.
The most common failure we see in eval sets isn't sloppy labeling. It's that the set stops reflecting the product. The distribution drifts, the users change, the edge cases the set was written around get solved, and nobody updates it.
Three signs your eval set is stale
- It hasn't gained cases in a month. Real usage produces new failure modes weekly.
- It's made of solved cases. Your top-of-the-line model passes 98% of it, so any movement in the score is dominated by label noise and flaky cases, not real capability change.
- Nobody argues about labels anymore. Disagreement about what "good" looks like is the eval set doing its job. If everyone's just checking boxes, the set has lost edge.
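To make the saturation point concrete, here is a minimal sketch of why a 98% pass rate stops being informative. It uses the standard Wilson score interval; the function name and numbers are illustrative, not from any specific eval tool.

```python
import math

def pass_rate_ci(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed pass rate: a rough gauge of
    how much of a metric change is just sampling noise."""
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - margin, center + margin

# A 200-case set where the model already passes 196/200 (98%):
lo, hi = pass_rate_ci(196, 200)
print(f"95% CI on the true pass rate: [{lo:.3f}, {hi:.3f}]")
```

The interval spans roughly four points, so a one-point "improvement" between model versions is indistinguishable from noise until the set gains harder cases.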
The fix is slow
Rotating an eval set isn't glamorous: it's reading production traces, labeling the interesting ones, and retiring the boring ones. We budget an hour a week per person on any eval in active use. It's the single highest-leverage hour in the rotation.
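The weekly rotation described above can be sketched as a small loop: retire cases that every recent model version passes (keeping a few as regression canaries), and pull a sample of fresh production traces in for labeling. Everything here is a hypothetical shape, not a real pipeline; the function, its parameters, and the data layout are assumptions.

```python
import random

def rotate_eval_set(cases, recent_results, new_traces,
                    keep_solved=0.1, sample_new=20):
    """Hypothetical weekly rotation sketch.

    cases:          list of case ids currently in the eval set
    recent_results: {case_id: [bool, ...]} pass/fail per recent model version
    new_traces:     list of unlabeled production trace ids
    """
    # A case is "solved" if every recent model version passed it.
    solved = [c for c in cases if all(recent_results.get(c, [False]))]
    live = [c for c in cases if c not in solved]
    # Keep a small fraction of solved cases as regression canaries.
    canaries = (random.sample(solved, max(1, int(len(solved) * keep_solved)))
                if solved else [])
    # Sample fresh production traces to label this week.
    to_label = random.sample(new_traces, min(sample_new, len(new_traces)))
    return live + canaries, to_label
```

The point of the canary fraction is that "solved" is a claim about current models; retiring a case entirely means a future regression on it goes unmeasured.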