Prompt testing frameworks: deterministic and judge-based
Build a test harness that catches prompt regressions before they ship.
A prompt without tests is a production bug waiting for its moment. Here's how to build a harness that catches those regressions early.
The testing pyramid, adapted for prompts
From cheapest to most expensive:
- Schema/format tests. Deterministic: does the output match the expected structure? 100% automated. Run on every PR.
- Unit tests. Specific input → specific expected output (for classification, extraction, short-answer tasks). Deterministic comparison. Run on every PR.
- Judge-based tests. LLM judges open-ended outputs against a rubric. Moderate cost; run on every PR if budget allows, nightly otherwise.
- Human eval. Spot-checks by domain experts. Quarterly or on major changes.
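The cheapest layer can be pure code. Below is a minimal sketch of a deterministic schema test for the triage classifier used later in this article; the category and priority sets are illustrative assumptions, not part of any real prompt.

```typescript
// Deterministic schema test: check that the model's raw output parses as JSON
// and that required fields hold allowed values. No LLM call, no judge needed.
const ALLOWED_CATEGORIES = new Set(["billing", "technical", "account"]);
const ALLOWED_PRIORITIES = new Set(["low", "medium", "high"]);

function validateTriage(raw: string): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  const obj = parsed as Record<string, unknown>;
  if (typeof obj?.category !== "string" || !ALLOWED_CATEGORIES.has(obj.category as string)) {
    errors.push("category missing or not in allowed set");
  }
  if (typeof obj?.priority !== "string" || !ALLOWED_PRIORITIES.has(obj.priority as string)) {
    errors.push("priority missing or not in allowed set");
  }
  return { ok: errors.length === 0, errors };
}
```

Because these checks are deterministic and cheap, they can run on every PR with zero flakiness.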
Building a unit test suite
For classification and extraction prompts, you want 50-200 labeled examples. Structure:
tests/triage-classifier/
  cases.yml    # input + expected output per case
  ringers.yml  # cases that broke production before; never regress
  runner.ts    # executes prompt, compares, reports
# cases.yml
- id: TC-001
  input: "My invoice has a 200-dollar error on it..."
  expected:
    category: billing
    priority: high
- id: TC-002
  ...
The runner invokes the prompt, parses the response, compares it against the expected output, and reports precision/recall per field.
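A minimal sketch of such a runner follows. `runPrompt` is a hypothetical stub standing in for the real (normally async) model call, and the comparison reports per-field accuracy, which you can extend to per-label precision/recall.

```typescript
type Case = { id: string; input: string; expected: Record<string, string> };

// Stub standing in for a real model call + JSON parse. In production this
// would be async and call your provider; here it returns a fixed answer so
// the comparison logic is the focus.
function runPrompt(input: string): Record<string, string> {
  return { category: "billing", priority: "high" };
}

function runSuite(cases: Case[]) {
  const perField: Record<string, { correct: number; total: number }> = {};
  const failures: string[] = [];
  for (const c of cases) {
    const actual = runPrompt(c.input);
    for (const [field, expected] of Object.entries(c.expected)) {
      const stats = (perField[field] ??= { correct: 0, total: 0 });
      stats.total += 1;
      if (actual[field] === expected) stats.correct += 1;
      else failures.push(`${c.id}: ${field} expected "${expected}", got "${actual[field]}"`);
    }
  }
  return { perField, failures };
}
```

Loading `cases.yml` into the `Case[]` shape is left to your YAML parser of choice.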
Judge-based testing in practice
For open-ended outputs (summaries, rewrites, code generation), deterministic comparison doesn't work. Use a judge model with a rubric:
You are grading a meeting summary. The ideal summary:
1. Mentions all decisions made.
2. Lists action items with owners.
3. Is under 300 words.
4. Doesn't editorialize.
Score each criterion 1-5 and return JSON.
Key practices:
- Use a different model family for the judge than the one producing the output.
- Calibrate the rubric on a small set of human-graded examples first.
- Log judge disagreement. When the judge says "5" but a human says "2," the rubric needs work.
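The disagreement check in the last practice can be automated. A sketch, assuming both judge and human grades are stored as criterion-to-score maps (the threshold of 2 points is an illustrative choice, not a standard):

```typescript
// Compare judge scores against human-graded calibration examples and flag
// rubric criteria where the two diverge by `threshold` or more points.
type Scores = Record<string, number>; // criterion -> score on a 1-5 scale

function disagreement(judge: Scores, human: Scores, threshold = 2): string[] {
  const flagged: string[] = [];
  for (const criterion of Object.keys(human)) {
    const diff = Math.abs((judge[criterion] ?? 0) - human[criterion]);
    if (diff >= threshold) flagged.push(criterion);
  }
  return flagged;
}
```

Run this over the calibration set after every rubric change; criteria that keep getting flagged are the ones whose wording needs work.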
The regression set discipline
Every bug that reaches production goes into the regression set with a comment ("Was classifying billing as technical before v2.1"). These tests never get deleted. A PR that regresses any of them doesn't merge.
This single practice catches the majority of prompt bugs before they recur.
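The merge gate itself is a few lines. A sketch, assuming each test result carries a `ringer` flag marking it as a regression case (field names are illustrative):

```typescript
// Merge gate: any failing regression ("ringer") case blocks the PR outright,
// regardless of the aggregate pass rate.
type Result = { id: string; ringer: boolean; passed: boolean };

function gate(results: Result[]): { merge: boolean; blocking: string[] } {
  const blocking = results.filter((r) => r.ringer && !r.passed).map((r) => r.id);
  return { merge: blocking.length === 0, blocking };
}
```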
Test coverage for prompts
A prompt's "test coverage" is less about code paths and more about behavioral coverage:
- Does your test set cover edge cases (empty input, very long input, adversarial input)?
- Does it cover each category you classify with realistic diversity?
- Does it exercise each rule in the prompt at least once?
Map rules to tests. If you have 7 rules in the prompt and 3 are never tested, those are latent bugs.
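One way to make that mapping checkable is to annotate each test case with the rule IDs it exercises, then diff against the prompt's rule list. A sketch, with the annotation scheme as an assumption:

```typescript
// Report prompt rules that no test case exercises. Assumes each case is
// annotated (e.g., in cases.yml) with the rule IDs it covers.
type AnnotatedCase = { id: string; rules: string[] };

function untestedRules(promptRules: string[], cases: AnnotatedCase[]): string[] {
  const covered = new Set(cases.flatMap((c) => c.rules));
  return promptRules.filter((rule) => !covered.has(rule));
}
```

A non-empty result is a coverage gap report: each listed rule is a latent bug until a case exercises it.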
When tests conflict with eval
Unit tests are fine-grained (specific input → specific output). Eval is aggregate (e.g., a 10% precision improvement across the suite). Both matter; they catch different things. Don't pick one.
Tooling
- Bespoke scripts. Most teams start here. Fast, no dependency overhead. Weak on orchestration.
- Promptfoo, DeepEval, LangSmith evals, Braintrust. Various platforms offer promptfile + test runner + dashboards. Worth adopting once you have >50 tests.
- CI integration. Whichever framework you choose, run the tests on every PR. The point is the feedback loop.
Check your understanding
2-question self-check
Optional. Your answers feed your knowledge score on the track certificate.
Q1. When is an LLM-as-judge appropriate?
Q2. A key trait of a solid regression test set is…