Prompt testing frameworks: deterministic and judge-based
Build a test harness that catches prompt regressions before they ship.
A prompt without tests is a production bug waiting for its moment. Here's how to build a harness that catches those regressions early.
The testing pyramid, adapted for prompts
From cheapest to most expensive:
- Schema/format tests. Deterministic: does the output match the expected structure? 100% automated. Run on every PR.
- Unit tests. Specific input → specific expected output (for classification, extraction, short-answer tasks). Deterministic comparison. Run on every PR.
- Judge-based tests. LLM judges open-ended outputs against a rubric. Moderate cost; run on every PR if budget allows, nightly otherwise.
- Human eval. Spot-checks by domain experts. Quarterly or on major changes.
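The cheapest layer can be pure code. Below is a minimal sketch of a deterministic schema test for the triage classifier used later in this article; the category and priority sets are illustrative assumptions, not part of any real prompt.

```typescript
// Deterministic schema test: check that the model's raw output parses as JSON
// and that required fields hold allowed values. No LLM call, no judge needed.
const ALLOWED_CATEGORIES = new Set(["billing", "technical", "account"]);
const ALLOWED_PRIORITIES = new Set(["low", "medium", "high"]);

function validateTriage(raw: string): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  const obj = parsed as Record<string, unknown>;
  if (typeof obj?.category !== "string" || !ALLOWED_CATEGORIES.has(obj.category as string)) {
    errors.push("category missing or not in allowed set");
  }
  if (typeof obj?.priority !== "string" || !ALLOWED_PRIORITIES.has(obj.priority as string)) {
    errors.push("priority missing or not in allowed set");
  }
  return { ok: errors.length === 0, errors };
}
```

Because these checks are deterministic and cheap, they can run on every PR with zero flakiness.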
Building a unit test suite
For classification and extraction prompts, you want 50-200 labeled examples. Structure:
tests/triage-classifier/
  cases.yml    # input + expected output per case
  ringers.yml  # cases that broke production before; never regress
  runner.ts    # executes prompt, compares, reports
# cases.yml
- id: TC-001
  input: "My invoice has a 200-dollar error on it..."
  expected:
    category: billing
    priority: high
- id: TC-002
  ...
The runner invokes the prompt, parses the response, compares it against the expected output, and reports precision/recall per field.
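A minimal sketch of such a runner follows. `runPrompt` is a hypothetical stub standing in for the real (normally async) model call, and the comparison reports per-field accuracy, which you can extend to per-label precision/recall.

```typescript
type Case = { id: string; input: string; expected: Record<string, string> };

// Stub standing in for a real model call + JSON parse. In production this
// would be async and call your provider; here it returns a fixed answer so
// the comparison logic is the focus.
function runPrompt(input: string): Record<string, string> {
  return { category: "billing", priority: "high" };
}

function runSuite(cases: Case[]) {
  const perField: Record<string, { correct: number; total: number }> = {};
  const failures: string[] = [];
  for (const c of cases) {
    const actual = runPrompt(c.input);
    for (const [field, expected] of Object.entries(c.expected)) {
      const stats = (perField[field] ??= { correct: 0, total: 0 });
      stats.total += 1;
      if (actual[field] === expected) stats.correct += 1;
      else failures.push(`${c.id}: ${field} expected "${expected}", got "${actual[field]}"`);
    }
  }
  return { perField, failures };
}
```

Loading `cases.yml` into the `Case[]` shape is left to your YAML parser of choice.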
Judge-based testing in practice
For open-ended outputs (summaries, rewrites, code generation), deterministic comparison doesn't work. Use a judge model with a rubric:
You are grading a meeting summary. The ideal summary:
1. Mentions all decisions made.
2. Lists action items with owners.
3. Is under 300 words.
4. Doesn't editorialize.
Score each criterion 1-5 and return JSON.
Key practices:
- Use a different model family for the judge than the one producing the output.
- Calibrate the rubric on a small set of human-graded examples first.
- Log judge disagreement. When the judge says "5" but a human says "2," the rubric needs work.
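The disagreement check in the last practice can be automated. A sketch, assuming both judge and human grades are stored as criterion-to-score maps (the threshold of 2 points is an illustrative choice, not a standard):

```typescript
// Compare judge scores against human-graded calibration examples and flag
// rubric criteria where the two diverge by `threshold` or more points.
type Scores = Record<string, number>; // criterion -> score on a 1-5 scale

function disagreement(judge: Scores, human: Scores, threshold = 2): string[] {
  const flagged: string[] = [];
  for (const criterion of Object.keys(human)) {
    const diff = Math.abs((judge[criterion] ?? 0) - human[criterion]);
    if (diff >= threshold) flagged.push(criterion);
  }
  return flagged;
}
```

Run this over the calibration set after every rubric change; criteria that keep getting flagged are the ones whose wording needs work.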
The regression set discipline
Every bug that reaches production goes into the regression set with a comment ("Was classifying billing as technical before v2.1"). These tests never get deleted. A PR that regresses any of them doesn't merge.
This single practice catches the majority of prompt bugs before they recur.
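The merge gate itself is a few lines. A sketch, assuming each test result carries a `ringer` flag marking it as a regression case (field names are illustrative):

```typescript
// Merge gate: any failing regression ("ringer") case blocks the PR outright,
// regardless of the aggregate pass rate.
type Result = { id: string; ringer: boolean; passed: boolean };

function gate(results: Result[]): { merge: boolean; blocking: string[] } {
  const blocking = results.filter((r) => r.ringer && !r.passed).map((r) => r.id);
  return { merge: blocking.length === 0, blocking };
}
```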
Test coverage for prompts
A prompt's "test coverage" is less about code paths and more about behavioral coverage:
- Does your test set cover edge cases (empty input, very long input, adversarial input)?
- Does it cover each category you classify with realistic diversity?
- Does it exercise each rule in the prompt at least once?
Map rules to tests. If you have 7 rules in the prompt and 3 are never tested, those are latent bugs.
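One way to make that mapping checkable is to annotate each test case with the rule IDs it exercises, then diff against the prompt's rule list. A sketch, with the annotation scheme as an assumption:

```typescript
// Report prompt rules that no test case exercises. Assumes each case is
// annotated (e.g., in cases.yml) with the rule IDs it covers.
type AnnotatedCase = { id: string; rules: string[] };

function untestedRules(promptRules: string[], cases: AnnotatedCase[]): string[] {
  const covered = new Set(cases.flatMap((c) => c.rules));
  return promptRules.filter((rule) => !covered.has(rule));
}
```

A non-empty result is a coverage gap report: each listed rule is a latent bug until a case exercises it.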
When tests conflict with eval
Unit tests are fine-grained (specific input → specific output). Eval is aggregate (e.g., a 10% precision improvement across the suite). Both matter; they catch different things. Don't pick one.
Tooling
- Bespoke scripts. Most teams start here. Fast, no dependency overhead. Weak on orchestration.
- Promptfoo, DeepEval, LangSmith evals, Braintrust. Various platforms offer promptfile + test runner + dashboards. Worth adopting once you have >50 tests.
- CI integration. Whichever framework you choose, run the tests on every PR. The point is the feedback loop.
Check your understanding
2-question self-check
Optional. Your answers feed your knowledge score on the track certificate.
Q1. When is an LLM-as-judge appropriate?
Q2. A key trait of a solid regression test set is…