Debugging a prompt that won't behave

Most prompt debugging is done by re-asking with slightly different wording and hoping. Here's the systematic approach that actually finds the root cause.

Step 1: Read the actual trace

The first and most ignored step. Pull the full model request and response. Look at:

The exact system prompt — is it what you think it is?
The exact user input — any unexpected whitespace, templated variables that didn't interpolate, tokens you didn't mean to include?
The exact response — not your parsed version, the raw text.

In 40% of prompt bugs, the root cause is visible here: a variable that interpolated as undefined, a stray newline, a leftover example from copy-paste.
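One way to make this check routine is to scan the final rendered prompt for the usual debris before reading it by eye. The sketch below is illustrative, not a complete linter: the `audit_rendered_prompt` name and the pattern list are assumptions, and the placeholder syntax would need adapting to whatever template engine you use. Treat hits as things to look at, not as confirmed bugs (a prompt can legitimately contain the word "null").

```python
import re

# Hypothetical patterns for common interpolation debris; adjust to your templates.
SUSPECT_PATTERNS = {
    "unfilled placeholder": re.compile(r"\{\{.*?\}\}|\$\{.*?\}"),
    "literal undefined/None": re.compile(r"\b(?:undefined|None|null|NaN)\b"),
    "trailing whitespace": re.compile(r"[ \t]+$", re.MULTILINE),
    "run of blank lines": re.compile(r"\n{4,}"),
}


def audit_rendered_prompt(text: str) -> list:
    """Return (issue, matched_text) pairs found in the final rendered prompt."""
    findings = []
    for issue, pattern in SUSPECT_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((issue, match.group(0)))
    return findings


# Example: a prompt where one variable never interpolated and another came through as undefined.
rendered = (
    "You are a support agent for {{company}}.\n"
    "Customer name: undefined  \n"
    "Answer briefly."
)
for issue, snippet in audit_rendered_prompt(rendered):
    print(f"{issue}: {snippet!r}")
```

Run it on the exact string sent to the model, not on the template source, since the point of this step is to see what the model actually received.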

Step 2: Binary search the prompt

Suspect that a specific instruction is being ignored? Remove half the prompt and test. If the bug persists, the cause is in the half you kept; if it disappears, the cause is in the half you removed. Then repeat on the suspect half. You're looking for the smallest change that reproduces the bug: that's the point of failure.

This feels slow; it isn't. Each round halves the search space, so a prompt with 16 to 32 separable instructions converges in 4-5 iterations.
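The halving loop can be sketched in a few lines. This is a minimal version under two stated assumptions: a single section is responsible for the bug, and the bug reproduces deterministically. The `bisect_prompt` name and the `reproduces_bug` callable are hypothetical; in practice `reproduces_bug` would send the candidate prompt (plus your fixed test input) to the model and check the response.

```python
from typing import Callable, Sequence, Tuple


def bisect_prompt(
    sections: Sequence[str],
    reproduces_bug: Callable[[str], bool],
) -> Tuple[str, int]:
    """Narrow a failing prompt down to one section.

    Assumes exactly one section causes the bug and that the bug is
    deterministic. Returns the culprit section and the number of test runs.
    """
    candidates = list(sections)
    runs = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        runs += 1
        # Keep whichever half still reproduces the bug.
        if reproduces_bug("\n\n".join(first_half)):
            candidates = first_half
        else:
            candidates = candidates[mid:]
    return candidates[0], runs


# Stand-in for a real model call: the "bug" appears whenever this rule is present.
sections = [f"Rule {i}: placeholder instruction." for i in range(16)]
sections[11] = "Always reply in French."
culprit, runs = bisect_prompt(sections, lambda prompt: "French" in prompt)
print(culprit, "found in", runs, "runs")
```

If the bug only appears when two sections interact, this simple loop will not isolate it; in that case keep one suspect fixed in every run and bisect the rest.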

Step 3: Test at the edges

The prompt works on "normal" input but fails on one case? Don't try to debug from the failure case directly. Instead:

Test with a deliberately minimal case. If the model handles it fine, complexity is the problem.
Test with a case that clearly should work and sits just outside the failing one. Where's the boundary?
Test with a case that exaggerates the failure pattern. Does the model still fail in the same way?
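The three probes above amount to walking a ladder of cases from simplest to most extreme and noting where the first failure appears. Here is a rough sketch of that idea; `find_boundary`, the example ladder, and the `fake_passes` stub are all hypothetical, and in real use `passes` would run the prompt against the model and grade the output.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_boundary(
    cases: Sequence[str],
    passes: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Walk cases ordered from simplest to most extreme.

    Returns (last passing case before the first failure, first failing case).
    Either element is None if there is no such case.
    """
    last_pass = None
    for case in cases:
        if not passes(case):
            return last_pass, case
        last_pass = case
    return last_pass, None


# Hypothetical ladder for a prompt that extracts line items from an order.
ladder = [
    "1 widget",                                   # deliberately minimal
    "1 widget, 2 bolts",                          # normal input
    "1 widget, 2 bolts, 3 nuts (gift)",           # just outside the failing case
    "1 widget, 2 bolts, 1 bolt returned",         # the reported failure
    "5 widgets returned, 4 bolts returned",       # exaggerated failure pattern
]


def fake_passes(case: str) -> bool:
    # Stand-in for a real model call plus grading.
    return "returned" not in case


works, breaks = find_boundary(ladder, fake_passes)
print("last passing:", works)
print("first failing:", breaks)
```

The pair it returns is the boundary: the difference between those two inputs is what to study next.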

Common failure patterns and their fixes

Symptom	Likely cause	Fix
Model returns wrong format	Format rule is buried or vague	Move to end of prompt, show example
Model refuses valid request	Alignment hair-trigger	Rephrase; add explicit permission
Model repeats itself	Context is saturated	Shorten prompt, summarize history
Model ignores a rule	Conflicting rules	Prioritize; remove softer rules
Quality drops in long chats	Truncation	Summarize mid-chat; preserve key facts
Output more verbose than needed	No length constraint	Add word/sentence cap to format section

Step 4: Compare models

When the prompt seems right but the model isn't cooperating, swap models. Does Claude handle it? Does a larger GPT model? That tells you whether the issue is the prompt or the model's capability.

If every model struggles, the prompt is the problem. If only one model fails, you might just be hitting that model's alignment quirks.
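The decision rule is simple enough to write down, which helps when you are comparing more than two models. This is only a sketch of the rule as stated: the `diagnose` function and the model labels are placeholders, and the pass/fail values would come from running the same prompt and input against each model.

```python
from typing import Dict


def diagnose(results: Dict[str, bool]) -> str:
    """results maps a model label to whether the prompt passed on that model."""
    failed = [name for name, passed in results.items() if not passed]
    if not failed:
        return "no failure reproduced"
    if len(failed) == len(results):
        return "suspect the prompt"
    return "suspect model-specific behavior: " + ", ".join(failed)


# Placeholder labels; substitute the models you actually tested.
print(diagnose({"model_a": False, "model_b": False, "model_c": False}))
print(diagnose({"model_a": True, "model_b": False, "model_c": True}))
```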

Step 5: Write the bug into your eval set

Every prompt bug that shipped to production is a data point too valuable to throw away. Add it to your regression evals. If a future prompt change reintroduces the bug, you'll catch it.
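A regression case only needs three things: a name, the input that triggered the bug, and a check that fails when the bug is back. The sketch below shows one possible shape; `RegressionCase`, `run_regressions`, and the `fake_generate` stub are hypothetical names, and `generate` stands in for whatever function calls your model with the current prompt.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class RegressionCase:
    name: str
    user_input: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_regressions(
    cases: Iterable[RegressionCase],
    generate: Callable[[str], str],
) -> List[str]:
    """Return the names of cases whose output fails its check."""
    return [case.name for case in cases if not case.check(generate(case.user_input))]


# One case per bug that reached production (example content is made up).
CASES = [
    RegressionCase(
        name="json-only-output",
        user_input="Summarize: the meeting moved to Friday.",
        check=lambda out: out.strip().startswith("{") and out.strip().endswith("}"),
    ),
]


def fake_generate(user_input: str) -> str:
    # Stand-in for a real model call using the current prompt.
    return '{"summary": "Meeting moved to Friday."}'


failures = run_regressions(CASES, fake_generate)
print("regressions:", failures or "none")
```

Run the full list on every prompt change; a non-empty result means an old bug has come back.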

What not to do

Don't fix by appending "IMPORTANT:" and more all-caps warnings. That's a local maximum.
Don't fix by making the prompt longer. Sharpen, don't pad.
Don't fix by switching models without first understanding the failure.

Systematic debugging is slower per-bug and much faster per-project.
