Debugging a prompt that won't behave
A systematic process for diagnosing prompt failures.
Most prompt debugging is done by re-asking with slightly different wording and hoping. Here's the systematic approach that actually finds the root cause.
Step 1: Read the actual trace
The first and most ignored step. Pull the full model request and response. Look at:
- The exact system prompt — is it what you think it is?
- The exact user input — any unexpected whitespace, templated variables that didn't interpolate, tokens you didn't mean to include?
- The exact response — not your parsed version, the raw text.
In 40% of prompt bugs, the root cause is visible here: a variable that interpolated as undefined, a stray newline, a leftover example from copy-paste.
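Some of these checks can be automated before you read the trace by eye. The following is a minimal sketch, not a complete linter: the function name `inspect_rendered_prompt`, the placeholder patterns, and the list of suspicious literals are all assumptions chosen for illustration, and the literal check is a heuristic that will also flag legitimate uses of words like "None".

```python
import re

# Assumed placeholder syntaxes: {{name}}, ${name}, {name}.
# Adjust to whatever templating your own stack uses.
UNFILLED = re.compile(r"\{\{\s*\w+\s*\}\}|\$\{\w+\}|\{\w+\}")

# Values that often appear when a variable failed to interpolate.
BAD_LITERALS = ("undefined", "None", "null", "NaN")


def inspect_rendered_prompt(text: str) -> list[str]:
    """Return human-readable warnings about a fully rendered prompt."""
    warnings = []
    for match in UNFILLED.finditer(text):
        warnings.append(f"unfilled placeholder: {match.group(0)}")
    for literal in BAD_LITERALS:
        if re.search(rf"\b{literal}\b", text):
            warnings.append(f"suspicious literal: {literal}")
    if text != text.strip():
        warnings.append("leading or trailing whitespace")
    if "\n\n\n" in text:
        warnings.append("run of blank lines")
    return warnings


if __name__ == "__main__":
    rendered = "Hello {{user_name}}, your order is undefined.\n"
    for warning in inspect_rendered_prompt(rendered):
        print(warning)
```

Run it on the exact string sent to the model, not on the template, since the bugs described above only exist after rendering.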
Step 2: Binary search the prompt
Suspect that a specific instruction is being ignored? Remove half the prompt and test. If the bug persists, halve what's left; if it disappears, restore that half and cut the other one instead. You're looking for the smallest change that reproduces the bug: that's the point of failure.
This feels slow; it isn't. Each round halves the search space, so you typically converge in 4-5 iterations.
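The halving loop can be sketched in a few lines. This is an illustration under assumptions, not a full delta-debugging implementation: it assumes the prompt is already split into a list of sections, and `reproduces_bug` is a hypothetical callback you write yourself that assembles the sections, calls the model, and returns True when the failure appears.

```python
from typing import Callable, Sequence


def bisect_prompt(
    sections: Sequence[str],
    reproduces_bug: Callable[[Sequence[str]], bool],
) -> list[str]:
    """Shrink a list of prompt sections while the bug still reproduces.

    reproduces_bug is a hypothetical callback supplied by the caller.
    """
    current = list(sections)
    while len(current) > 1:
        mid = len(current) // 2
        first, second = current[:mid], current[mid:]
        if reproduces_bug(first):
            current = first
        elif reproduces_bug(second):
            current = second
        else:
            # Neither half fails alone, so the bug needs sections
            # from both halves. Stop and inspect what is left by hand.
            break
    return current


if __name__ == "__main__":
    sections = ["role", "rules", "format", "examples", "tone", "input"]
    # Stand-in for a real model call: the bug appears whenever the
    # "examples" section is present.
    culprit = bisect_prompt(sections, lambda s: "examples" in s)
    print(culprit)
```

Because model output is not deterministic, a real `reproduces_bug` would likely run each candidate several times and report failure only above some threshold.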
Step 3: Test at the edges
The prompt works on "normal" input but fails on one case? Don't try to debug from the failure case directly. Instead:
- Test with a deliberately minimal case. If the model handles it fine, complexity is the problem.
- Test with an obviously-correct case just outside the failing one. Where's the boundary?
- Test with a case that exaggerates the failure pattern. Does the model still fail in the same way?
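One way to locate the boundary is to grow the input until it starts failing. The sketch below assumes the failing input can be expressed as a list of items (paragraphs of a document, turns of a chat) and that once a prefix fails, every longer prefix fails too. Neither assumption always holds, and `fails` is a hypothetical callback that runs the prompt on the given items and returns True on failure.

```python
from typing import Callable, Optional


def find_failure_boundary(
    items: list[str],
    fails: Callable[[list[str]], bool],
) -> Optional[int]:
    """Return the smallest prefix length of items for which fails() is True.

    Assumes the empty input passes and that failure is monotonic in
    prefix length. Returns None if the full input does not fail.
    """
    if not fails(items):
        return None
    lo, hi = 0, len(items)  # prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(items[:mid]):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    paragraphs = [f"paragraph {i}" for i in range(10)]
    # Stand-in for a real model call: fails once 7 or more
    # paragraphs are included.
    boundary = find_failure_boundary(paragraphs, lambda p: len(p) >= 7)
    print(boundary)
```

The input one item shorter than the returned length is the "obviously-correct case just outside the failing one", and diffing the two is usually more informative than staring at the full failure.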
Common failure patterns and their fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| Model returns wrong format | Format rule is buried or vague | Move to end of prompt, show example |
| Model refuses valid request | Alignment hair-trigger | Rephrase; add explicit permission |
| Model repeats itself | Context is saturated | Shorten prompt, summarize history |
| Model ignores a rule | Conflicting rules | Prioritize; remove softer rules |
| Quality drops in long chats | Truncation | Summarize mid-chat; preserve key facts |
| Output more verbose than needed | No length constraint | Add word/sentence cap to format section |
Step 4: Compare models
When the prompt seems right but the model isn't cooperating, run the same prompt on other models. Does Claude handle it? Does a larger GPT model? That tells you whether the issue lies in the prompt or in the model's capability.
If every model struggles, the prompt is the problem. If only one model fails, you might just be hitting that model's alignment quirks.
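The decision rule above is simple enough to write down. In this sketch each model is represented by a plain callable that takes a prompt and returns text; how you wrap a real provider client into such a callable is left to you, and the names `classify_failure` and `passes` are hypothetical.

```python
from typing import Callable, Mapping


def classify_failure(
    prompt: str,
    models: Mapping[str, Callable[[str], str]],
    passes: Callable[[str], bool],
) -> str:
    """Run one prompt across several models and summarize who fails."""
    failing = [
        name for name, call in models.items() if not passes(call(prompt))
    ]
    if not failing:
        return "no failure reproduced"
    if len(failing) == len(models):
        return "all models fail: suspect the prompt"
    return "model-specific: " + ", ".join(sorted(failing))


if __name__ == "__main__":
    # Stand-ins for real model clients.
    fake_models = {
        "model_a": lambda p: '{"status": "ok"}',
        "model_b": lambda p: "Sure! Here is the JSON you asked for...",
    }
    verdict = classify_failure(
        "Return JSON only.", fake_models, lambda out: out.startswith("{")
    )
    print(verdict)
```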
Step 5: Write the bug into your eval set
Every prompt bug that shipped to production is a data point too valuable to throw away. Add it to your regression evals. If a future prompt change reintroduces the bug, you'll catch it.
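A regression eval can start as nothing more than a list of named cases and a loop. The sketch below is one possible shape, not a prescribed framework: `RegressionCase`, `run_regressions`, and the `generate` callback (which should call your prompt plus model and return the raw output) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class RegressionCase:
    name: str                     # short label for the original bug
    user_input: str               # the input that triggered it
    check: Callable[[str], bool]  # True when the output is acceptable


def run_regressions(
    cases: Iterable[RegressionCase],
    generate: Callable[[str], str],
) -> list[str]:
    """Return the names of cases whose output fails its check."""
    return [
        case.name
        for case in cases
        if not case.check(generate(case.user_input))
    ]


if __name__ == "__main__":
    cases = [
        RegressionCase("json-format", "order 123",
                       lambda out: out.startswith("{")),
        RegressionCase("no-apology", "hi",
                       lambda out: "sorry" not in out.lower()),
    ]

    # Stand-in for the real prompt + model call.
    def fake_generate(user_input: str) -> str:
        if "order" in user_input:
            return '{"ok": true}'
        return "Sorry, I can't help with that."

    print(run_regressions(cases, fake_generate))
```

Running this on every prompt change turns each shipped bug into a permanent check, which is the point of the step above.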
What not to do
- Don't fix by appending "IMPORTANT:" and more all-caps warnings. That's a local maximum.
- Don't fix by making the prompt longer. Sharpen, don't pad.
- Don't fix by switching models without first understanding the failure.
Systematic debugging is slower per-bug and much faster per-project.