Capstone: a production-grade prompt from scratch
Assemble everything into a single production-ready prompt, with evals.
A production-ready prompt isn't something you write — it's something you build. This lesson walks through assembling one, end-to-end, for a realistic task.
The task
Build a classifier that tags incoming customer support messages with:
- category (billing, technical, account, feature_request, other)
- priority (low, medium, high, urgent)
- sentiment (positive, neutral, negative, frustrated)
- suggested_response_type (knowledge_base_reply, human_escalation, automated_ack)
The classifier is deployed to a real support queue, so every misclassification has a human cost.
Step 1: Start with the output schema
Schema-first prompting is the cleanest path. Define the exact JSON shape you want before writing any prose:
{
"category": "billing | technical | account | feature_request | other",
"priority": "low | medium | high | urgent",
"sentiment": "positive | neutral | negative | frustrated",
"suggested_response_type": "knowledge_base_reply | human_escalation | automated_ack",
"confidence": 0.0–1.0
}
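One way to make the schema enforceable is to keep it as data and check every model output against it. The sketch below is a minimal, dependency-free version; the names ALLOWED and validate_classification are illustrative, not part of any library, and a team already using a schema library would likely use that instead.

```python
# Allowed values are copied from the schema above. The names ALLOWED and
# validate_classification are illustrative assumptions.
ALLOWED = {
    "category": {"billing", "technical", "account", "feature_request", "other"},
    "priority": {"low", "medium", "high", "urgent"},
    "sentiment": {"positive", "neutral", "negative", "frustrated"},
    "suggested_response_type": {
        "knowledge_base_reply", "human_escalation", "automated_ack",
    },
}


def validate_classification(obj):
    """Return a list of problems; an empty list means the object fits the schema."""
    if not isinstance(obj, dict):
        return ["expected a JSON object"]
    problems = []
    for field, allowed in ALLOWED.items():
        value = obj.get(field)
        # Non-string values (lists, numbers, None) are rejected before the set lookup.
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    conf = obj.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append(f"confidence: expected a number in [0.0, 1.0], got {conf!r}")
    extra = set(obj) - set(ALLOWED) - {"confidence"}
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that was wrong with a given output.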
Step 2: Write the role + scope
You are a triage assistant at Acme, a B2B analytics platform.
Your job is to classify incoming support emails into the structured
schema below. You are rigorous and don't guess — when a message is
ambiguous, lean toward the lower-priority, higher-escalation option.
Note the explicit tie-breaker rule ("when ambiguous, lean toward…"). That's worth ten hedging sentences.
Step 3: Add rules that encode your business policy
Priority rules:
- Anything mentioning production outage, data loss, or billing error → urgent.
- Anything blocking a specific user's work → high.
- Feature requests → low unless customer is enterprise tier.
Sentiment rules:
- Frustrated means more than negative — repeated, escalating, or cursing.
Response-type rules:
- Automated ack only for clearly-FAQ-tier questions. When in doubt → human.
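Because these rules change whenever policy changes, it can help to keep them as data and render them into the prompt, so that every edit shows up in version control. The sketch below assumes that layout; RULES, render_rules, and build_system_prompt are hypothetical names, and the rule text is taken from this step.

```python
# Rule text comes from this step. The dictionary layout and both helper
# functions are illustrative assumptions, not a prescribed structure.
RULES = {
    "Priority rules": [
        "Anything mentioning production outage, data loss, or billing error → urgent.",
        "Anything blocking a specific user's work → high.",
        "Feature requests → low unless customer is enterprise tier.",
    ],
    "Sentiment rules": [
        "Frustrated means more than negative: repeated, escalating, or cursing.",
    ],
    "Response-type rules": [
        "Automated ack only for clearly-FAQ-tier questions. When in doubt → human.",
    ],
}


def render_rules(rules):
    """Render rule groups as the plain-text block that goes into the prompt."""
    lines = []
    for heading, items in rules.items():
        lines.append(f"{heading}:")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


def build_system_prompt(role, rules, format_instruction):
    """Join role, rules, and format instruction, separated by blank lines."""
    parts = [role.strip(), render_rules(rules), format_instruction.strip()]
    return "\n\n".join(parts)
```

Calling build_system_prompt with the role text from Step 2 and a format instruction produces one string whose sections can each be reviewed and changed independently.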
Step 4: Add three diverse few-shot examples
One clear case, one tricky case, one near-miss. These cost tokens, but they can be the difference between 85% and 95% accuracy on the tricky category.
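As a rough illustration, the three examples could be stored next to the prompt and rendered into it. Everything in the sketch below is invented for illustration: the messages, their labels, and the names FEW_SHOT and render_examples. Real examples should come from hand-labeled tickets.

```python
import json

# All three messages and labels are hypothetical, chosen to follow the
# business rules in Step 3. They are not real tickets.
FEW_SHOT = [
    {   # Clear case: a plain how-to question.
        "message": "How do I add a teammate to our workspace?",
        "label": {"category": "account", "priority": "low",
                  "sentiment": "neutral",
                  "suggested_response_type": "knowledge_base_reply",
                  "confidence": 0.95},
    },
    {   # Tricky case: casual tone, but a billing error is urgent by policy.
        "message": "We were charged twice this month. No rush, just flagging it.",
        "label": {"category": "billing", "priority": "urgent",
                  "sentiment": "neutral",
                  "suggested_response_type": "human_escalation",
                  "confidence": 0.85},
    },
    {   # Near-miss: looks like an FAQ, but the user is blocked and has retried.
        "message": "The password reset email never arrives. I have tried three times.",
        "label": {"category": "account", "priority": "high",
                  "sentiment": "frustrated",
                  "suggested_response_type": "human_escalation",
                  "confidence": 0.8},
    },
]


def render_examples(examples):
    """Format each example as a message followed by its JSON classification."""
    blocks = [
        f"Message: {ex['message']}\nClassification: {json.dumps(ex['label'])}"
        for ex in examples
    ]
    return "\n\n".join(blocks)
```

Serializing the labels with json.dumps keeps the examples in exactly the output format the model is asked to produce.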
Step 5: Specify the output format strictly
Return only a JSON object matching the schema. No prose before or after.
If you have schema-constrained generation available (OpenAI response_format, Anthropic tool use), use it. Belt-and-suspenders: the format instruction plus the schema constraint.
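Even with both safeguards, the calling code should assume the output can be malformed. The sketch below shows one possible strict parser using only the standard library; parse_strict and parse_or_escalate are hypothetical helpers, and routing unparseable output to a human is an assumption in the spirit of "when in doubt → human", not something the format instruction guarantees.

```python
import json


def parse_strict(raw):
    """Parse model output that must be a single JSON object and nothing else.

    json.loads tolerates surrounding whitespace but rejects any prose before
    or after the JSON document, which matches the format instruction.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"output is not a bare JSON document: {err}") from None
    if not isinstance(obj, dict):
        raise ValueError("output is valid JSON but not an object")
    return obj


def parse_or_escalate(raw):
    """Return (classification, None) on success or (None, reason) on failure.

    Assumption: a failed parse is sent to human review instead of being retried.
    """
    try:
        return parse_strict(raw), None
    except ValueError as err:
        return None, f"human_review: {err}"
```

A caller would check the second element of the result and send the message to the human queue whenever it is not None.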
Step 6: Build an eval set
Collect 100-200 real messages, hand-labeled by someone who knows the domain. Hold them out, and run every prompt change against them before shipping. Track:
- Precision/recall per category.
- A "ringers" subset — 20 messages that previously broke the system. They must never regress.
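Both checks are simple enough to compute without a framework. The sketch below is one possible version, assuming gold and predicted labels arrive as parallel lists and that classify is whatever function wraps the model call; per_label_precision_recall and ringer_regressions are hypothetical names.

```python
from collections import Counter


def per_label_precision_recall(gold, predicted):
    """Compute precision and recall for each label from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label p, but it was wrong
            fn[g] += 1   # gold label g was missed
    report = {}
    for label in sorted(set(gold) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        report[label] = {
            # None marks an undefined ratio (label never predicted or never gold).
            "precision": tp[label] / p_den if p_den else None,
            "recall": tp[label] / r_den if r_den else None,
        }
    return report


def ringer_regressions(ringers, classify):
    """Return the ringer messages whose prediction no longer matches.

    ringers: list of (message, expected_label) pairs.
    classify: a callable taking a message and returning a label (assumed).
    """
    return [msg for msg, expected in ringers if classify(msg) != expected]
```

A prompt change would be blocked from shipping whenever ringer_regressions returns a non-empty list, regardless of how the aggregate numbers moved.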
Step 7: Instrument production
- Log every (input, classification, confidence, actually-taken-action).
- Sample 1% for human review weekly.
- Any classification below confidence 0.7 → automatic human review.
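These three rules could be wired together roughly as follows. This is a sketch under stated assumptions: the 0.7 threshold and 1% rate come from the list above, per-message random sampling is used as a stand-in for a weekly 1% sample, and build_log_record and review_reason are hypothetical helpers.

```python
import json
import random

REVIEW_THRESHOLD = 0.7   # from the rule above
SAMPLE_RATE = 0.01       # 1% of traffic, sampled per message (an assumption)


def build_log_record(message, classification, action_taken):
    """Serialize one (input, classification, confidence, action) tuple as a JSON line."""
    return json.dumps({
        "input": message,
        "classification": classification,
        "confidence": classification["confidence"],
        "action_taken": action_taken,
    }, sort_keys=True)


def review_reason(classification, rng=random):
    """Return why a message needs human review, or None if it does not.

    rng is any object with a random() method returning a float in [0, 1);
    it is injectable so that the sampling can be tested deterministically.
    """
    if classification["confidence"] < REVIEW_THRESHOLD:
        return "low_confidence"
    if rng.random() < SAMPLE_RATE:
        return "random_sample"
    return None
```

Recording the action actually taken alongside the prediction is what later makes it possible to measure how often humans overrode the classifier.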
What production taught this prompt
After 3 months, the "other" category had grown too large (a common failure mode: the classifier dumps ambiguous cases there). We added a sixth category, "integration," and updated the few-shot examples to cover it.
After 6 months, a specific customer wording pattern was reliably misclassified. We added it to the examples and added the failing messages to the regression set so the fix cannot silently regress.
The capstone lesson
Prompt engineering in production is iterative and boring. There's no magic prompt. There's a prompt + eval set + monitoring + a habit of returning to improve it. The teams that do this well treat the prompt as code. The teams that treat prompts as one-off text strings get surprised.
Check your understanding
Q1. What piece is almost always missing from 'prototype' prompts but essential for production ones?
Q2. The tie-breaker rule 'when ambiguous, lean toward the lower-priority, higher-escalation option' encodes…