Capstone: a production-grade prompt from scratch

A production-ready prompt isn't something you write — it's something you build. This lesson walks through assembling one, end-to-end, for a realistic task.

The task

Build a classifier that tags incoming customer support messages with:

category (billing, technical, account, feature_request, other)
priority (low, medium, high, urgent)
sentiment (positive, neutral, negative, frustrated)
suggested_response_type (knowledge_base_reply, human_escalation, automated_ack)

Deployed to a real support queue. Every misclassification has a human cost.

Step 1: Start with the output schema

Schema-first prompting is the cleanest path. Define the exact JSON shape you want before writing any prose:

{
  "category": "billing | technical | account | feature_request | other",
  "priority": "low | medium | high | urgent",
  "sentiment": "positive | neutral | negative | frustrated",
  "suggested_response_type": "knowledge_base_reply | human_escalation | automated_ack",
  "confidence": 0.0–1.0
}

Step 2: Write the role + scope

You are a triage assistant at Acme, a B2B analytics platform.
Your job is to classify incoming support emails into the structured
schema below. You are rigorous and don't guess — when a message is
ambiguous, lean toward the lower-priority, higher-escalation option.

Note the explicit tie-breaker rule ("when ambiguous, lean toward…"). That's worth ten hedging sentences.

Step 3: Add rules that encode your business policy

Priority rules:
- Anything mentioning production outage, data loss, or billing error → urgent.
- Anything blocking a specific user's work → high.
- Feature requests → low unless customer is enterprise tier.

Sentiment rules:
- Frustrated means more than negative — repeated, escalating, or cursing.

Response-type rules:
- Automated ack only for clearly-FAQ-tier questions. When in doubt → human.

Step 4: Add three diverse few-shot examples

One clear case, one tricky case, one near-miss. These cost tokens — but they're the difference between 85% and 95% accuracy on the tricky category.

Step 5: Specify the output format strictly

Return only a JSON object matching the schema. No prose before or after.

If you have schema-constrained generation available (OpenAI response_format, Anthropic tool use), use it. Belt-and-suspenders: the format instruction plus the schema constraint.

Step 6: Build an eval set

100-200 real messages, hand-labeled by someone who knows the domain. Hold them out. Run every prompt change against them before shipping.

Precision/recall per category.
A "ringers" subset — 20 messages that previously broke the system. They must never regress.

Step 7: Instrument production

Log every (input, classification, confidence, actually-taken-action).
Sample 1% for human review weekly.
Any classification below confidence 0.7 → automatic human review.

What production taught this prompt

After 3 months, the "other" category grew too large (a common failure mode — classifier dumps ambiguous cases there). We added a 6th category, "integration," and retrained the prompt examples.

After 6 months, a specific customer wording pattern was reliably misclassified. We added it to the examples. Kept the fix on the regression set.

The capstone lesson

Prompt engineering in production is iterative and boring. There's no magic prompt. There's a prompt + eval set + monitoring + a habit of returning to improve it. The teams that do this well treat the prompt as code. The teams that treat prompts as one-off text strings get surprised.

A production-ready prompt isn't something you write — it's something you build. This lesson walks through assembling one, end-to-end, for a realistic task.

The task

Build a classifier that tags incoming customer support messages with:

category (billing, technical, account, feature_request, other)
priority (low, medium, high, urgent)
sentiment (positive, neutral, negative, frustrated)
suggested_response_type (knowledge_base_reply, human_escalation, automated_ack)

Deployed to a real support queue. Every misclassification has a human cost.

Step 1: Start with the output schema

Schema-first prompting is the cleanest path. Define the exact JSON shape you want before writing any prose:

{
  "category": "billing | technical | account | feature_request | other",
  "priority": "low | medium | high | urgent",
  "sentiment": "positive | neutral | negative | frustrated",
  "suggested_response_type": "knowledge_base_reply | human_escalation | automated_ack",
  "confidence": 0.0–1.0
}

Step 2: Write the role + scope

You are a triage assistant at Acme, a B2B analytics platform.
Your job is to classify incoming support emails into the structured
schema below. You are rigorous and don't guess — when a message is
ambiguous, lean toward the lower-priority, higher-escalation option.

Note the explicit tie-breaker rule ("when ambiguous, lean toward…"). That's worth ten hedging sentences.

Step 3: Add rules that encode your business policy

Priority rules:
- Anything mentioning production outage, data loss, or billing error → urgent.
- Anything blocking a specific user's work → high.
- Feature requests → low unless customer is enterprise tier.

Sentiment rules:
- Frustrated means more than negative — repeated, escalating, or cursing.

Response-type rules:
- Automated ack only for clearly-FAQ-tier questions. When in doubt → human.

Step 4: Add three diverse few-shot examples

One clear case, one tricky case, one near-miss. These cost tokens — but they're the difference between 85% and 95% accuracy on the tricky category.

Step 5: Specify the output format strictly

Return only a JSON object matching the schema. No prose before or after.

If you have schema-constrained generation available (OpenAI response_format, Anthropic tool use), use it. Belt-and-suspenders: the format instruction plus the schema constraint.

Step 6: Build an eval set

100-200 real messages, hand-labeled by someone who knows the domain. Hold them out. Run every prompt change against them before shipping.

Precision/recall per category.
A "ringers" subset — 20 messages that previously broke the system. They must never regress.

Step 7: Instrument production

Log every (input, classification, confidence, actually-taken-action).
Sample 1% for human review weekly.
Any classification below confidence 0.7 → automatic human review.

What production taught this prompt

After 3 months, the "other" category grew too large (a common failure mode — classifier dumps ambiguous cases there). We added a 6th category, "integration," and retrained the prompt examples.

After 6 months, a specific customer wording pattern was reliably misclassified. We added it to the examples. Kept the fix on the regression set.

Capstone: a production-grade prompt from scratch

The task

Step 1: Start with the output schema

Step 2: Write the role + scope

Step 3: Add rules that encode your business policy

Step 4: Add three diverse few-shot examples

Step 5: Specify the output format strictly

Step 6: Build an eval set

Step 7: Instrument production

What production taught this prompt

The capstone lesson

2-question self-check

Continue in this track

Capstone: a production-grade prompt from scratch

The task

Step 1: Start with the output schema

Step 2: Write the role + scope

Step 3: Add rules that encode your business policy

Step 4: Add three diverse few-shot examples

Step 5: Specify the output format strictly

Step 6: Build an eval set

Step 7: Instrument production

What production taught this prompt

The capstone lesson

2-question self-check

Continue in this track