Version-controlling prompts like code
Git workflows, review rituals, and rollback patterns for prompt files.
Production prompts are code. They need review, versioning, tests, rollback. Treating them as text files in Notion is how teams end up reverting production at midnight.
The maturity ladder
Where does your team live today?
- Level 0: Prompts in Python/TS string literals, scattered through the codebase. Changes merged without special review.
- Level 1: Prompts extracted to dedicated files (.prompt.md, .prompt.ts) with a clear naming convention.
- Level 2: Each prompt has a version number, a changelog entry, and a linked eval run on each change.
- Level 3: Prompts are managed artifacts — versioned deployments, canaries, rollback, telemetry.
Most teams ship at Level 1 and assume Level 3 happens later. It doesn't, unless you force it.
Why version prompts
- Traceability. When a production incident starts, you need to know: which prompt version was live?
- Rollback. A bad prompt change is rolled back the same way as bad code — revert the commit, redeploy.
- A/B testing. Two versions in production, traffic split, measured outcomes.
- Eval history. "This prompt version scored 87% on the eval set — are we doing better now?"
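The A/B case above can be sketched with a deterministic hash-based traffic split. This is a minimal illustration, not a prescribed implementation: the version names and the 10% share are assumptions for the example.

```python
import hashlib

def pick_prompt_version(user_id: str, candidate: str = "v3", stable: str = "v2",
                        candidate_share: float = 0.10) -> str:
    """Route a stable fraction of users to the candidate prompt version.

    Hashing the user id (rather than random sampling per request) keeps
    each user on the same version across requests, so outcomes are
    attributable to one prompt version.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform-ish in [0, 1]
    return candidate if bucket < candidate_share else stable
```

Because the split is deterministic, rolling the candidate share up or down is a config change, not a data migration.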
The file-based pattern
Keep each prompt in its own file:
prompts/
triage-classifier.v2.md
triage-classifier.v1.md ← retained for reference
summarize-meeting.v1.md
changelog.md
The version suffix is deliberate. When you bump to v3, you keep v2 on disk until it stops serving any traffic. That makes rollback a one-line config change.
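The "one line of config" can be sketched as a mapping from prompt name to active version. The directory layout mirrors the tree above; the `ACTIVE` dict and function names are illustrative assumptions.

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")

# The "one line of config": which version of each prompt is live.
# Rolling back triage-classifier means changing "v2" back to "v1".
ACTIVE = {
    "triage-classifier": "v2",
    "summarize-meeting": "v1",
}

def load_prompt(name: str) -> str:
    """Load the currently active version of a named prompt."""
    version = ACTIVE[name]
    return (PROMPT_DIR / f"{name}.{version}.md").read_text()
```

Because old versions stay on disk, the rollback edit touches only `ACTIVE`, never the prompt files themselves.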
Frontmatter for metadata
Use YAML frontmatter for machine-readable context:
---
name: triage-classifier
version: 2.0.0
owner: "@team-support"
eval: evals/triage-classifier.yml
description: Classifies inbound support messages.
---
You are a triage assistant at Acme…
This enables tooling (eval runner, linter, deployment system) without clutter.
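A tooling sketch for splitting frontmatter from the prompt body, kept to the standard library rather than a full YAML parser. It handles only flat `key: value` pairs, which covers the fields shown above; the function name is an assumption.

```python
def parse_prompt_file(text: str) -> tuple[dict, str]:
    """Split a prompt file into (metadata dict, prompt body).

    Only flat `key: value` frontmatter is handled; a real eval runner
    would likely use a YAML library instead.
    """
    if not text.startswith("---\n"):
        return {}, text  # no frontmatter block
    header, _, body = text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")
```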
Prompt review in PRs
Prompts should be diffed and reviewed like code. Two changes to watch for:
- Subtle rule changes. A single rule removed or reworded. Reviewer should ask: what does this affect, and did evals cover it?
- Large rewrites. Easier to review the diff in two steps: structural changes first, then content changes.
Prompts benefit from review by someone who didn't write them. The writer always thinks it's clear.
Deployment patterns
- Baked into binary. Simplest. Change = code deploy. Safe and slow.
- Dynamic config. Prompt loaded at request time from a config store. Hot-reload. Can ship prompt changes without redeploying code.
- Managed service. Some teams run prompts through a platform (PromptLayer, Humanloop, LangSmith) that handles deployment + A/B.
Choose based on how often prompts change. If you update daily, baking prompts into the binary is painful. If weekly or less, bake-in is fine and simpler.
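The dynamic-config option can be sketched as a loader that re-reads the prompt file when its modification time changes, so prompt edits take effect without a code deploy. The class and file path here are hypothetical, not from any particular platform.

```python
from pathlib import Path

class PromptCache:
    """Serve a prompt from disk, reloading when the file changes.

    Checking mtime on each request gives hot-reload semantics without
    re-reading the file on every call.
    """

    def __init__(self, path: str):
        self.path = Path(path)
        self._mtime: int | None = None
        self._text: str = ""

    def get(self) -> str:
        mtime = self.path.stat().st_mtime_ns
        if mtime != self._mtime:  # file changed on disk: reload it
            self._text = self.path.read_text()
            self._mtime = mtime
        return self._text
```

A production version would add error handling for missing files and likely poll a config store rather than local disk, but the reload-on-change shape is the same.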
What the changelog should say
## v2.0.0 — 2026-04-12
- Add "frustrated" as distinct from "negative" sentiment.
- Tightened priority rule for billing issues.
- Eval: precision +3pp, recall unchanged. Ringers all passing.
Link to the eval run. Link to the PR. Future-you will thank present-you.