Version-controlling prompts like code
Git workflows, review rituals, and rollback patterns for prompt files.
Production prompts are code. They need review, versioning, tests, rollback. Treating them as text files in Notion is how teams end up reverting production at midnight.
The maturity ladder
Where does your team live today?
- Level 0: Prompts in Python/TS string literals, scattered through the codebase. Changes merged without special review.
- Level 1: Prompts extracted to dedicated files (.prompt.md, .prompt.ts) with a clear naming convention.
- Level 2: Each prompt has a version number, a changelog entry, and a linked eval run on each change.
- Level 3: Prompts are managed artifacts — versioned deployments, canaries, rollback, telemetry.
Most teams ship at Level 1 and assume Level 3 happens later. It doesn't, unless you force it.
Why version prompts
- Traceability. When a production incident starts, you need to know: which prompt version was live?
- Rollback. A bad prompt change is rolled back the same way as bad code — revert the commit, redeploy.
- A/B testing. Two versions in production, traffic split, measured outcomes.
- Eval history. "This prompt version scored 87% on the eval set — are we doing better now?"
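The A/B case above can be sketched with a deterministic hash-based traffic split. This is a minimal illustration, not a prescribed implementation: the version names and the 10% share are assumptions for the example.

```python
import hashlib

def pick_prompt_version(user_id: str, candidate: str = "v3", stable: str = "v2",
                        candidate_share: float = 0.10) -> str:
    """Route a stable fraction of users to the candidate prompt version.

    Hashing the user id (rather than random sampling per request) keeps
    each user on the same version across requests, so outcomes are
    attributable to one prompt version.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform-ish in [0, 1]
    return candidate if bucket < candidate_share else stable
```

Because the split is deterministic, rolling the candidate share up or down is a config change, not a data migration.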
The file-based pattern
Keep each prompt in its own file:
prompts/
triage-classifier.v2.md
triage-classifier.v1.md ← retained for reference
summarize-meeting.v1.md
changelog.md
The version suffix is deliberate. When you bump to v3, you keep v2 on disk until it stops serving any traffic. That makes rollback a one-line config change.
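The "one line of config" can be sketched as a mapping from prompt name to active version. The directory layout mirrors the tree above; the `ACTIVE` dict and function names are illustrative assumptions.

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")

# The "one line of config": which version of each prompt is live.
# Rolling back triage-classifier means changing "v2" back to "v1".
ACTIVE = {
    "triage-classifier": "v2",
    "summarize-meeting": "v1",
}

def load_prompt(name: str) -> str:
    """Load the currently active version of a named prompt."""
    version = ACTIVE[name]
    return (PROMPT_DIR / f"{name}.{version}.md").read_text()
```

Because old versions stay on disk, the rollback edit touches only `ACTIVE`, never the prompt files themselves.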
Frontmatter for metadata
Use YAML frontmatter for machine-readable context:
---
name: triage-classifier
version: 2.0.0
owner: "@team-support"
eval: evals/triage-classifier.yml
description: Classifies inbound support messages.
---
You are a triage assistant at Acme…
This enables tooling (eval runner, linter, deployment system) without clutter.
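A tooling sketch for splitting frontmatter from the prompt body, kept to the standard library rather than a full YAML parser. It handles only flat `key: value` pairs, which covers the fields shown above; the function name is an assumption.

```python
def parse_prompt_file(text: str) -> tuple[dict, str]:
    """Split a prompt file into (metadata dict, prompt body).

    Only flat `key: value` frontmatter is handled; a real eval runner
    would likely use a YAML library instead.
    """
    if not text.startswith("---\n"):
        return {}, text  # no frontmatter block
    header, _, body = text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")
```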
Prompt review in PRs
Prompts should be diffed and reviewed like code. Two changes to watch for:
- Subtle rule changes. A single rule removed or reworded. Reviewer should ask: what does this affect, and did evals cover it?
- Large rewrites. Easier to review the diff in two steps: structural changes first, then content changes.
Prompts benefit from review by someone who didn't write them. The writer always thinks it's clear.
Deployment patterns
- Baked into binary. Simplest. Change = code deploy. Safe and slow.
- Dynamic config. Prompt loaded at request time from a config store. Hot-reload. Can ship prompt changes without redeploying code.
- Managed service. Some teams run prompts through a platform (PromptLayer, Humanloop, LangSmith) that handles deployment + A/B.
Choose based on how often prompts change. If you update daily, baking prompts into the binary is painful. If weekly or less, bake-in is fine and simpler.
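The dynamic-config option can be sketched as a loader that re-reads the prompt file when its modification time changes, so prompt edits take effect without a code deploy. The class and file path here are hypothetical, not from any particular platform.

```python
from pathlib import Path

class PromptCache:
    """Serve a prompt from disk, reloading when the file changes.

    Checking mtime on each request gives hot-reload semantics without
    re-reading the file on every call.
    """

    def __init__(self, path: str):
        self.path = Path(path)
        self._mtime: int | None = None
        self._text: str = ""

    def get(self) -> str:
        mtime = self.path.stat().st_mtime_ns
        if mtime != self._mtime:  # file changed on disk: reload it
            self._text = self.path.read_text()
            self._mtime = mtime
        return self._text
```

A production version would add error handling for missing files and likely poll a config store rather than local disk, but the reload-on-change shape is the same.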
What the changelog should say
## v2.0.0 — 2026-04-12
- Add "frustrated" as distinct from "negative" sentiment.
- Tightened priority rule for billing issues.
- Eval: precision +3pp, recall unchanged. Ringers all passing.
Link to the eval run. Link to the PR. Future-you will thank present-you.