Prompt caching: latency, cost, and correctness
What to cache, what to vary, and the failure modes cache introduces.
Every major provider now offers prompt caching. If you're not using it, you're overpaying by 40-90% on repeated-context workloads. Here's how it works and where it matters.
What prompt caching is
When you send a long system prompt or a large chunk of context that's stable across many calls, providers can cache the computation at some internal checkpoint. Subsequent requests that share the same prefix start from the cache instead of recomputing.
The savings are in two dimensions:
- Cost: cached input tokens are typically 10-90% cheaper than fresh.
- Latency: cached responses start generating faster because the model skips re-processing cached context.
Provider-specific details (as of 2026)
- OpenAI auto-caches prefixes of ≥1024 tokens on supported models. ~50% cost reduction on cache hits.
- Anthropic requires explicit cache control markers in the request. Up to 90% cost reduction. 5-minute TTL default, longer with extended cache.
- Google Gemini offers cached content as a managed resource with API-level lifecycle. Longer TTLs available.
Check current docs — pricing and TTLs shift.
Where caching earns its keep
- RAG with a stable system prompt. System prompt + tool definitions are identical per query. Cache them.
- Long-running chat. The conversation prefix up to "now" is stable; cache it.
- Agent loops. Each tool call sends the same system + task prompt with a growing scratchpad. Cache the stable head.
- Document Q&A. The document is in context for many questions. Cache the document.
Where caching doesn't help
- One-shot generation where every call has different context. No shared prefix, nothing to cache.
- Very short prompts. Caching overhead eats any savings.
- Highly-varied personalization where the system prompt changes per user. Unless you can isolate the personalized part to the tail, the whole thing misses cache.
Design prompts for cacheability
The rule: put stable context at the start, variable content at the end.
Order:
- System prompt (stable)
- Tool definitions (stable)
- Reference documents (stable or quasi-stable)
- Few-shot examples (stable)
- Conversation history (stable up to "now")
- Current user message (variable)
Flipping any of this so the variable content is earlier means everything after it misses cache.
The monitoring question
Measure cache hit rate per endpoint. If it's below 60% for an endpoint you designed for caching, something's quietly varying — log the prefix hash and diff between two recent requests to find it.
Common drift sources: timestamps in system prompts, user IDs embedded mid-prompt, non-deterministic document ordering in RAG.
Correctness cost
Cached prompts produce identical first-phase output to un-cached (that's the point). But:
- Cache TTLs expire. Your first request after TTL pays the full fresh cost. Plan for it.
- Cache misses due to tiny changes. Even a single-token difference in the prefix invalidates cache. Be deliberate about the boundary.
Check your understanding
2-question self-check
Optional. Your answers feed your knowledge score on the track certificate.
Q1.To maximize cache hit rate, you should…
Q2.What's a common hidden cache-miss cause in production?
Continue in this track
More lessons from Prompt Engineering Mastery.
Lesson 10
Capstone: a production-grade prompt from scratch
Assemble everything in a single, production-ready prompt with evals.
Lesson 11
Multi-modal prompting: images, audio, structured inputs
How to prompt vision and audio models without losing the thread.
Lesson 13
Version-controlling prompts like code
Git workflows, review rituals, and rollback patterns for prompt files.