Prompt caching: latency, cost, and correctness

Every major provider now offers prompt caching. If you're not using it, you're overpaying by 40-90% on repeated-context workloads. Here's how it works and where it matters.

What prompt caching is

When you send a long system prompt or a large chunk of context that's stable across many calls, providers can cache the computation at some internal checkpoint. Subsequent requests that share the same prefix start from the cache instead of recomputing.

The savings are in two dimensions:

Cost: cached input tokens are typically 10-90% cheaper than fresh.
Latency: cached responses start generating faster because the model skips re-processing cached context.

Provider-specific details (as of 2026)

OpenAI auto-caches prefixes of ≥1024 tokens on supported models. ~50% cost reduction on cache hits.
Anthropic requires explicit cache control markers in the request. Up to 90% cost reduction. 5-minute TTL default, longer with extended cache.
Google Gemini offers cached content as a managed resource with API-level lifecycle. Longer TTLs available.

Check current docs — pricing and TTLs shift.

Where caching earns its keep

RAG with a stable system prompt. System prompt + tool definitions are identical per query. Cache them.
Long-running chat. The conversation prefix up to "now" is stable; cache it.
Agent loops. Each tool call sends the same system + task prompt with a growing scratchpad. Cache the stable head.
Document Q&A. The document is in context for many questions. Cache the document.

Where caching doesn't help

One-shot generation where every call has different context. No shared prefix, nothing to cache.
Very short prompts. Caching overhead eats any savings.
Highly-varied personalization where the system prompt changes per user. Unless you can isolate the personalized part to the tail, the whole thing misses cache.

Design prompts for cacheability

The rule: put stable context at the start, variable content at the end.

Order:

System prompt (stable)
Tool definitions (stable)
Reference documents (stable or quasi-stable)
Few-shot examples (stable)
Conversation history (stable up to "now")
Current user message (variable)

Flipping any of this so the variable content is earlier means everything after it misses cache.

The monitoring question

Measure cache hit rate per endpoint. If it's below 60% for an endpoint you designed for caching, something's quietly varying — log the prefix hash and diff between two recent requests to find it.

Common drift sources: timestamps in system prompts, user IDs embedded mid-prompt, non-deterministic document ordering in RAG.

Correctness cost

Cached prompts produce identical first-phase output to un-cached (that's the point). But:

Cache TTLs expire. Your first request after TTL pays the full fresh cost. Plan for it.
Cache misses due to tiny changes. Even a single-token difference in the prefix invalidates cache. Be deliberate about the boundary.

Every major provider now offers prompt caching. If you're not using it, you're overpaying by 40-90% on repeated-context workloads. Here's how it works and where it matters.

What prompt caching is

The savings are in two dimensions:

Cost: cached input tokens are typically 10-90% cheaper than fresh.
Latency: cached responses start generating faster because the model skips re-processing cached context.

Provider-specific details (as of 2026)

OpenAI auto-caches prefixes of ≥1024 tokens on supported models. ~50% cost reduction on cache hits.
Anthropic requires explicit cache control markers in the request. Up to 90% cost reduction. 5-minute TTL default, longer with extended cache.
Google Gemini offers cached content as a managed resource with API-level lifecycle. Longer TTLs available.

Check current docs — pricing and TTLs shift.

Where caching earns its keep

RAG with a stable system prompt. System prompt + tool definitions are identical per query. Cache them.
Long-running chat. The conversation prefix up to "now" is stable; cache it.
Agent loops. Each tool call sends the same system + task prompt with a growing scratchpad. Cache the stable head.
Document Q&A. The document is in context for many questions. Cache the document.

Where caching doesn't help

One-shot generation where every call has different context. No shared prefix, nothing to cache.
Very short prompts. Caching overhead eats any savings.
Highly-varied personalization where the system prompt changes per user. Unless you can isolate the personalized part to the tail, the whole thing misses cache.

Design prompts for cacheability

The rule: put stable context at the start, variable content at the end.

Order:

System prompt (stable)
Tool definitions (stable)
Reference documents (stable or quasi-stable)
Few-shot examples (stable)
Conversation history (stable up to "now")
Current user message (variable)

Flipping any of this so the variable content is earlier means everything after it misses cache.

The monitoring question

Measure cache hit rate per endpoint. If it's below 60% for an endpoint you designed for caching, something's quietly varying — log the prefix hash and diff between two recent requests to find it.

Common drift sources: timestamps in system prompts, user IDs embedded mid-prompt, non-deterministic document ordering in RAG.

Correctness cost

Cached prompts produce identical first-phase output to un-cached (that's the point). But:

Cache TTLs expire. Your first request after TTL pays the full fresh cost. Plan for it.
Cache misses due to tiny changes. Even a single-token difference in the prefix invalidates cache. Be deliberate about the boundary.

Prompt caching: latency, cost, and correctness

What prompt caching is

Provider-specific details (as of 2026)

Where caching earns its keep

Where caching doesn't help

Design prompts for cacheability

The monitoring question

Correctness cost

2-question self-check

Continue in this track

Prompt caching: latency, cost, and correctness

What prompt caching is

Provider-specific details (as of 2026)

Where caching earns its keep

Where caching doesn't help

Design prompts for cacheability

The monitoring question

Correctness cost

2-question self-check

Continue in this track