Cost optimization without sacrificing quality
Where AI spend actually goes and where you can cut without regret.
Every AI system has 3-5x cost bloat that can be cut without quality loss. Not by using cheaper models — by being smarter about what you spend tokens on.
Audit first: where are your tokens going?
Before optimizing, instrument. You need per-request:
- Input tokens and output tokens.
- Which stage (retrieval context? system prompt? chat history? user input?).
- Which model.
One week of data answers "where's the money going" in a way intuition can't.
The seven levers, roughly in order of impact
1. Prompt length. Most teams have system prompts that grew over time. Strip them. Often 30-40% of input tokens per call.
2. Context dilution. RAG returning too many chunks; chat history carried forever. Typical waste: 40% of input tokens on content the model never uses for this query.
3. Output verbosity. Model writes 400 words when 100 would do. Add an explicit length cap in the prompt. Easy 20-40% savings on generation.
4. Model selection. Not every task needs the frontier. Classification on GPT-4o vs GPT-4o-mini: 15-20× cost difference, often within 2-3% accuracy. Routing dispatches simple tasks to cheap models.
5. Caching. Prompt caching on repeated contexts saves 50-90% on qualifying calls. Shockingly few teams use it.
6. Batching. Non-urgent work can use provider batch APIs. OpenAI's batch API is 50% cheaper, 24-hour turnaround.
7. Avoiding calls entirely. Can a task be done without AI? Deterministic rules first; AI fallback. Often the bulk of savings.
The specific wins
The "Is this task simple?" classifier. Upfront cheap classifier routes: 60% to a small model, 30% to mid-tier, 10% to frontier. Saves 50-70% on total compute with minimal quality loss.
Chunk compression for RAG. Instead of sending 10 chunks of 500 tokens each, compress them first (smaller model) into 2000 tokens. Works for read-heavy workloads.
Truncate chat history aggressively. Rolling summary + last 4 messages beats "keep everything" for most chat UX. Users rarely reference messages from 20 turns ago.
Short system prompts. Every extra token in your system prompt is paid on every call. A 1,500-token system prompt × 1M calls = 1.5B tokens × $5/M = $7,500 before any user content.
The cost-quality curve
Do the measurement: for each optimization, measure both the cost reduction and the quality delta on your eval set. The useful moves are Pareto-improving (cheaper and same quality, or slightly worse quality for much lower cost).
Stop optimizing when you're cutting quality to save pennies. Know your floor.
What not to do
- Aggressive quantization on self-hosted without eval. Goes wrong silently.
- Switching models on feel. Always eval before and after.
- Pre-optimizing. Ship with managed APIs, measure real usage, optimize what actually costs.
- Cache everything. Cache invalidation bugs cost more than cache savings. Be deliberate.
The monthly cost review
For any production AI system, monthly:
- Total spend, per model.
- Requests per user, per day — are power users dominating?
- Cost per successful outcome (not per request) — the metric that ties back to business value.
The meta-lesson
AI costs compound silently. A prompt that was fine at launch is expensive at 100x scale. Build the habit of measuring before it becomes an incident.
Check your understanding
2-question self-check
Optional. Your answers feed your knowledge score on the track certificate.
Q1.The highest-ROI cost optimizations usually come from…
Q2.Model routing means…
Continue in this track
More lessons from Deploying AI at Scale.
Lesson 2
Model serving architectures: what runs where
Managed APIs, self-hosted inference, and the hybrid middle ground.
Lesson 3
Designing AI APIs your team can actually use
Rate limits, idempotency, streaming — the API patterns that save you later.
Lesson 5
Observability for AI systems
What to trace, log, and alert on when the unit of work is a generation.