Scaling inference: the playbook at 10k, 100k, and 1M users
What breaks first, what to batch, and when to switch providers.
At 10k requests/day your managed API plan is fine. At 1M/day you need real architecture. Here's what breaks at each threshold and what to do about it.
10k requests/day (or less)
You're a happy managed-API customer. Any team of 1-5 engineers can run this with near-zero infra.
Watch for: cost per request, rate limits on spikes. Otherwise nothing exotic needed.
100k requests/day
First thing that breaks: provider rate limits during peak traffic. Talk to your account team and get limits raised. It's free and common.
Add: request queueing at your edge. When you hit a limit, queue briefly and retry rather than 500ing the user.
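A minimal in-process sketch of the "queue briefly and retry" idea, assuming the provider signals a rate limit with an exception. `RateLimitError`, `call_with_backoff`, and the delay values are hypothetical names and numbers chosen for illustration, not any provider's SDK; a real edge queue would also bound total wait time and queue depth.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 response."""


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call send(); on RateLimitError, wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable so the retry logic can be tested without real waiting.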
Consider: per-endpoint model routing. Don't use frontier for every task. Light classification goes to cheaper models.
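Per-endpoint routing can start as a plain lookup table. The sketch below assumes two model tiers; the model names and endpoint names are placeholders, not real model IDs.

```python
# Hypothetical model tiers; substitute the model IDs your provider offers.
FRONTIER = "frontier-model"
SMALL = "small-model"

# Hypothetical endpoints mapped to the cheapest tier that handles them well.
ROUTES = {
    "classify": SMALL,
    "extract_tags": SMALL,
    "chat": FRONTIER,
}


def pick_model(endpoint: str) -> str:
    """Return the model for an endpoint; unknown endpoints default to frontier."""
    return ROUTES.get(endpoint, FRONTIER)
```

Defaulting unknown endpoints to the frontier tier trades cost for safety: a new endpoint never silently gets a weaker model.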
500k-1M requests/day
Cost becomes a board-visible line item. Multiple optimizations pay for themselves:
- Caching. Prompt caching can cut the cost of stable prompt prefixes by 50-90%.
- Model routing. In a typical mix, about 60% of traffic can shift to a smaller model.
- Batch API for async workloads. About 50% cost reduction in exchange for up to 24h turnaround.
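To see what the caching figure means for the total bill, here is a rough cost model. All prices, token counts, and the function name `daily_cost` are assumptions for illustration. The point it shows: the caching discount applies only to the cached part of the input, so the saving on the whole bill is smaller than the headline number.

```python
def daily_cost(requests, in_tokens, out_tokens, in_price, out_price,
               cached_fraction=0.0, cache_discount=0.0):
    """Estimated dollars per day; prices are per million tokens.

    cached_fraction: share of input tokens that sit in a stable, cacheable prefix.
    cache_discount:  price reduction on those cached tokens (0.9 means 90% off).
    """
    effective_in = in_tokens * (1 - cached_fraction * cache_discount)
    token_cost = effective_in * in_price + out_tokens * out_price
    return requests * token_cost / 1_000_000


# Hypothetical workload: 1M requests/day, 2,000 input and 300 output tokens each.
baseline = daily_cost(1_000_000, 2000, 300, in_price=3.0, out_price=15.0)
with_cache = daily_cost(1_000_000, 2000, 300, in_price=3.0, out_price=15.0,
                        cached_fraction=0.75, cache_discount=0.9)
saving = 1 - with_cache / baseline
```

Under these assumed numbers the bill drops from about $10,500 to about $6,450 per day, a saving of roughly 39%, because output tokens and the uncached part of the prompt are still billed at full price.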
Infrastructure: usually still no self-hosting, but you want:
- Observability with per-endpoint breakdowns.
- Circuit breakers for when a provider degrades.
- Graceful failover between providers.
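The last two items can be sketched together: a circuit breaker per provider, and a failover loop that skips providers whose breaker is open. This is a minimal single-threaded illustration; the class, function names, and thresholds are assumptions, and a production version would need locking and metrics.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; permits a trial call
    once `cooldown` seconds have passed since it opened."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let one trial call through after the cooldown.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()


def call_with_failover(providers, prompt):
    """providers: list of (breaker, call) pairs in priority order."""
    last_error = None
    for breaker, call in providers:
        if not breaker.allow():
            continue  # provider is marked degraded; skip it
        try:
            result = call(prompt)
        except Exception as exc:
            breaker.record(False)
            last_error = exc
            continue
        breaker.record(True)
        return result
    raise RuntimeError("all providers unavailable") from last_error
```

Once a provider's breaker opens, requests go straight to the next provider instead of waiting on a failing call each time.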
2M-10M requests/day
You can now do the math on self-hosting vs managed. Usually:
- Frontier stays managed (you can't match their quality at reasonable cost).
- Commodity inference (classification, summarization, embeddings) moves to managed inference platforms or self-hosted.
- Hot tasks (retrieval, reranking, embeddings) get cached aggressively.
Engineering: a dedicated full-time infra engineer, and probably a platform team forming around AI infrastructure.
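"Doing the math" can start as a back-of-the-envelope comparison like the one below. Every number here (token price, GPU throughput, GPU hourly rate, utilization, staffing cost) is a made-up placeholder; the function names are also hypothetical. Plug in your own measurements before drawing conclusions.

```python
import math


def managed_monthly(tokens_per_day, price_per_million):
    """Monthly managed-API spend, assuming a 30-day month."""
    return tokens_per_day * 30 * price_per_million / 1_000_000


def gpus_needed(tokens_per_day, gpu_tokens_per_sec, utilization):
    """GPUs required if each sustains gpu_tokens_per_sec at the given utilization."""
    per_gpu_per_day = gpu_tokens_per_sec * 86_400 * utilization
    return math.ceil(tokens_per_day / per_gpu_per_day)


def self_hosted_monthly(tokens_per_day, gpu_tokens_per_sec, gpu_hourly,
                        utilization, staff_monthly):
    """Monthly GPU rental plus the engineering cost of running the fleet."""
    gpus = gpus_needed(tokens_per_day, gpu_tokens_per_sec, utilization)
    return gpus * gpu_hourly * 24 * 30 + staff_monthly


# Hypothetical commodity workload: 5M requests/day at 1,000 tokens each.
tokens = 5_000_000 * 1_000
managed = managed_monthly(tokens, price_per_million=0.50)
hosted = self_hosted_monthly(tokens, gpu_tokens_per_sec=2500, gpu_hourly=2.0,
                             utilization=0.5, staff_monthly=40_000)
```

With these placeholder inputs, managed comes to about $75,000 per month and self-hosting to about $107,680 (47 GPUs plus staffing), so managed still wins. The result is very sensitive to utilization and staffing cost, which is why the comparison needs measured numbers rather than guesses.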
50M+ requests/day
You have some combination of:
- Self-hosted inference on GPUs.
- Multiple providers as backends, routed dynamically.
- Embeddings and reranking running on-prem.
- Batch workloads on spot infrastructure.
- Fine-tuned models for specific tasks.
The operational complexity demands a real team: three or more infra engineers, an ML engineer, an SRE or two. This is a serious investment; the payoff is a cost per request that managed APIs can't match.
What breaks at each scale
- 10k: Nothing, usually.
- 100k: Rate limits; surprise cost spikes.
- 500k: Tail latency (p99s drift up under load); model routing complexity.
- 2M: Multi-provider orchestration; batch-vs-realtime tradeoff.
- 10M+: Cold starts; GPU availability; maintenance windows; team cost.
Capacity planning for AI
Two new variables vs traditional capacity planning:
- Tokens-per-request varies wildly — one user asks "What's 2+2?"; another pastes a 50k-token doc. Plan for the P95, not the average.
- Latency depends on the request: generating 100 tokens takes ~0.5s, while 4,000 tokens can take ~30s. Size your concurrency in tokens per second, not requests per second.
Model your infrastructure in tokens, not requests.
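A small sketch of token-based planning, following the two points above. The helper names are hypothetical. Sizing every request at the P95 is deliberately conservative; the in-flight estimate uses Little's law (arrival rate times time in system).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def required_tokens_per_sec(peak_requests_per_sec, output_token_samples, pct=95):
    """Throughput to provision if every request were as large as the pct-th percentile."""
    return peak_requests_per_sec * percentile(output_token_samples, pct)


def concurrent_requests(peak_requests_per_sec, seconds_per_request):
    """Little's law: requests in flight = arrival rate x time in system."""
    return peak_requests_per_sec * seconds_per_request
```

For example, at 10 requests/second where a long generation takes 30 seconds, `concurrent_requests(10, 30)` gives 300 requests in flight, far more than a requests-per-second figure alone suggests.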
Batching vs. streaming
- Realtime streaming: interactive chat, voice, anything user-facing. Managed APIs handle this well.
- Batch inference: tagging, classification, embedding generation at scale. Batch APIs exist for a reason — much cheaper.
- Continuous batching: a key optimization in self-hosted inference (vLLM, TGI). Requests join and leave a running batch at each decoding step, so different requests share the same model passes.
Choose per workload; don't mix paradigms in a single system.
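To see why continuous batching matters, here is a toy step-count simulation. It is not how vLLM or TGI are implemented; it assumes one token per request per decoding step, a fixed number of batch slots, and output lengths of at least 1. Function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """Continuous batching: a finished request frees its slot immediately and
    a waiting request takes it on the next decoding step."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1  # one decoding step produces one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps
```

With output lengths `[2, 8, 2, 8]` and 2 slots, the static version needs 16 steps while the continuous version needs 12, because short requests no longer hold a slot idle while a long one finishes.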
What to remember
Most teams over-engineer inference infrastructure too early. Three rules:
- Stay managed as long as economics allow. Moving off managed costs engineer-months in exchange for per-token savings. Measure before jumping.
- Optimize prompts before infrastructure. A 40% shorter prompt beats a 20% cheaper backend.
- Instrument first. You can't scale what you can't see.
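The second rule can be checked with simple arithmetic. The sketch below uses hypothetical prices and token counts, and assumes the cheaper backend cuts input and output prices by the same 20%. Under that assumption, the shorter prompt wins whenever input tokens account for more of the spend than output tokens.

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost of one request in dollars; prices are per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


def compare(in_tokens, out_tokens, in_price, out_price,
            prompt_cut=0.4, backend_cut=0.2):
    """Return (cost with a shorter prompt, cost with a cheaper backend)."""
    shorter = request_cost(in_tokens * (1 - prompt_cut), out_tokens,
                           in_price, out_price)
    cheaper = request_cost(in_tokens, out_tokens,
                           in_price, out_price) * (1 - backend_cut)
    return shorter, cheaper


# Prompt-heavy request: input spend exceeds output spend, so the shorter prompt wins.
prompt_heavy = compare(2000, 300, in_price=3.0, out_price=15.0)
# Output-heavy request: output spend dominates, so the cheaper backend wins.
output_heavy = compare(500, 1000, in_price=3.0, out_price=15.0)
```

So the rule holds for prompt-heavy workloads, which is the common case for retrieval-augmented and long-context requests, but it is worth checking your own input/output split.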
Check your understanding
2-question self-check
Q1. At what traffic scale does self-hosting typically become worth evaluating?
Q2. When capacity planning for AI, the right unit of measurement is…