Scaling inference: the playbook at 10k, 100k, and 1M requests/day

At 10k requests/day your managed API plan is fine. At 1M/day you need real architecture. Here's what breaks at each threshold and what to do about it.

10k requests/day (or less)

You're a happy managed-API customer. Any team of 1-5 engineers can run this with near-zero infra.

Watch for: cost per request, rate limits on spikes. Otherwise nothing exotic needed.

100k requests/day

First thing that breaks: provider rate limits during peak traffic. Ask your account team to raise your limits. The request is routine and usually costs nothing.

Add: request queueing at your edge. When you hit a limit, queue briefly and retry rather than returning a 500 to the user.
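A minimal sketch of the retry half of that advice, assuming the provider signals a rate limit with an error your client can catch. `RateLimited`, `call_with_retry`, and the delay values are hypothetical names and defaults, not any provider's API.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when the provider rejects a call for rate limiting."""


def call_with_retry(send, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call send(); on RateLimited, wait with jittered exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: the caller decides what to tell the user
            # Full jitter: wait a random time up to an exponentially growing cap.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

The jitter keeps a burst of rejected requests from retrying in lockstep. Keep the total wait short for user-facing calls; a long queue only hides the limit.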

Consider: per-endpoint model routing. Don't use frontier for every task. Light classification goes to cheaper models.
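Per-endpoint routing can start as a lookup table. The sketch below assumes two model tiers; the endpoint names and model labels are placeholders, not real model identifiers.

```python
# Hypothetical endpoint names and model tiers; substitute your own.
ROUTES = {
    "classify": "small-model",
    "extract": "small-model",
    "chat": "frontier-model",
}
DEFAULT_MODEL = "frontier-model"


def pick_model(endpoint: str) -> str:
    """Return the model tier for an endpoint, defaulting to the frontier tier."""
    return ROUTES.get(endpoint, DEFAULT_MODEL)
```

Defaulting unknown endpoints to the frontier tier trades cost for safety: a new endpoint is never silently downgraded.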

500k-1M requests/day

Cost becomes a board-visible line item. Multiple optimizations pay for themselves:

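Caching. Prompt caching cuts the input-token cost of stable prefixes by 50-90%.
Model routing. Shifting a large share of traffic (60%, say) to a smaller model cuts the cost of that share.
Batch API for async workloads. Roughly 50% cheaper, in exchange for turnaround of up to 24h.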
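To see what a caching discount does to the total bill, a rough cost model helps. This sketch uses illustrative numbers throughout: the prices ($3 and $15 per million input and output tokens), the token counts, and the 75% cached share are assumptions, not any provider's rates.

```python
def daily_cost(requests, in_tokens, out_tokens, in_price, out_price,
               cached_fraction=0.0, cache_discount=0.0):
    """Daily cost in dollars. Prices are dollars per million tokens.

    cached_fraction: share of each prompt served from the prompt cache.
    cache_discount:  fractional discount on cached input tokens (0.9 = 90% off).
    """
    cached = in_tokens * cached_fraction
    fresh = in_tokens - cached
    per_request = (
        fresh * in_price
        + cached * in_price * (1 - cache_discount)
        + out_tokens * out_price
    ) / 1_000_000
    return requests * per_request


# Assumed workload: 1M requests/day, 2,000 input and 300 output tokens each.
baseline = daily_cost(1_000_000, 2000, 300, 3.0, 15.0)
with_cache = daily_cost(1_000_000, 2000, 300, 3.0, 15.0,
                        cached_fraction=0.75, cache_discount=0.9)
print(f"baseline ${baseline:,.0f}/day, with caching ${with_cache:,.0f}/day")
```

Under these assumptions the bill drops from about $10,500 to about $6,450 per day, roughly 39%. The discount applies only to the cached prefix, so output-heavy workloads see a smaller overall saving.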

Infrastructure: still no self-hosting usually. But you want:

Observability with per-endpoint breakdowns.
Circuit breakers for when a provider degrades.
Graceful failover between providers.
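The last two items fit together: a breaker per provider, and a loop that skips providers whose breaker is open. This is a simplified sketch; the class, thresholds, and the `(breaker, send)` pairing are hypothetical, and a production breaker would limit trial calls after the cooldown more carefully.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; permits calls again after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # (re)open and restart the cooldown


def call_with_failover(providers, request):
    """providers: list of (breaker, send) pairs in priority order."""
    last_error = None
    for breaker, send in providers:
        if not breaker.allow():
            continue  # provider recently failed repeatedly: skip it
        try:
            result = send(request)
        except Exception as exc:  # a real system would catch narrower errors
            breaker.record(False)
            last_error = exc
            continue
        breaker.record(True)
        return result
    raise RuntimeError("all providers unavailable") from last_error
```

Failing over between providers also means prompts and output parsing must work on each backend, which is usually the harder part.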

2M-10M requests/day

You can now do the math on self-hosting vs managed. Usually:

Frontier stays managed (you can't match their quality at reasonable cost).
Commodity inference (classification, summarization, embeddings) moves to managed inference platforms or self-hosted.
Hot tasks (retrieval, reranking, embeddings) get cached aggressively.
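"Doing the math" mostly means finding the volume at which per-token savings cover the fixed cost of running your own stack. A back-of-envelope sketch; every number here is an assumption to replace with your own quotes and salaries.

```python
def breakeven_tokens_per_month(fixed_monthly_cost, managed_price_per_mtok,
                               selfhost_price_per_mtok):
    """Monthly token volume at which self-hosting starts to pay off.

    fixed_monthly_cost:       engineers, reserved GPUs, on-call (dollars/month)
    managed_price_per_mtok:   managed API price per million tokens
    selfhost_price_per_mtok:  marginal self-hosted cost per million tokens
    """
    saving_per_mtok = managed_price_per_mtok - selfhost_price_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")  # self-hosting never pays off at these prices
    return fixed_monthly_cost / saving_per_mtok * 1_000_000


# Assumed: $60k/month fixed cost, $1.00 managed vs $0.25 self-hosted per million tokens.
print(f"{breakeven_tokens_per_month(60_000, 1.0, 0.25):,.0f} tokens/month")
```

With these made-up inputs the break-even is 80 billion tokens a month. Below that volume, staying managed is cheaper even though each token costs more.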

Engineering: dedicated infra engineer full-time. Probably a platform team forming around AI infrastructure.

50M+ requests/day

You have some combination of:

Self-hosted inference on GPUs.
Multiple providers as backends, routed dynamically.
Embeddings and reranking running on-prem.
Batch workloads on spot infrastructure.
Fine-tuned models for specific tasks.

The operational complexity demands a real team: three or more infra engineers, an ML engineer, an SRE or two. This is a serious investment; the payoff is a cost per request that managed APIs can't match.

What breaks at each scale

10k: Nothing, usually.
100k: Rate limits; surprise cost spikes.
500k: Tail latency (p99s drift up under load); model routing complexity.
2M: Multi-provider orchestration; batch-vs-realtime tradeoff.
10M+: Cold starts; GPU availability; maintenance windows; team cost.

Capacity planning for AI

Two new variables vs traditional capacity planning:

Tokens-per-request varies wildly: one user asks "What's 2+2?"; another pastes a 50k-token doc. Plan for the P95, not the average.
Latency is request-dependent: generating 100 tokens takes ~0.5s; 4000 tokens takes ~30s. Size concurrency in tokens per second, not requests per second.

Model your infrastructure in tokens, not requests.
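A rough sketch of that sizing, using nearest-rank percentiles and Little's law (in-flight requests = arrival rate x time per request). The function names and the 5 requests/second peak are hypothetical; the 4000-token, ~30s figure is the one quoted above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def required_tokens_per_second(peak_requests_per_second, tokens_per_request):
    """Token throughput needed to keep up with peak arrivals."""
    return peak_requests_per_second * tokens_per_request


def concurrent_streams(peak_requests_per_second, seconds_per_request):
    """Little's law: requests in flight at steady state."""
    return peak_requests_per_second * seconds_per_request


# Assumed peak of 5 requests/second, each generating 4000 tokens in ~30s.
print(required_tokens_per_second(5, 4000))  # tokens/second of generation capacity
print(concurrent_streams(5, 30))            # simultaneous generations to support
```

Feed `percentile` your logged per-request token counts and size on its P95 output; the same traffic sized on the average will look far smaller than it is.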

Batching vs. streaming

Realtime streaming: interactive chat, voice, anything user-facing. Managed APIs handle this well.
Batch inference: tagging, classification, embedding generation at scale. Batch APIs exist for a reason — much cheaper.
Continuous batching: a key optimization in self-hosted inference (vLLM, TGI). Different requests pack into shared model passes.

Choose per workload; don't mix paradigms in a single system.
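To see why continuous batching matters, here is a toy step counter. It is not how vLLM or TGI implement scheduling; it only models the idea that a finished request frees its slot immediately instead of waiting for the longest request in its batch. Each entry in `lengths` is the number of tokens a request generates, assumed to be at least 1.

```python
from collections import deque


def continuous_batching_steps(lengths, max_batch):
    """Count model passes when finished requests free their slot immediately."""
    waiting = deque(lengths)
    active = []
    steps = 0
    while waiting or active:
        # Fill any free slots before the next pass.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        steps += 1  # one pass generates one token for every active request
        active = [n - 1 for n in active if n > 1]
    return steps


def static_batching_steps(lengths, max_batch):
    """Count model passes when each batch runs until its longest request ends."""
    return sum(
        max(lengths[i:i + max_batch])
        for i in range(0, len(lengths), max_batch)
    )


lengths = [8, 1, 1, 1]  # one long generation, three short ones
print(static_batching_steps(lengths, max_batch=2))      # 9
print(continuous_batching_steps(lengths, max_batch=2))  # 8
```

In this small example the short requests ride along with the long one instead of waiting for a second batch. The gap widens as output lengths vary more.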

What to remember

Most teams over-engineer inference infrastructure too early. Three rules:

Stay managed as long as economics allow. Moving off managed costs engineer-months in exchange for per-token savings. Measure before jumping.
Optimize prompts before infrastructure. A 40% shorter prompt beats a 20% cheaper backend.
Instrument first. You can't scale what you can't see.
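Instrumentation can start small. A minimal in-memory sketch of per-endpoint tracking, assuming you can read token counts and latency off each response; `UsageTracker` is a hypothetical helper, and a real deployment would export these numbers to a metrics backend rather than hold them in a list.

```python
import math
from collections import defaultdict


class UsageTracker:
    """Per-endpoint request, token, and latency counters (in memory only)."""

    def __init__(self):
        self.tokens = defaultdict(int)
        self.latencies = defaultdict(list)

    def record(self, endpoint, input_tokens, output_tokens, latency_s):
        self.tokens[endpoint] += input_tokens + output_tokens
        self.latencies[endpoint].append(latency_s)

    def summary(self, endpoint):
        ordered = sorted(self.latencies[endpoint])
        n = len(ordered)
        if n == 0:
            return {"requests": 0, "tokens_per_request": 0.0, "p99_latency_s": None}
        rank = max(1, math.ceil(99 * n / 100))  # nearest-rank p99
        return {
            "requests": n,
            "tokens_per_request": self.tokens[endpoint] / n,
            "p99_latency_s": ordered[rank - 1],
        }
```

Tokens per request and p99 latency per endpoint are the two numbers the earlier sections keep coming back to, so they are the first ones worth recording.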
