Model serving architectures: what runs where
Managed APIs, self-hosted inference, and the hybrid middle ground.
"Where does inference run?" is the single decision that shapes cost, latency, privacy, and operational load. Four patterns worth knowing.
Pattern 1: Managed API (OpenAI, Anthropic, Google, etc.)
You send requests to a provider's endpoint. They handle infrastructure.
Pros:
- Zero infrastructure to manage.
- Always on latest models.
- Scales effortlessly.
- Cheapest to start.
Cons:
- Per-token costs that add up at scale.
- Your data leaves your network. Data processing agreements (DPAs) help, but some regulated use cases still can't send data to a third party.
- Latency floor set by the provider. Usually fine, occasionally not.
- Vendor lock-in.
Default choice for most teams, most use cases.
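To see how per-token costs "add up at scale," a rough estimate is enough. The sketch below is illustrative only: `TokenPricing`, `monthly_api_cost`, the prices, and the token counts are all placeholders, not any provider's real rate card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical prices in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def monthly_api_cost(requests_per_month: int,
                     avg_input_tokens: int,
                     avg_output_tokens: int,
                     pricing: TokenPricing) -> float:
    """Estimate monthly spend from request volume and average token counts."""
    input_cost = avg_input_tokens * pricing.input_per_million / 1_000_000
    output_cost = avg_output_tokens * pricing.output_per_million / 1_000_000
    return requests_per_month * (input_cost + output_cost)


# Placeholder prices, not a real rate card.
pricing = TokenPricing(input_per_million=3.0, output_per_million=15.0)

for volume in (100_000, 2_000_000, 20_000_000):
    cost = monthly_api_cost(volume, avg_input_tokens=1_000,
                            avg_output_tokens=500, pricing=pricing)
    print(f"{volume:>12,} requests/month -> ${cost:,.0f}")
```

With these made-up numbers the bill grows linearly from about a thousand dollars a month to a few hundred thousand. Swap in your own traffic profile and your provider's published prices before drawing conclusions.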
Pattern 2: Managed inference platforms (Together, Fireworks, Replicate, Modal, Baseten)
They host open-weight models for you; you call an API. Bridge between fully-managed and self-hosted.
Pros:
- Access to open-weight models (Llama, Mistral, Qwen) without running GPUs.
- Sometimes cheaper per-token than frontier closed models.
- Often better latency than general APIs because of specialization.
Cons:
- You depend on the provider for uptime and also for model availability; a model you rely on can be retired on their schedule, not yours.
- Fewer "batteries included" features than OpenAI-grade APIs.
- Quality of open models still below frontier on some tasks.
Good when you need open-weight specifically or want better economics at scale.
Pattern 3: Self-hosted on your infrastructure
You run the GPUs and the inference stack (vLLM, TGI, Triton, sglang).
Pros:
- Data never leaves your network — strongest privacy story.
- Full control over model, quantization, batching, priority.
- At sufficient scale, cheapest per-token.
Cons:
- GPU operations, capacity planning, failure modes you now own.
- Model updates require your work; you don't get "always latest" automatically.
- Team cost: 1-3 infra engineers minimum.
Right for regulated industries, organizations with existing ML infrastructure, and very-high-volume workloads.
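"Cheapest per-token at sufficient scale" depends heavily on how busy the GPUs are. A minimal sketch of that arithmetic, assuming a single GPU with a known hourly cost and sustained throughput; the function name and all figures are hypothetical:

```python
def self_hosted_cost_per_million_tokens(gpu_hourly_usd: float,
                                        tokens_per_second: float,
                                        utilization: float) -> float:
    """Effective USD per million tokens for one GPU.

    utilization is the fraction of each hour spent serving real traffic, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Placeholder numbers: a $2/hour GPU sustaining 1,000 tokens/second when busy.
for utilization in (0.1, 0.5, 1.0):
    cost = self_hosted_cost_per_million_tokens(2.0, 1_000, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens")
```

The same hardware is ten times more expensive per token at 10% utilization than at 100%. This covers GPU cost only; the engineering headcount listed under Cons is not included.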
Pattern 4: Hybrid
Mix of patterns. Common setups:
- Frontier calls go to managed APIs; bulk workloads go to managed inference platforms; sensitive workloads stay on-prem.
- Production traffic on managed APIs; experimentation on self-hosted.
- Routing layer sends specific model classes to specific backends.
Hybrid is the answer for most organizations past the "one product" stage.
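The first setup above can be sketched as a small routing function. This is one possible shape, not a prescribed design: `Backend`, `InferenceRequest`, and the two boolean flags are hypothetical names, and a real router would also need to handle fallbacks and backend outages.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    MANAGED_API = "managed_api"              # frontier closed models
    MANAGED_INFERENCE = "managed_inference"  # hosted open-weight models
    SELF_HOSTED = "self_hosted"              # on-prem GPUs


@dataclass(frozen=True)
class InferenceRequest:
    prompt: str
    sensitive: bool = False       # data that must not leave the network
    needs_frontier: bool = False  # task needs frontier-model quality


def route(request: InferenceRequest) -> Backend:
    """Pick a backend; sensitivity wins over every other consideration."""
    if request.sensitive:
        return Backend.SELF_HOSTED
    if request.needs_frontier:
        return Backend.MANAGED_API
    return Backend.MANAGED_INFERENCE  # bulk workloads


print(route(InferenceRequest("summarize this contract", sensitive=True)))
print(route(InferenceRequest("classify this ticket")))
```

Checking `sensitive` first is deliberate: a request that is both sensitive and quality-critical still stays on-prem, and the quality gap is handled some other way.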
The decision framework
Start with managed APIs. Move off only when you have a concrete forcing function:
- Privacy/regulatory requirements that managed can't meet.
- Cost at scale that makes self-hosting break even (usually above roughly $50k/month in managed API spend).
- Latency target that managed can't hit (rare).
- Specific model that's only open-weight.
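The checklist above can be written down as a function that returns the forcing functions a workload actually has. This is a sketch under assumptions: the `Workload` fields are hypothetical, and the $50k threshold is read here as monthly managed-API spend.

```python
from dataclasses import dataclass

# The lesson's rule of thumb, read as monthly managed-API spend (an assumption).
BREAK_EVEN_SPEND_USD = 50_000


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload; field names are illustrative."""
    managed_meets_compliance: bool
    monthly_api_spend_usd: float
    managed_meets_latency: bool
    needs_open_weight_only_model: bool


def forcing_functions(w: Workload) -> list[str]:
    """Return the concrete reasons, if any, to move off managed APIs."""
    reasons = []
    if not w.managed_meets_compliance:
        reasons.append("privacy/regulatory")
    if w.monthly_api_spend_usd > BREAK_EVEN_SPEND_USD:
        reasons.append("cost at scale")
    if not w.managed_meets_latency:
        reasons.append("latency")
    if w.needs_open_weight_only_model:
        reasons.append("open-weight-only model")
    return reasons


typical = Workload(True, 4_000, True, False)
print(forcing_functions(typical) or "stay on managed APIs")
```

An empty list means there is no forcing function yet, so the default holds. A non-empty list tells you which pattern to evaluate next and why.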
The gotchas
- Capacity planning on managed. Rate limits can surprise you at scale. Talk to your account team.
- Cold starts on managed inference platforms. First request after idle can be slow. Warm-up endpoints matter.
- Quantization on self-hosted. Running FP16 vs INT8 vs INT4 trades quality for cost. Benchmark, don't guess.
- Version pinning. Whichever pattern, pin the exact model version. "Model updates" can change behavior overnight.
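For the quantization gotcha, the cost side of the trade is easy to estimate: weight memory scales with bytes per parameter (2 for FP16, 1 for INT8, 0.5 for INT4). The sketch below counts weights only; KV cache, activations, and quantization metadata are left out, and the 7-billion-parameter model is just an example size.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(7e9, precision)
    print(f"7B parameters at {precision}: ~{gb:.1f} GB of weights")
```

This says nothing about the quality side of the trade, which is why the advice is to benchmark on your own tasks rather than guess.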
What scale-appropriate looks like
| Scale | Default answer |
|---|---|
| <100k requests/month | Managed API |
| 100k-2M requests/month | Managed API, maybe some tasks on managed inference |
| >2M requests/month | Hybrid likely, with cost modeling driving choices |
| Regulated + medium scale | Managed inference or self-hosted |
Check your understanding
Q1. For most teams in 2026, the default serving pattern is…
Q2. When does self-hosting inference usually cross the line from premature to justified?