Model serving architectures: what runs where

"Where does inference run?" is the single decision that shapes cost, latency, privacy, and operational load. Four patterns worth knowing.

Pattern 1: Managed API (OpenAI, Anthropic, Google, etc.)

You send requests to a provider's endpoint. They handle infrastructure.

Pros:

Zero infrastructure to manage.
Always on latest models.
Scales effortlessly.
Cheapest to start.

Cons:

Per-token costs that add up at scale.
Your data leaves your network. Data processing agreements (DPAs) help, but some regulated use cases still can't use a managed API.
Latency floor set by the provider. Usually fine, occasionally not.
Vendor lock-in.

Default choice for most teams, most use cases.

Pattern 2: Managed inference platforms (Together, Fireworks, Replicate, Modal, Baseten)

They host open-weight models for you; you call an API. A bridge between fully managed and self-hosted.

Pros:

Access to open-weight models (Llama, Mistral, Qwen) without running GPUs.
Sometimes cheaper per-token than frontier closed models.
Often better latency than general APIs because of specialization.

Cons:

You depend on the provider not only for uptime but also for which models stay available.
Fewer "batteries included" features than OpenAI-grade APIs.
Quality of open models still below frontier on some tasks.

Good when you need open-weight specifically or want better economics at scale.

Pattern 3: Self-hosted on your infrastructure

You run the GPUs and the inference stack (vLLM, TGI, Triton, sglang).

Pros:

Data never leaves your network — strongest privacy story.
Full control over model, quantization, batching, priority.
At sufficient scale, cheapest per-token.

Cons:

GPU operations, capacity planning, failure modes you now own.
Model updates require your work — you're not on automatic "always latest."
Team cost: 1-3 infra engineers minimum.

Right for regulated industries, organizations with existing ML infrastructure, and very-high-volume workloads.

Pattern 4: Hybrid

Mix of patterns. Common setups:

Frontier calls go to managed APIs; bulk workloads go to managed inference platforms; sensitive workloads stay on-prem.
Production traffic on managed APIs; experimentation on self-hosted.
Routing layer sends specific model classes to specific backends.

Hybrid is the answer for most organizations past the "one product" stage.
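
The routing layer mentioned above can be sketched in a few lines. This is a minimal illustration, not a reference design: the task labels, the `sensitive` flag, and the `BULK_TASKS` set are hypothetical names invented for the example, and a real router would also handle fallbacks, quotas, and per-model configuration.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    MANAGED_API = "managed_api"              # Pattern 1
    MANAGED_INFERENCE = "managed_inference"  # Pattern 2
    SELF_HOSTED = "self_hosted"              # Pattern 3


@dataclass(frozen=True)
class Request:
    task: str                # hypothetical label, e.g. "reasoning" or "classification"
    sensitive: bool = False  # True if the payload must not leave the network


# Hypothetical set of high-volume task classes; define your own in practice.
BULK_TASKS = {"bulk_summarize", "classification", "embedding"}


def route(request: Request) -> Backend:
    """Pick a backend: sensitivity wins, then bulk work, then the frontier default."""
    if request.sensitive:
        return Backend.SELF_HOSTED
    if request.task in BULK_TASKS:
        return Backend.MANAGED_INFERENCE
    return Backend.MANAGED_API


if __name__ == "__main__":
    print(route(Request("reasoning")))
    print(route(Request("classification")))
    print(route(Request("classification", sensitive=True)))
```

Checking sensitivity first is a deliberate choice in this sketch: a privacy constraint should never be overridden by a cost or quality preference.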

The decision framework

Start with managed APIs. Move off only when you have a concrete forcing function:

Privacy/regulatory requirements that managed can't meet.
Cost at scale that makes self-hosting break even (usually >$50k/month).
A latency target that managed can't hit (rare).
Specific model that's only open-weight.
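
The cost forcing function is easiest to judge with rough arithmetic. The sketch below assumes a deliberately simple model: managed cost scales linearly with tokens, and self-hosted cost is a fixed monthly sum of GPUs plus staff, with enough capacity for the workload. Every price and count in the example is a made-up placeholder, not a quote from any provider.

```python
def managed_monthly_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Managed spend: pay per token, no fixed cost."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def self_hosted_monthly_cost(
    gpu_count: int, usd_per_gpu_month: float, staff_usd_per_month: float
) -> float:
    """Self-hosted spend, treated as fixed: hardware plus the people who run it."""
    return gpu_count * usd_per_gpu_month + staff_usd_per_month


def break_even_tokens(fixed_self_hosted_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which managed spend equals the fixed self-hosted cost."""
    return fixed_self_hosted_usd / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: 8 GPUs at $2,000/month each, $40,000/month of staff time,
    # and a managed price of $10 per million tokens.
    fixed = self_hosted_monthly_cost(8, 2_000, 40_000)
    volume = break_even_tokens(fixed, 10)
    print(f"self-hosted fixed cost: ${fixed:,.0f}/month")
    print(f"break-even volume: {volume:,.0f} tokens/month")
```

Below the break-even volume, managed is cheaper under these assumptions; above it, self-hosting starts to pay off. Plug in your own quotes before drawing conclusions.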

The gotchas

Capacity planning on managed. Rate limits can surprise you at scale. Talk to your account team.
Cold starts on managed inference platforms. First request after idle can be slow. Warm-up endpoints matter.
Quantization on self-hosted. Running FP16 vs INT8 vs INT4 trades quality for cost. Benchmark, don't guess.
Version pinning. Whichever pattern, pin the exact model version. "Model updates" can change behavior overnight.
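
For the quantization gotcha, the cost side of the trade is simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below covers weights only and assumes 2 bytes for FP16, 1 for INT8, and 0.5 for INT4; KV cache, activations, and runtime overhead come on top, and the quality side still needs a benchmark. The 70-billion-parameter figure is just an example size.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    params = 70e9  # example model size, chosen for illustration
    for precision in ("fp16", "int8", "int4"):
        print(f"{precision}: {weight_memory_gb(params, precision):.0f} GB")
```

Halving the precision halves the weight footprint, which can decide how many GPUs a model needs. Whether the quality loss is acceptable depends on your task.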

What scale-appropriate looks like

Scale	Default answer
<100k requests/month	Managed API
100k-2M requests/month	Managed API, maybe some tasks on managed inference
>2M requests/month	Hybrid likely, with cost modeling driving choices
Regulated + medium scale	Managed inference or self-hosted