Designing AI APIs your team can actually use

AI endpoints behave differently from ordinary REST endpoints: they stream, they can run for a long time, and their output is non-deterministic. Most mistakes come from designing them as if they were ordinary request/response calls. Here's what to do differently.

Request shape: think in jobs, not calls

A short "complete this text" call fits REST fine. Anything longer (research, multi-step, agent work) should be modeled as a job, not a single request:

POST /jobs → returns {id, status}. Kicks off the work.
GET /jobs/{id} → returns current status / partial output.
GET /jobs/{id}/stream → server-sent events for live updates.
DELETE /jobs/{id} → cancels an in-progress job.

This scales better: clients can come and go, jobs survive disconnects, and a client that retries can look the job up by ID instead of starting the work again.
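The four job endpoints above can be sketched as plain functions over a job store. This is a minimal illustration, not a reference implementation: `JobStore`, `JobStatus`, and the in-memory dictionary are assumptions standing in for whatever queue or database you actually use, and HTTP routing is left out.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    CANCELLED = "cancelled"


@dataclass
class Job:
    id: str
    status: JobStatus = JobStatus.QUEUED
    partial_output: list = field(default_factory=list)


class JobStore:
    """Hypothetical in-memory stand-in for a real job queue or database."""

    def __init__(self) -> None:
        self._jobs = {}

    def create(self) -> dict:
        # POST /jobs: register the work and return immediately.
        job = Job(id=str(uuid.uuid4()))
        self._jobs[job.id] = job
        return {"id": job.id, "status": job.status.value}

    def get(self, job_id: str) -> dict:
        # GET /jobs/{id}: current status plus whatever output exists so far.
        job = self._jobs[job_id]
        return {
            "id": job.id,
            "status": job.status.value,
            "output": "".join(job.partial_output),
        }

    def append_output(self, job_id: str, chunk: str) -> bool:
        # Called by the worker. Returns False once the job is cancelled,
        # which tells the worker to stop generating.
        job = self._jobs[job_id]
        if job.status is JobStatus.CANCELLED:
            return False
        job.status = JobStatus.RUNNING
        job.partial_output.append(chunk)
        return True

    def cancel(self, job_id: str) -> dict:
        # DELETE /jobs/{id}: only unfinished jobs can be cancelled.
        job = self._jobs[job_id]
        if job.status in (JobStatus.QUEUED, JobStatus.RUNNING):
            job.status = JobStatus.CANCELLED
        return {"id": job.id, "status": job.status.value}
```

Because the job ID is the only handle, a client that disconnects can call `get` again later and pick up where it left off.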

Streaming: first tokens matter most

For interactive use, aim to start streaming within about 500ms. Beyond that, users tend to perceive the interface as unresponsive.

Practical tactics:

Get the request acknowledged fast, even before the model starts generating.
Send periodic "still working" markers during long gaps (tool calls, retrieval).
When the stream ends, send a final structured event with aggregate info (total tokens, cost, time).
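The three tactics above might look like this as a server-sent events generator. It is a sketch under assumptions: the event names (`ack`, `heartbeat`, `token`, `done`) are invented for illustration, a `None` chunk stands in for a gap such as a tool call, and the whitespace-based token count is a placeholder for real usage numbers from the model provider.

```python
import json
import time
from typing import Iterable, Iterator, Optional


def sse_event(event: str, data: dict) -> str:
    """Format one server-sent event frame (event name plus JSON payload)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_response(chunks: Iterable[Optional[str]]) -> Iterator[str]:
    started = time.monotonic()
    # Acknowledge immediately, before any model output exists.
    yield sse_event("ack", {"status": "accepted"})
    total_tokens = 0
    for chunk in chunks:
        if chunk is None:
            # A gap in generation: tell the client the work is still alive.
            yield sse_event("heartbeat", {"status": "working"})
            continue
        total_tokens += len(chunk.split())  # crude count, illustration only
        yield sse_event("token", {"text": chunk})
    # Explicit end-of-stream event carrying the aggregate info.
    elapsed = round(time.monotonic() - started, 3)
    yield sse_event("done", {"total_tokens": total_tokens, "elapsed_s": elapsed})
```

The explicit `done` event lets a client tell a completed stream from a dropped connection.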

Idempotency and retries

Every AI endpoint needs an idempotency key:

Client generates a UUID for each logical request.
Sends it in an Idempotency-Key header.
Server deduplicates if it sees the same key within N minutes.

Without this, a retry after a timeout repeats the work: duplicate cost and duplicate side effects.
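Server-side deduplication can be sketched as a small cache keyed by the Idempotency-Key value. The class name `IdempotencyCache`, the `ttl_seconds` parameter, and the injectable `clock` are assumptions for illustration. A production version would also need shared storage and handling for a second request that arrives while the first is still running.

```python
import time
from typing import Any, Callable


class IdempotencyCache:
    """Hypothetical single-process store of responses by idempotency key."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # key -> (timestamp, stored response)

    def run_once(self, key: str, handler: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._seen.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            # Same key inside the window: replay the stored response
            # instead of running inference again.
            return entry[1]
        result = handler()
        self._seen[key] = (now, result)
        return result
```

On the client side, the key would be generated once per logical request (for example with `uuid.uuid4()`) and reused unchanged on every retry of that request.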

Timeouts and cancellation

AI calls can take anywhere from 300ms to 5 minutes. Pick reasonable tiers:

Simple endpoints (classification, extraction): 15s timeout.
Generative endpoints: 60s with streaming, longer without.
Agent endpoints: job pattern, no HTTP timeout.

And wire cancellation through:

Client closes connection → server aborts inference.
Client calls DELETE /jobs/{id} → worker process stops.

Orphaned inference calls are a real cost leak.
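One way to wire both ideas through in Python is to run inference as an asyncio task, so a timeout or a client disconnect becomes a task cancellation. This is a sketch: `fake_inference` is a stand-in for a real provider call, and the `state` dictionary exists only to show where cleanup would happen.

```python
import asyncio

# Timeout tiers from the text, in seconds. Agent work uses the job pattern.
TIMEOUTS_S = {"simple": 15, "generative": 60}


async def fake_inference(duration_s: float, state: dict) -> str:
    """Stand-in for a provider call that takes duration_s seconds."""
    try:
        await asyncio.sleep(duration_s)
        return "result"
    except asyncio.CancelledError:
        # A real worker would abort the upstream request here.
        state["aborted"] = True
        raise


async def call_with_timeout(duration_s: float, timeout_s: float,
                            state: dict) -> str:
    try:
        # wait_for cancels the inner call when the timeout expires.
        return await asyncio.wait_for(
            fake_inference(duration_s, state), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return "timeout"


async def cancel_in_flight(state: dict) -> str:
    task = asyncio.create_task(fake_inference(10, state))
    await asyncio.sleep(0)  # let the task start running
    # This is what a disconnect handler or DELETE /jobs/{id} would trigger.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "finished"
```

For example, `asyncio.run(call_with_timeout(5, TIMEOUTS_S["simple"], {}))` would let a five-second call finish inside the 15-second tier.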

Rate limits, cost limits, concurrency limits

Three different things, all needed:

Rate limit: requests per time window. Protects against abuse.
Cost limit: dollar cap per user per day. Protects against bugs and loops.
Concurrency limit: max parallel requests per user. Protects you from being overwhelmed.

Each needs a sensible default, and each should be tunable per user or per tier.
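To show that these are three separate checks, here is a minimal single-process sketch. The names `Limits` and `LimitGuard` and the default numbers are placeholders, not recommendations, and the daily reset of the spend counter is not shown.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Limits:
    # Placeholder defaults; tune per user or per tier.
    requests_per_minute: int = 60
    daily_cost_usd: float = 5.0
    max_concurrent: int = 4


class LimitGuard:
    def __init__(self, limits: Limits, clock=time.monotonic) -> None:
        self.limits = limits
        self._clock = clock
        self._recent = defaultdict(deque)   # user -> recent request timestamps
        self._spend = defaultdict(float)    # user -> spend so far today
        self._in_flight = defaultdict(int)  # user -> requests currently running

    def try_start(self, user: str) -> str:
        """Return 'ok', or the name of the limit that blocked the request."""
        now = self._clock()
        window = self._recent[user]
        while window and now - window[0] >= 60:
            window.popleft()  # drop requests older than the one-minute window
        if len(window) >= self.limits.requests_per_minute:
            return "rate_limit"
        if self._spend[user] >= self.limits.daily_cost_usd:
            return "cost_limit"
        if self._in_flight[user] >= self.limits.max_concurrent:
            return "concurrency_limit"
        window.append(now)
        self._in_flight[user] += 1
        return "ok"

    def finish(self, user: str, cost_usd: float) -> None:
        # Called when a request completes, with its actual cost.
        self._in_flight[user] -= 1
        self._spend[user] += cost_usd
```

Returning the name of the limit that fired makes it easy to map each one to a distinct error code for the client.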

Error model

AI endpoints fail in more ways than CRUD endpoints:

Provider is down.
Provider is up but slow.
Model output didn't match schema.
Content filter triggered.
Rate limit from the provider.
Prompt injection detected.

Design an error taxonomy up front. Clients should be able to distinguish retryable errors (provider 503) from non-retryable (malformed input).
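A taxonomy along these lines could start as an enum plus an explicit retryable set. The code names and the choice of which codes are retryable are assumptions. In particular, whether a schema mismatch should be retried by the client or handled server-side is a judgment call for your API.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCode(str, Enum):
    PROVIDER_UNAVAILABLE = "provider_unavailable"
    PROVIDER_TIMEOUT = "provider_timeout"
    PROVIDER_RATE_LIMITED = "provider_rate_limited"
    SCHEMA_MISMATCH = "schema_mismatch"
    CONTENT_FILTERED = "content_filtered"
    PROMPT_INJECTION = "prompt_injection"
    INVALID_INPUT = "invalid_input"


# Assumed policy: only transient provider problems are worth a client retry.
RETRYABLE = frozenset({
    ErrorCode.PROVIDER_UNAVAILABLE,
    ErrorCode.PROVIDER_TIMEOUT,
    ErrorCode.PROVIDER_RATE_LIMITED,
})


@dataclass(frozen=True)
class ApiError:
    code: ErrorCode
    message: str  # safe for clients: no internal prompts, no PII
    request_id: str

    @property
    def retryable(self) -> bool:
        return self.code in RETRYABLE

    def to_body(self) -> dict:
        # Machine-readable error body; clients branch on code and retryable.
        return {
            "error": {
                "code": self.code.value,
                "message": self.message,
                "retryable": self.retryable,
                "request_id": self.request_id,
            }
        }
```

Clients can then branch on `retryable` rather than parsing message strings.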

API versioning

Your prompt, your model, and your response schema are a behavioral contract with API clients. Version them together:

/v1/chat → old prompt + old model + old schema.
/v2/chat → new prompt + new model + new schema. Clients migrate at their own pace.

Don't silently change the prompt or model behind a versioned endpoint. Clients relying on specific output shapes will break.
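One way to keep the three pieces moving together is to pin them in a single immutable record per endpoint version. The identifiers below (`chat-prompt-1`, `model-a`, and so on) are made-up placeholders, not real prompt or model names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatVersion:
    """Everything that defines the behavior of one endpoint version."""
    prompt_id: str
    model: str
    schema_id: str


# Changing any field means adding a new entry, not editing an old one.
VERSIONS = {
    "/v1/chat": ChatVersion("chat-prompt-1", "model-a", "chat-response-1"),
    "/v2/chat": ChatVersion("chat-prompt-2", "model-b", "chat-response-2"),
}


def resolve(path: str) -> ChatVersion:
    # Look up the pinned prompt, model, and schema for a versioned path.
    return VERSIONS[path]
```

Freezing the dataclass makes an accidental in-place change to a published version fail loudly instead of silently changing behavior.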

Observability: what to surface to clients

Request ID in every response and error. Clients can include it in bug reports.
Cost info (optional header or body field). Helps clients budget.
Model version used. Useful for their own eval/debug.
Retry-After on rate limits.
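These fields can be attached as response headers. In the sketch below, `Retry-After` is the standard HTTP header, while `X-Request-Id`, `X-Model-Version`, and `X-Cost-Usd` are assumed names chosen for illustration. Use whatever convention your API already follows.

```python
from typing import Optional


def client_metadata_headers(
    request_id: str,
    model_version: str,
    cost_usd: Optional[float] = None,
    retry_after_s: Optional[int] = None,
) -> dict:
    # Always present, on successes and on errors.
    headers = {
        "X-Request-Id": request_id,
        "X-Model-Version": model_version,
    }
    if cost_usd is not None:
        # Optional cost info, formatted with fixed precision.
        headers["X-Cost-Usd"] = f"{cost_usd:.6f}"
    if retry_after_s is not None:
        # Only set on rate-limited responses; value is a delay in seconds.
        headers["Retry-After"] = str(retry_after_s)
    return headers
```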

The common design mistakes

Synchronous endpoints for jobs that can take >30 seconds.
No idempotency, so retries cost 2x.
Streaming without an end-of-stream signal.
Error messages that leak internal prompts or PII.
No way to cancel in-flight work.
