Designing AI APIs your team can actually use
Rate limits, idempotency, streaming — the API patterns that save you later.
AI endpoints have special shapes — streaming, long latency, non-determinism. Most mistakes come from designing them like normal REST endpoints. Here's what to do differently.
Request shape: think in jobs, not calls
A short "complete this text" call fits REST fine. Anything longer (research, multi-step, agent work) should be modeled as a job, not a single request:
- POST /jobs → returns {id, status}. Kicks off the work.
- GET /jobs/{id} → returns current status / partial output.
- GET /jobs/{id}/stream → server-sent events for live updates.
- DELETE /jobs/{id} → cancels an in-progress job.
This scales better: clients can come and go, jobs survive disconnects, and retries are safe because they target an existing job ID instead of starting new work.
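A minimal sketch of the state behind those four routes, assuming an in-memory store. `JobStore`, `Job`, and the status names are hypothetical; a real service would back this with a database or queue and run the work in a separate worker.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Job:
    id: str
    status: str = "queued"  # assumed lifecycle: queued -> running -> cancelled
    output: list[str] = field(default_factory=list)  # partial output so far


class JobStore:
    """In-memory stand-in for a real job table or queue."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def create(self) -> dict:
        # POST /jobs: register the work and return immediately.
        job = Job(id=str(uuid.uuid4()))
        self._jobs[job.id] = job
        return {"id": job.id, "status": job.status}

    def get(self, job_id: str) -> dict:
        # GET /jobs/{id}: current status plus whatever output exists so far.
        job = self._jobs[job_id]
        return {"id": job.id, "status": job.status, "output": "".join(job.output)}

    def append_output(self, job_id: str, chunk: str) -> None:
        # Called by the worker as it produces partial results.
        job = self._jobs[job_id]
        if job.status == "cancelled":
            return  # ignore late output from a cancelled job
        job.status = "running"
        job.output.append(chunk)

    def cancel(self, job_id: str) -> dict:
        # DELETE /jobs/{id}: only queued or running jobs can be cancelled.
        job = self._jobs[job_id]
        if job.status in ("queued", "running"):
            job.status = "cancelled"
        return {"id": job.id, "status": job.status}
```

Because every operation is keyed by the job ID, a client that disconnects can call `get` again later and pick up where it left off.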
Streaming: first tokens matter most
For interactive use, start streaming within 500ms. That is a commonly used rule of thumb for how long a response can take before it stops feeling immediate.
Practical tactics:
- Get the request acknowledged fast, even before the model starts generating.
- Send periodic "still working" markers during long gaps (tool calls, retrieval).
- When the stream ends, send a final structured event with aggregate info (total tokens, cost, time).
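The three tactics above can be sketched as a generator of server-sent event frames. This is an illustration under assumptions: the event names (`ack`, `token`, `done`) and the `total_tokens` field are hypothetical, a `None` chunk stands in for a gap such as a tool call, and chunks are counted as tokens only to keep the example short.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(event: str, data: dict) -> str:
    # One server-sent event: an event name, a JSON payload, and a blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_response(chunks: Iterable[Optional[str]]) -> Iterator[str]:
    """Yield SSE frames: fast ack, tokens, keep-alives, then a final summary."""
    # Acknowledge before the model has produced anything.
    yield sse_event("ack", {"status": "accepted"})
    tokens = 0
    for chunk in chunks:
        if chunk is None:
            # A line starting with ":" is an SSE comment; it keeps the
            # connection alive without delivering an event to the client.
            yield ": still working\n\n"
        else:
            tokens += 1
            yield sse_event("token", {"text": chunk})
    # Explicit end-of-stream event carrying aggregate info.
    yield sse_event("done", {"total_tokens": tokens})
```

For example, `list(stream_response(["Hi", None, "!"]))` produces five frames: the ack, a token, a keep-alive comment, a second token, and the final `done` event.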
Idempotency and retries
Every AI endpoint needs an idempotency key:
- Client generates a UUID for each logical request.
- Sends it in an Idempotency-Key header.
- Server deduplicates if it sees the same key within N minutes.
Without this, retries on timeouts cause duplicate work — duplicate cost, duplicate side effects.
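A minimal sketch of the server-side deduplication, assuming a single process and an in-memory dictionary. `IdempotencyCache` and its `clock` parameter are hypothetical names; a production version would use a shared store and would also need to handle two requests with the same key arriving at the same time, which this sketch does not.

```python
import time
from typing import Any, Callable


class IdempotencyCache:
    """Replay the stored result when a key repeats within the TTL window."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._seen: dict[str, tuple[float, Any]] = {}

    def run(self, key: str, work: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._seen.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            # Same key inside the window: return the earlier result,
            # do not call the model again.
            return hit[1]
        result = work()
        self._seen[key] = (now, result)
        return result
```

The handler would read the key from the `Idempotency-Key` header and pass the actual inference call as `work`, so a retried request costs nothing extra while the window is open.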
Timeouts and cancellation
AI calls can take anywhere from 300ms to 5 minutes. Pick reasonable tiers:
- Simple endpoints (classification, extraction): 15s timeout.
- Generative endpoints: 60s with streaming, longer without.
- Agent endpoints: job pattern, no HTTP timeout.
And wire cancellation through:
- Client closes connection → server aborts inference.
- Client calls DELETE /jobs/{id} → worker process stops.
Orphaned inference calls are a real cost leak.
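One way to wire a timeout through to the worker, sketched with `asyncio`. The tier names, `call_with_timeout`, and `fake_inference` are hypothetical; `fake_inference` only sleeps, standing in for a real provider call. The point is that `asyncio.wait_for` cancels the inner coroutine on timeout, which gives the worker a place to abort upstream work.

```python
import asyncio
from typing import Awaitable, Optional

# Timeouts in seconds, taken from the tiers listed above.
TIMEOUTS_S = {"simple": 15.0, "generative": 60.0}


async def call_with_timeout(tier: str, inference: Awaitable[dict],
                            timeout_s: Optional[float] = None) -> dict:
    """Run an inference coroutine under its tier's timeout."""
    limit = TIMEOUTS_S[tier] if timeout_s is None else timeout_s
    try:
        # On timeout, wait_for cancels the inner coroutine before raising.
        return await asyncio.wait_for(inference, timeout=limit)
    except asyncio.TimeoutError:
        return {"error": "timeout"}


async def fake_inference(delay_s: float, aborted: list) -> dict:
    # Stand-in for a provider call; records when it gets cancelled.
    try:
        await asyncio.sleep(delay_s)
        return {"text": "ok"}
    except asyncio.CancelledError:
        aborted.append(True)  # a real worker would abort the provider call here
        raise
```

The same cancellation path can serve a closed client connection or a DELETE on the job: whatever detects the event cancels the task, and the worker's `CancelledError` handler stops the inference.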
Rate limits, cost limits, concurrency limits
Three different things, all needed:
- Rate limit: requests per time window. Protects against abuse.
- Cost limit: dollar cap per user per day. Protects against bugs and loops.
- Concurrency limit: max parallel requests per user. Protects you from being overwhelmed.
Each has an appropriate default; each should be tunable per user or per tier.
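The three limits can be checked together at admission time. The sketch below is illustrative only: `UserLimits` and its parameter names are hypothetical, state is in memory, it is not thread-safe, and the daily reset of the spend counter is left out.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Optional


class UserLimits:
    """Per-user rate, cost, and concurrency checks in one place."""

    def __init__(self, max_requests: int, window_s: float,
                 daily_cost_cap: float, max_concurrent: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_requests = max_requests
        self._window_s = window_s
        self._daily_cost_cap = daily_cost_cap
        self._max_concurrent = max_concurrent
        self._clock = clock
        self._requests = defaultdict(deque)  # user -> recent start times
        self._spend = defaultdict(float)     # user -> dollars spent so far
        self._in_flight = defaultdict(int)   # user -> running requests

    def try_start(self, user: str) -> Optional[str]:
        """Return None if the request may run, else the limit that blocked it."""
        now = self._clock()
        recent = self._requests[user]
        while recent and now - recent[0] >= self._window_s:
            recent.popleft()  # drop timestamps outside the sliding window
        if len(recent) >= self._max_requests:
            return "rate_limit"
        if self._spend[user] >= self._daily_cost_cap:
            return "cost_limit"
        if self._in_flight[user] >= self._max_concurrent:
            return "concurrency_limit"
        recent.append(now)
        self._in_flight[user] += 1
        return None

    def finish(self, user: str, cost_usd: float) -> None:
        # Call when a request completes, with its actual cost.
        self._in_flight[user] -= 1
        self._spend[user] += cost_usd
```

Returning the name of the limit that fired lets the API tell clients which of the three they hit, since the right client reaction differs for each.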
Error model
AI endpoints fail in more ways than CRUD endpoints:
- Provider is down.
- Provider is up but slow.
- Model output didn't match schema.
- Content filter triggered.
- Rate limit from the provider.
- Prompt injection detected.
Design an error taxonomy up front. Clients should be able to distinguish retryable errors (provider 503) from non-retryable (malformed input).
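One possible shape for such a taxonomy is an enum that carries a machine-readable code and a retryable flag. The names below are hypothetical, and the retryable flags are assumptions for illustration (for example, whether a schema mismatch is worth retrying depends on your system); only the provider-503 and malformed-input cases come from the text above.

```python
from enum import Enum


class AIError(Enum):
    # (machine-readable code, retryable?) -- flags are illustrative assumptions
    PROVIDER_DOWN = ("provider_down", True)
    PROVIDER_SLOW = ("provider_slow", True)
    SCHEMA_MISMATCH = ("schema_mismatch", True)
    CONTENT_FILTERED = ("content_filtered", False)
    PROVIDER_RATE_LIMITED = ("provider_rate_limited", True)
    PROMPT_INJECTION = ("prompt_injection", False)
    MALFORMED_INPUT = ("malformed_input", False)

    def __init__(self, code: str, retryable: bool) -> None:
        self.code = code
        self.retryable = retryable


def error_body(err: AIError, request_id: str) -> dict:
    # Response body a client can branch on without parsing message text.
    return {
        "error": {
            "code": err.code,
            "retryable": err.retryable,
            "request_id": request_id,
        }
    }
```

A client can then decide whether to retry by reading `error.retryable` instead of guessing from the HTTP status or the message.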
API versioning
Your prompt, your model, and your response schema are a behavioral contract with API clients. Version them together:
- /v1/chat → old prompt + old model + old schema.
- /v2/chat → new version. Clients migrate at their own pace.
Don't silently change prompt or model behind a versioned endpoint. Clients relying on specific output shapes will break.
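A small sketch of pinning prompt, model, and schema together per version. `ChatVersion`, the `VERSIONS` table, and every identifier in it (`chat-prompt-1`, `model-a`, and so on) are placeholders, not real names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a published version cannot be edited in place
class ChatVersion:
    prompt_id: str
    model: str
    schema_id: str


# Each route pins all three pieces of the behavioral contract.
VERSIONS = {
    "/v1/chat": ChatVersion(prompt_id="chat-prompt-1", model="model-a",
                            schema_id="chat-response-1"),
    "/v2/chat": ChatVersion(prompt_id="chat-prompt-2", model="model-b",
                            schema_id="chat-response-2"),
}


def resolve(path: str) -> ChatVersion:
    # Look up the pinned configuration for a versioned route.
    return VERSIONS[path]
```

With this layout, changing the prompt or model means adding a new entry and a new route, which makes a silent change behind an existing version harder to do by accident.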
Observability: what to surface to clients
- Request ID in every response and error. Clients include it in bug reports.
- Cost info (optional header or body field). Helps clients budget.
- Model version used. Useful for their own eval/debug.
- Retry-After on rate limits.
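As a sketch, these fields could travel as response headers. `Retry-After` is a standard HTTP header; the `X-Request-Id`, `X-Model-Version`, and `X-Cost-USD` names are assumptions chosen for illustration, as is the helper itself.

```python
import math
from typing import Optional


def response_headers(request_id: str, model_version: str,
                     cost_usd: Optional[float] = None,
                     retry_after_s: Optional[float] = None) -> dict:
    """Build the client-facing observability headers for one response."""
    headers = {
        "X-Request-Id": request_id,        # hypothetical header name
        "X-Model-Version": model_version,  # hypothetical header name
    }
    if cost_usd is not None:
        headers["X-Cost-USD"] = f"{cost_usd:.6f}"  # optional cost info
    if retry_after_s is not None:
        # Retry-After takes whole seconds; round up so clients never retry early.
        headers["Retry-After"] = str(math.ceil(retry_after_s))
    return headers
```

The request ID is always present, including on error responses, while cost and `Retry-After` appear only when they apply.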
The common design mistakes
- Synchronous endpoints for jobs that can take >30 seconds.
- No idempotency, so retries cost 2x.
- Streaming without an end-of-stream signal.
- Error messages that leak internal prompts or PII.
- No way to cancel in-flight work.