From notebook to service: the first real leap
The hard parts of making a prototype into a real API.
The gap between a notebook prototype and a production AI service is bigger than most engineers expect. It's usually not the ML — it's everything around the ML.
What you actually have after the notebook
A notebook that works for you, on your laptop, with your data, on the demo inputs. That's the 5% that was easy.
The missing 95%:
- A service that accepts requests and returns responses reliably.
- Authentication and rate limiting.
- Observability — logs, traces, metrics.
- Error handling that doesn't leak internals or hang.
- Cost controls.
- Deployment, rollback, versioning.
- Security review for the data flow.
- Someone on-call when it breaks.
The first real leap: wrap it in a service
Take the notebook logic and extract it into a function with a clear input/output contract. Then wrap it in a thin HTTP service: FastAPI, a Next.js route handler, a Cloudflare Worker, or whatever your stack already uses.
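As a rough illustration of what "a clear input/output contract" might look like, here is a minimal sketch. The summarization task, the names `SummarizeRequest`, `SummarizeResponse`, and `summarize`, and the injected `call_model` callable are all hypothetical; they stand in for whatever your notebook actually does.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SummarizeRequest:
    """Everything the function needs, stated explicitly (hypothetical task)."""
    text: str
    max_words: int = 50


@dataclass(frozen=True)
class SummarizeResponse:
    """Everything the caller gets back, with no notebook globals involved."""
    summary: str
    truncated: bool


def summarize(req: SummarizeRequest, call_model: Callable[[str], str]) -> SummarizeResponse:
    # call_model is injected so the HTTP layer, tests, and the notebook
    # can all supply their own model client.
    raw = call_model(req.text)
    words = raw.split()
    truncated = len(words) > req.max_words
    return SummarizeResponse(" ".join(words[: req.max_words]), truncated)
```

With the logic in this shape, the HTTP handler only has to parse a request into `SummarizeRequest` and serialize the `SummarizeResponse`.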
Service contract considerations:
- Timeouts. How long do you hold a request open? Long-running AI calls need async patterns — return a task ID, poll for results.
- Idempotency. Clients retry, so the same request may arrive twice. Design the endpoint so that a repeat is safe.
- Request validation. Reject malformed inputs at the edge. Don't let them reach the model.
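The three considerations above can be sketched together in a few lines. This is an in-memory illustration only, not a production design: `TaskStore`, `ValidationError`, `MAX_PROMPT_CHARS`, and the client-supplied `idempotency_key` are assumptions made for the example, and a real service would back the store with a queue or database.

```python
import uuid
from typing import Optional

MAX_PROMPT_CHARS = 8_000  # assumed limit; tune to your model's context window


class ValidationError(ValueError):
    """Raised when a request is rejected at the edge."""


def validate(payload: dict) -> str:
    """Return the cleaned prompt, or raise before anything reaches the model."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValidationError("prompt must be a non-empty string")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValidationError("prompt is too long")
    return prompt.strip()


class TaskStore:
    """In-memory stand-in for a real queue or database (hypothetical)."""

    def __init__(self):
        self.tasks = {}
        self.by_key = {}

    def submit(self, payload: dict, idempotency_key: Optional[str] = None) -> str:
        prompt = validate(payload)
        # A retried request with the same key gets the original task back.
        if idempotency_key is not None and idempotency_key in self.by_key:
            return self.by_key[idempotency_key]
        task_id = uuid.uuid4().hex
        self.tasks[task_id] = {"status": "pending", "prompt": prompt, "result": None}
        if idempotency_key is not None:
            self.by_key[idempotency_key] = task_id
        return task_id  # return immediately; a worker fills in the result later

    def complete(self, task_id: str, result: str) -> None:
        self.tasks[task_id].update(status="done", result=result)

    def poll(self, task_id: str) -> dict:
        task = self.tasks.get(task_id)
        if task is None:
            return {"status": "unknown", "result": None}
        return {"status": task["status"], "result": task["result"]}
```

The client submits once, receives a task ID right away, and polls until the status is `done`. Sending the same request again with the same key does not start a second model call.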
Streaming or not
User-facing AI that generates text should stream tokens back; users now expect it. Internal pipelines can wait for the full response.
Streaming complicates:
- Error handling (error mid-stream vs. before first token).
- Client reconnection on network blips.
- Buffering and backpressure if the client is slow.
Most frameworks handle this now (Vercel AI SDK, FastAPI's StreamingResponse, etc.). Don't reinvent.
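To make the first complication concrete, here is a framework-free sketch of the two error cases. It formats output as server-sent events; `UpstreamError`, `stream_events`, and the event names `token`, `error`, and `done` are assumptions for illustration. In a real framework you would likely need to fetch the first token before committing the response status, which this sketch does not show.

```python
import json
from typing import Iterable, Iterator


class UpstreamError(Exception):
    """Hypothetical error raised by the model client."""


def sse(event: str, data: dict) -> str:
    """Format one server-sent event frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_events(tokens: Iterable[str], request_id: str) -> Iterator[str]:
    sent_any = False
    try:
        for token in tokens:
            sent_any = True
            yield sse("token", {"text": token})
    except UpstreamError:
        if not sent_any:
            # Nothing has reached the client: let the caller return a normal HTTP error.
            raise
        # Part of the answer is already on the wire, so signal failure in-band.
        yield sse("error", {"request_id": request_id, "message": "generation interrupted"})
        return
    yield sse("done", {"request_id": request_id})
```

The explicit `done` event lets the client tell a complete answer apart from a dropped connection.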
The first observability you need
Before traffic hits the endpoint, ensure you log:
- Request ID on every call, returned to the client.
- Input length in tokens (don't log the input itself for privacy).
- Output length in tokens.
- Latency end-to-end and per-stage (retrieval, inference, post-processing).
- Model used and model version.
- Cost of the call.
Without these six, debugging in production is guesswork.
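One way to capture the fields listed above is a small per-request record that emits a single JSON log line. This is a sketch under stated assumptions: `CallRecord`, the `example-model` name, and the prices in `PRICES` are invented for illustration, and token counts are assumed to come from your provider's response.

```python
import json
import time
import uuid
from contextlib import contextmanager

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's real rates.
PRICES = {"example-model": {"input": 0.0005, "output": 0.0015}}


class CallRecord:
    """Collects the per-request fields, then emits one JSON log line."""

    def __init__(self, model: str, model_version: str):
        self.request_id = uuid.uuid4().hex  # also return this to the client
        self.model = model
        self.model_version = model_version
        self.input_tokens = 0
        self.output_tokens = 0
        self.stages_ms = {}
        self._start = time.perf_counter()

    @contextmanager
    def stage(self, name: str):
        """Time one stage, e.g. retrieval, inference, post-processing."""
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.stages_ms[name] = round((time.perf_counter() - t0) * 1000, 2)

    def cost_usd(self) -> float:
        p = PRICES[self.model]
        total = self.input_tokens * p["input"] + self.output_tokens * p["output"]
        return round(total / 1000, 6)

    def to_log_line(self) -> str:
        # Token counts only: the input text itself is deliberately not logged.
        return json.dumps({
            "request_id": self.request_id,
            "model": self.model,
            "model_version": self.model_version,
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "latency_ms": round((time.perf_counter() - self._start) * 1000, 2),
            "stages_ms": self.stages_ms,
            "cost_usd": self.cost_usd(),
        })
```

A handler would create one `CallRecord` per request, wrap each stage in `with record.stage("inference"):`, and write `record.to_log_line()` once at the end.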
The first cost controls
- Per-request max tokens. Hard cap. Prevents one user from racking up $10 of output accidentally.
- Per-user rate limit. 10 requests/minute is a sensible default for interactive use.
- Per-day budget alarm. If aggregate cost crosses a threshold, page someone.
Without these, one bad deployment or one runaway loop can cost real money.
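The three controls are small enough to sketch in one place. Everything here is illustrative: the cap of 1,024 tokens and the $50 threshold are assumed values, `RateLimiter` and `BudgetAlarm` are hypothetical names, and state lives in process memory, so a multi-instance deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque

MAX_OUTPUT_TOKENS = 1024   # assumed hard cap; pick one that fits your use case
REQUESTS_PER_MINUTE = 10   # the interactive default suggested above
DAILY_BUDGET_USD = 50.0    # assumed alarm threshold


def clamp_max_tokens(requested=None) -> int:
    """Never pass a client-supplied max_tokens to the model unclamped."""
    if requested is None:
        return MAX_OUTPUT_TOKENS
    return max(1, min(int(requested), MAX_OUTPUT_TOKENS))


class RateLimiter:
    """Sliding-window limiter, per user, in process memory."""

    def __init__(self, limit: int = REQUESTS_PER_MINUTE, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)

    def allow(self, user_id: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


class BudgetAlarm:
    """Fires once when aggregate spend crosses the threshold.

    Resetting `spent` and `fired` each day is left to a scheduler (not shown).
    """

    def __init__(self, threshold_usd: float = DAILY_BUDGET_USD, notify=print):
        self.threshold_usd = threshold_usd
        self.notify = notify  # swap in your paging integration
        self.spent = 0.0
        self.fired = False

    def record(self, cost_usd: float) -> None:
        self.spent += cost_usd
        if self.spent >= self.threshold_usd and not self.fired:
            self.fired = True
            self.notify(f"AI spend reached ${self.spent:.2f}")
```

A request handler would call `allow()` first, pass `clamp_max_tokens()` to the model call, and feed the resulting cost into `record()`.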
Where prototypes quietly fail in production
- Input assumptions. The notebook handled nice clean inputs; real traffic has emojis, different languages, empty strings, pasted HTML.
- Error surfaces. The model API returns a 500; your code throws; the client sees a generic "something went wrong." Now debug it.
- Prompt drift. The prompt worked on test inputs; real users ask weirder questions; quality degrades. Build evals on real traffic.
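The first two failure modes can be softened with a little defensive code. The sketch below is one possible approach, not a complete answer: `normalize_input`, `client_error`, `EmptyInput`, `UpstreamError`, and the error codes are hypothetical names, and stripping pasted HTML is left out on purpose because the right handling depends on your use case.

```python
import logging
import unicodedata

logger = logging.getLogger("ai_service")


class EmptyInput(ValueError):
    """Input had no usable content after normalization."""


class UpstreamError(Exception):
    """Hypothetical error raised when the model API fails."""


def normalize_input(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    # Drop control characters except newline and tab; keep emoji and non-Latin scripts.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = text.strip()
    if not text:
        raise EmptyInput("input is empty after normalization")
    return text


def client_error(exc: Exception, request_id: str):
    """Log full detail server-side; give the client a safe, traceable payload."""
    logger.error("request %s failed", request_id, exc_info=exc)
    if isinstance(exc, EmptyInput):
        status, code = 400, "invalid_input"
    elif isinstance(exc, TimeoutError):
        status, code = 504, "upstream_timeout"
    elif isinstance(exc, UpstreamError):
        status, code = 502, "upstream_error"
    else:
        status, code = 500, "internal_error"
    # No stack trace or provider message leaves the service, only the request ID.
    return status, {"error": code, "request_id": request_id}
```

When a user reports "something went wrong", the request ID in the payload is what lets you find the matching log line.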
The "ship something" advice
Don't over-engineer the first deploy. A Next.js route handler calling the OpenAI API with basic error handling and logging is a legitimate production service for many use cases. Resist building a "proper ML platform" on day one.
You'll learn what you actually need from running it.