Observability for AI systems
What to trace, log, and alert on when the unit of work is a generation.
Traditional APM falls short for AI systems: the unit of work is a generation, not a request. Here's what to instrument and what to surface.
What makes AI observability different
- Outputs are non-deterministic. "Did this work?" isn't answerable from status codes.
- Costs vary per request, often by 10-100x. Budget overruns are a failure mode.
- Quality varies per request. Latency-focused metrics miss degradation.
- Errors are often semantic ("wrong answer"), not mechanical ("500 error").
You need a layer that understands AI-specific concepts: prompts, completions, tools, chains.
The minimum viable trace
For every AI request, capture:
- Request ID (client-facing).
- Trace ID (linking to any downstream tool calls).
- User / session identifiers (hashed if privacy-sensitive).
- Prompt template name + version.
- Model name + version.
- Input token count (not the content).
- Output token count.
- Total latency.
- Stage-by-stage latency (retrieval, model call, post-processing).
- Cost in dollars.
- Outcome signal — thumbs up/down, completion, bounce, whatever you can capture.
- Sampled content — optionally store full input/output for a percentage of traffic for debugging.
Most provider-agnostic observability tools (Langfuse, Helicone, Braintrust, OpenLLMetry + Honeycomb) give you this with minimal integration.
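If you want to see the shape of such a record before adopting a tool, here is a minimal sketch in Python. The field names, the `PRICE_PER_MTOK` table, the `example-model` name, and the helper functions are all hypothetical and chosen for illustration; real tools define their own schemas, and real prices come from your provider.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional

# Hypothetical prices in dollars per million tokens; replace with real rates.
PRICE_PER_MTOK = {"example-model": {"input": 1.00, "output": 3.00}}


def hash_user_id(user_id: str, salt: str) -> str:
    """One-way hash so traces group by user without storing the raw ID."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]


def compute_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one generation from token counts and the price table."""
    rates = PRICE_PER_MTOK[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


@dataclass
class GenerationTrace:
    request_id: str
    trace_id: str
    user_hash: str
    prompt_template: str
    prompt_version: str
    model: str
    model_version: str
    input_tokens: int           # counts only, not the content
    output_tokens: int
    total_latency_ms: float
    stage_latency_ms: Dict[str, float]
    cost_usd: float
    outcome: Optional[str] = None                     # filled in later, if captured
    sampled_content: Optional[Dict[str, str]] = None  # only for sampled traffic

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


example = GenerationTrace(
    request_id="req-123",
    trace_id="trace-456",
    user_hash=hash_user_id("user-42", salt="example-salt"),
    prompt_template="support_answer",
    prompt_version="v7",
    model="example-model",
    model_version="2024-01",
    input_tokens=1000,
    output_tokens=500,
    total_latency_ms=1840.0,
    stage_latency_ms={"retrieval": 220.0, "model_call": 1550.0, "post_processing": 70.0},
    cost_usd=compute_cost_usd("example-model", 1000, 500),
)
print(example.to_json())
```

Storing cost on the record, rather than deriving it later, keeps historical numbers stable when prices change.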
What to alert on
Latency alerts you already know how to do. AI-specific alerts:
- Cost per hour crossed threshold. Indicates a loop, bad prompt, or traffic spike.
- Error rate from provider > baseline. Provider degradation — page someone.
- Average output tokens jumped. Often a prompt change that made outputs verbose.
- Quality signal degraded (if you have thumbs down rate, completion rate, etc.).
- Tool error rate on a specific tool crossed threshold. Isolated breakage in the agent stack.
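The alert conditions above can be sketched as threshold checks over a windowed aggregate. This is an illustration only: the `WindowStats` shape, the alert names, and every default threshold below are assumptions you would tune against your own baseline, and in practice these rules usually live in your monitoring system rather than in application code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WindowStats:
    """Aggregates for one time window, for example the last hour."""
    cost_usd: float
    requests: int
    provider_errors: int
    avg_output_tokens: float
    thumbs_down_rate: float
    # tool name -> (errors, calls)
    tool_calls: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def rate(numerator: float, denominator: float) -> float:
    return numerator / denominator if denominator else 0.0


def check_alerts(
    current: WindowStats,
    baseline: WindowStats,
    cost_limit_usd: float = 50.0,    # hypothetical hourly budget
    error_ratio: float = 2.0,        # alert at 2x the baseline error rate
    min_error_rate: float = 0.01,    # floor so a near-zero baseline is not noisy
    token_ratio: float = 1.5,        # alert at 1.5x baseline output length
    quality_delta: float = 0.05,     # absolute rise in thumbs-down rate
    tool_error_limit: float = 0.10,
) -> List[str]:
    alerts = []
    if current.cost_usd > cost_limit_usd:
        alerts.append("cost_per_hour")

    baseline_error_rate = rate(baseline.provider_errors, baseline.requests)
    error_threshold = max(error_ratio * baseline_error_rate, min_error_rate)
    if rate(current.provider_errors, current.requests) > error_threshold:
        alerts.append("provider_error_rate")

    if current.avg_output_tokens > token_ratio * baseline.avg_output_tokens:
        alerts.append("output_tokens")

    if current.thumbs_down_rate > baseline.thumbs_down_rate + quality_delta:
        alerts.append("quality_signal")

    for tool, (errors, calls) in sorted(current.tool_calls.items()):
        if rate(errors, calls) > tool_error_limit:
            alerts.append(f"tool_errors:{tool}")
    return alerts


baseline = WindowStats(20.0, 1000, 5, 200.0, 0.04)
current = WindowStats(80.0, 1000, 40, 450.0, 0.05,
                      {"search": (30, 100), "calc": (1, 100)})
print(check_alerts(current, baseline))
```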
The trace view that matters
When something goes wrong, you need to see:
- The exact prompt sent to the model (including all interpolations).
- The exact response received.
- Any tools called, with arguments and results.
- Any retrieval context fetched.
- The entire chain, top to bottom, with timing.
Building this yourself is substantial work; adopting a tool saves months.
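To make the data behind that view concrete, here is a toy sketch of nested, timed spans. It is not a substitute for a tracing tool: the `Trace` class, its `span` method, and the span names are hypothetical, and a real system would also handle concurrency, export, and storage.

```python
import time
from contextlib import contextmanager


class Trace:
    """Records nested spans with parent links and wall-clock durations."""

    def __init__(self):
        self.spans = []    # spans in the order they started
        self._stack = []   # names of the currently open spans

    @contextmanager
    def span(self, name, **attrs):
        record = {
            "name": name,
            "parent": self._stack[-1] if self._stack else None,
            "depth": len(self._stack),
            "attrs": dict(attrs),
            "duration_ms": None,
        }
        self.spans.append(record)
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()

    def render(self):
        """Indented top-to-bottom view of the chain with timing."""
        return "\n".join(
            f"{'  ' * s['depth']}{s['name']} ({s['duration_ms']:.1f} ms)"
            for s in self.spans
        )


trace = Trace()
with trace.span("request", request_id="req-123"):
    with trace.span("retrieval", query="refund policy") as s:
        s["attrs"]["documents"] = ["doc-1", "doc-9"]
    with trace.span("model_call", prompt="<exact prompt after interpolation>") as s:
        s["attrs"]["response"] = "<exact response>"
        with trace.span("tool:search", arguments={"q": "refund"}) as t:
            t["attrs"]["result"] = "<tool result>"
print(trace.render())
```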
Sampling policy
Don't log every full prompt/response — expensive and a privacy risk. Sample:
- 100% for errors and low-confidence outputs.
- 10% for normal traffic, rotating.
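- 100% for requests flagged by users (thumbs down). Because the flag arrives after the response, hold full content in a short-lived buffer so it can be persisted once the flag comes in.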
Keep logs for 14-30 days. Longer for errors.
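One way to express the sampling rules above is a single decision function. This is a sketch under assumptions: the function name and flags are hypothetical, and hashing the request ID is just one option for picking the 10%, used here so that every service seeing the same request makes the same decision.

```python
import hashlib


def should_store_content(
    request_id: str,
    is_error: bool = False,
    low_confidence: bool = False,
    user_flagged: bool = False,
    normal_rate: float = 0.10,
) -> bool:
    """Decide whether to persist the full prompt and response for a request."""
    # Errors, low-confidence outputs, and user-flagged requests are always kept.
    if is_error or low_confidence or user_flagged:
        return True
    # Map the request ID to a stable value in [0, 1) and keep the lowest slice.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < normal_rate


kept = sum(should_store_content(f"req-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 normal requests")
```

Token counts, latency, and cost are still recorded for every request; only the full content is sampled.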
Privacy-aware logging
- Never log full PII in prompt or response by default.
- Use per-user opt-in for full content logging if needed for debugging.
- Redact patterns (emails, phone numbers, SSNs, credit cards) at the logger level.
- Separate system prompt logs (fine to keep) from user content logs (tight retention).
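Logger-level redaction could look roughly like the sketch below. The patterns are illustrative assumptions: they are US-centric, will miss some formats, and will occasionally redact harmless numbers, so treat regex redaction as a best-effort layer rather than a guarantee.

```python
import re

# Order matters: more specific patterns run before the looser phone pattern.
REDACTION_PATTERNS = [
    ("EMAIL", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    ("CREDIT_CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    ("PHONE", re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"
    )),
]


def redact(text: str) -> str:
    """Replace each matched pattern with a bracketed label."""
    for label, pattern in REDACTION_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


sample = (
    "Email jane.doe@example.com or call 415-555-0123. "
    "SSN 123-45-6789, card 4111 1111 1111 1111."
)
print(redact(sample))
```

Calling `redact` inside the logging path, before anything is written, means no downstream store ever sees the raw values.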
The "three windows" UI your team will actually use
- Live feed: last N requests, in real time. Good for "what's happening right now?"
- Per-user view: all requests from one user over time. Good for investigating support complaints.
- Compare two prompts: side-by-side, same inputs, different prompts. Essential for prompt iteration.
If your observability tool lacks these, you'll end up building them in spreadsheets.
Tying outcomes back to requests
The most underrated piece: feedback signals (thumbs, completion, conversion) must be joinable with the request that produced them.
- Pass the request ID through to the UI.
- When the user reacts, record the request ID + reaction.
- Join reactions to requests on the request ID in analytics.
Without this, you can't know which prompt version caused which outcome.
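As a sketch of that join, assume request logs and feedback events are lists of dictionaries sharing a `request_id` key. The record shapes and the function name are hypothetical; in practice this is usually a SQL join in your warehouse.

```python
from collections import defaultdict


def thumbs_down_rate_by_prompt_version(requests, feedback):
    """Join feedback to requests on request_id, then aggregate per prompt version."""
    version_by_request = {r["request_id"]: r["prompt_version"] for r in requests}
    counts = defaultdict(lambda: {"up": 0, "down": 0})
    for event in feedback:
        version = version_by_request.get(event["request_id"])
        if version is None:
            continue  # feedback with no matching request cannot be attributed
        counts[version][event["reaction"]] += 1
    return {
        version: c["down"] / (c["up"] + c["down"])
        for version, c in counts.items()
    }


requests = [
    {"request_id": "r1", "prompt_version": "v1"},
    {"request_id": "r2", "prompt_version": "v1"},
    {"request_id": "r3", "prompt_version": "v2"},
    {"request_id": "r4", "prompt_version": "v2"},
]
feedback = [
    {"request_id": "r1", "reaction": "down"},
    {"request_id": "r2", "reaction": "up"},
    {"request_id": "r3", "reaction": "up"},
    {"request_id": "r4", "reaction": "up"},
    {"request_id": "r9", "reaction": "down"},  # unknown request, ignored
]
print(thumbs_down_rate_by_prompt_version(requests, feedback))
```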
Check your understanding
Q1. Why doesn't traditional APM cover AI well?
Q2. What's the MINIMUM that should always be captured per AI call?