Observability for AI systems

Traditional APM doesn't work for AI. The unit of work is a generation, not a request. Here's what to instrument and what to surface.

What makes AI observability different

Outputs are non-deterministic. "Did this work?" isn't answerable from status codes.
Costs vary per request, often by 10-100x. Budget overruns are a failure mode.
Quality varies per request. Latency-focused metrics miss degradation.
Errors are often semantic ("wrong answer"), not mechanical ("500 error").

You need a layer that understands the AI-specific concepts: prompts, completions, tools, chains.

The minimum viable trace

For every AI request, capture:

Request ID (client-facing).
Trace ID (linking to any downstream tool calls).
User / session identifiers (hashed if privacy-sensitive).
Prompt template name + version.
Model name + version.
Input token count (not the content).
Output token count.
Total latency.
Stage-by-stage latency (retrieval, model call, post-processing).
Cost in dollars.
Outcome signal — thumbs up/down, completion, bounce, whatever you can capture.
Sampled content — optionally store full input/output for a percentage of traffic for debugging.

Most provider-agnostic observability tools (Langfuse, Helicone, Braintrust, OpenLLMetry + Honeycomb) give you this with minimal integration.
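
If you do roll your own, the field list above maps naturally onto a single record type. The following is a minimal sketch, not any particular tool's schema: the names GenerationTrace, hash_user_id and estimate_cost_usd, the salt, and the per-1k-token prices are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


def hash_user_id(user_id: str, salt: str) -> str:
    """Salted SHA-256 digest, so the raw identifier never reaches the trace store."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Cost from token counts; the prices are inputs, not real provider rates."""
    return (input_tokens / 1000) * usd_per_1k_input + (output_tokens / 1000) * usd_per_1k_output


@dataclass
class GenerationTrace:
    request_id: str                      # client-facing
    trace_id: str                        # links downstream tool calls
    user_hash: str                       # hashed user / session identifier
    prompt_template: str
    prompt_version: str
    model: str
    input_tokens: int                    # count only, not the content
    output_tokens: int
    total_latency_ms: float
    stage_latency_ms: Dict[str, float]   # e.g. retrieval, model_call, post_processing
    cost_usd: float
    outcome: Optional[str] = None        # filled in later when feedback arrives
    sampled_content: Optional[Dict[str, str]] = None  # only for sampled traffic


trace = GenerationTrace(
    request_id="req-123",
    trace_id="trace-abc",
    user_hash=hash_user_id("user-42", salt="example-salt"),
    prompt_template="support_answer",
    prompt_version="v7",
    model="example-model-2024-01",
    input_tokens=1000,
    output_tokens=500,
    total_latency_ms=1840.0,
    stage_latency_ms={"retrieval": 120.0, "model_call": 1650.0, "post_processing": 70.0},
    cost_usd=estimate_cost_usd(1000, 500, usd_per_1k_input=0.5, usd_per_1k_output=1.5),
)
record = json.dumps(asdict(trace))  # one JSON line per generation
```

Keeping outcome and sampled_content optional means the record can be written at request time and enriched later.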

What to alert on

You already know how to alert on latency. The AI-specific alerts are:

Cost per hour crosses a threshold. Indicates a loop, a bad prompt, or a traffic spike.
Provider error rate rises above baseline. Indicates provider degradation; page someone.
Average output tokens jump. Often caused by a prompt change that made outputs verbose.
Quality signal degrades (thumbs-down rate, completion rate, or whatever you track).
Error rate on a specific tool crosses a threshold. Indicates isolated breakage in the agent stack.
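
These checks can be sketched as one function that compares the current window against a baseline window. This is an illustrative sketch only: WindowMetrics, check_alerts, and every threshold value (50 dollars per hour, a 1.5x jump ratio, a 5% tool error rate) are assumptions to be replaced with numbers from your own traffic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WindowMetrics:
    """Aggregates over one time window (for example, the last hour)."""
    cost_usd_per_hour: float
    provider_error_rate: float
    avg_output_tokens: float
    thumbs_down_rate: float
    tool_error_rates: Dict[str, float] = field(default_factory=dict)


def check_alerts(current: WindowMetrics, baseline: WindowMetrics,
                 max_cost_usd_per_hour: float = 50.0,
                 jump_ratio: float = 1.5,
                 max_tool_error_rate: float = 0.05) -> List[str]:
    """Return the names of the alerts that fire for the current window."""
    alerts: List[str] = []
    # Absolute threshold: a loop, a bad prompt, or a traffic spike.
    if current.cost_usd_per_hour > max_cost_usd_per_hour:
        alerts.append("cost_per_hour")
    # Relative thresholds: compare against the baseline window.
    if current.provider_error_rate > baseline.provider_error_rate * jump_ratio:
        alerts.append("provider_error_rate")
    if current.avg_output_tokens > baseline.avg_output_tokens * jump_ratio:
        alerts.append("avg_output_tokens")
    if current.thumbs_down_rate > baseline.thumbs_down_rate * jump_ratio:
        alerts.append("quality_signal")
    # Per-tool threshold, so one broken tool is not averaged away.
    for tool, rate in sorted(current.tool_error_rates.items()):
        if rate > max_tool_error_rate:
            alerts.append(f"tool_error_rate:{tool}")
    return alerts


baseline = WindowMetrics(10.0, 0.01, 300.0, 0.05)
current = WindowMetrics(80.0, 0.01, 700.0, 0.05, {"search": 0.2, "calc": 0.0})
fired = check_alerts(current, baseline)
```

Note that a baseline of exactly zero makes any nonzero value fire the relative checks; a real setup would likely add a minimum floor.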

The trace view that matters

When something goes wrong, you need to see:

The exact prompt sent to the model (including all interpolations).
The exact response received.
Any tools called, with arguments and results.
Any retrieval context fetched.
The entire chain, top to bottom, with timing.

Building this yourself is substantial work; adopting a tool saves months.
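
To make the shape of that view concrete, here is a small sketch of a nested span tree rendered top to bottom with timing. The Span type and render function are hypothetical and far simpler than what a real tracing tool stores; the detail field stands in for the prompt, response, tool arguments, or retrieved context.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    duration_ms: float
    detail: str = ""                      # prompt, response, tool args, context, ...
    children: List["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> List[str]:
    """Flatten the span tree into indented lines, parents before children."""
    line = f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)"
    if span.detail:
        line += f": {span.detail}"
    lines = [line]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


root = Span("request", 1840, children=[
    Span("retrieval", 120, "3 documents"),
    Span("model_call", 1650, children=[
        Span("tool:search", 400, "query=refund policy"),
    ]),
    Span("post_processing", 70),
])
print("\n".join(render(root)))
```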

Sampling policy

Don't log every full prompt and response: it is expensive and a privacy risk. Instead, sample:

100% for errors and low-confidence outputs.
10% for normal traffic, rotating.
100% for requests flagged by users (thumbs down).

Keep sampled logs for 14-30 days, and keep error logs longer.
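
One way to implement this policy is a single decision function. The sketch below is an assumption about how you might do it, not a prescribed method: should_store_content and its flags are hypothetical names, and hashing the request ID is just one way to get a stable 10% sample without relying on random state.

```python
import hashlib


def should_store_content(request_id: str,
                         is_error: bool = False,
                         low_confidence: bool = False,
                         user_flagged: bool = False,
                         sample_rate: float = 0.10) -> bool:
    """Decide whether to keep the full prompt/response for this request."""
    # Errors, low-confidence outputs, and user-flagged requests are always kept.
    if is_error or low_confidence or user_flagged:
        return True
    # Map the request ID to a stable number in [0, 1) and keep the lowest slice.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


kept = sum(should_store_content(f"req-{i}") for i in range(10_000))
```

Because the decision depends only on the request ID, every service that sees the same request makes the same choice. A thumbs down usually arrives after the response has been sent, so keeping flagged requests at 100% implies holding content briefly before deciding to discard it.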

Privacy-aware logging

Never log full PII in prompt or response by default.
Use per-user opt-in for full content logging if needed for debugging.
Redact patterns (emails, phone numbers, SSNs, credit cards) at the logger level.
Separate system prompt logs (fine to keep) from user content logs (tight retention).
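
Redaction at the logger level can be sketched with Python's standard logging module. The regular expressions below are deliberately simple, US-centric illustrations and will miss or over-match real-world formats; redact and RedactingFilter are hypothetical names. Treat this as a best-effort safety net, not a guarantee that no PII is logged.

```python
import logging
import re

# Order matters: card numbers are replaced first so the phone pattern
# cannot match part of a longer digit run.
PATTERNS = [
    ("[CARD]", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}(?!\d)")),
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
]


def redact(text: str) -> str:
    """Replace each matched pattern with a fixed placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


class RedactingFilter(logging.Filter):
    """Redacts the fully formatted message before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # arguments are already merged into msg
        return True


logger = logging.getLogger("ai.requests")
logger.addFilter(RedactingFilter())
```

Attaching the filter to the logger, rather than calling redact at each call site, means a forgotten call cannot leak content through that logger.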

The "three windows" UI your team will actually use

Live feed: last N requests, in real-time. Good for "what's happening right now?"
Per-user view: all requests from one user over time. Good for supporting complaints.
Compare two prompts: side-by-side, same inputs, different prompts. Essential for prompt iteration.

If your observability tool lacks these, you'll end up building them in spreadsheets.

Tying outcomes back to requests

The most underrated piece: feedback signals (thumbs, completion, conversion) must be joinable with the request that produced them.

Pass the request ID through to the UI.
When the user reacts, record the request ID + reaction.
Join the two on request ID in your analytics store.

Without this, you can't know which prompt version caused which outcome.
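
As a rough sketch of that join, assume request records and feedback events are plain dictionaries sharing a request_id key. The function name, the field names, and the "up"/"down" reaction values are all hypothetical; in practice this would more likely be a SQL join in your warehouse.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def thumbs_down_rate_by_version(requests: Iterable[Mapping[str, str]],
                                feedback: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Join feedback to requests on request_id, then aggregate per prompt version."""
    version_by_request = {r["request_id"]: r["prompt_version"] for r in requests}
    counts = defaultdict(lambda: {"up": 0, "down": 0})
    for event in feedback:
        version = version_by_request.get(event["request_id"])
        reaction = event["reaction"]
        # Skip feedback that cannot be joined or is not a thumbs reaction.
        if version is None or reaction not in ("up", "down"):
            continue
        counts[version][reaction] += 1
    return {v: c["down"] / (c["up"] + c["down"]) for v, c in counts.items()}


requests_log = [
    {"request_id": "r1", "prompt_version": "v1"},
    {"request_id": "r2", "prompt_version": "v1"},
    {"request_id": "r3", "prompt_version": "v2"},
    {"request_id": "r4", "prompt_version": "v2"},
]
feedback_log = [
    {"request_id": "r1", "reaction": "down"},
    {"request_id": "r2", "reaction": "up"},
    {"request_id": "r3", "reaction": "up"},
    {"request_id": "r4", "reaction": "up"},
    {"request_id": "r9", "reaction": "down"},  # no matching request: dropped
]
rates = thumbs_down_rate_by_version(requests_log, feedback_log)
```

The feedback event for r9 shows what happens when the request ID is not passed through: the reaction exists but cannot be attributed to any prompt version.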
