Handling failure gracefully: timeouts, fallbacks, degraded modes
Your AI service will fail. These are the patterns for surviving it.
Every AI system fails. The question is whether your users feel the failure or not. Graceful degradation is a design discipline, not an afterthought.
The failure taxonomy
Know the modes:
- Provider outage. OpenAI or Anthropic is down. You can't fix it, but you can recover from it.
- Rate limit. You're hitting the provider's limit.
- Timeout. Request taking too long.
- Validation failure. Output doesn't match your schema.
- Content filter. Model refused to answer.
- Semantic failure. Output is valid but wrong.
- Tool failure. Downstream API broke.
- Infinite loop. Agent stuck.
Different modes, different responses.
The three rings of resilience
Ring 1: Prevention. Design to avoid failure.
- Validate inputs.
- Use structured output constraints.
- Cap loops.
- Set conservative timeouts.
Ring 2: Detection. Notice failure when it happens.
- Schema validation on outputs.
- Confidence thresholds.
- Known-bad pattern detection.
Ring 3: Recovery. Fail gracefully.
- Retry with backoff.
- Fallback to a cheaper / alternate model.
- Degraded UX that still serves value.
- Clear error messages that don't leak internals.
Retry strategy
Not all errors retry equally:
| Error | Retry? | Notes |
|---|---|---|
| Provider 5xx | Yes, exponential backoff | 3 attempts typical |
| Provider 429 rate limit | Yes, honor Retry-After | Don't retry sooner than the header allows |
| Timeout | Yes, with a longer timeout | Increase the timeout exponentially per attempt |
| Validation (schema) | Yes, ≤1 retry with stricter prompt | More retries rarely help |
| Content filter | No, don't retry | The user may need to rephrase the request |
| Auth error | No | Fix the key |
| Tool 4xx | Maybe | Depends on the tool |
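The retry rules in the table can be sketched in a few lines. This is a minimal illustration, not any provider SDK's behavior: `ProviderError`, `call_with_retry`, and `backoff_delay` are hypothetical names, and the sketch assumes the caller has already parsed the Retry-After header into seconds.

```python
import random
import time
from typing import Callable, Optional


class ProviderError(Exception):
    """Hypothetical error type; real SDKs expose their own exception classes."""

    def __init__(self, status: int, retry_after: Optional[float] = None):
        super().__init__(f"provider returned {status}")
        self.status = status
        self.retry_after = retry_after  # seconds, taken from a Retry-After header


# 429 and 5xx are retried; auth errors and other 4xx are not.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(fn: Callable[[], str], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    for attempt in range(max_attempts):
        try:
            return fn()
        except ProviderError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUSES or last_attempt:
                raise  # non-retryable, or out of attempts
            if err.status == 429 and err.retry_after is not None:
                delay = err.retry_after  # never retry sooner than the provider asked
            else:
                delay = backoff_delay(attempt)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. The jitter is a design choice of this sketch; it spreads retries out so many clients don't hit the provider at the same instant.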
The cheap-model fallback
For latency- or availability-sensitive endpoints, wire a fallback chain:
- Try primary (Claude Sonnet, say).
- If unavailable or too slow, try secondary (GPT-4o).
- If both fail, try tertiary (gpt-4o-mini).
- If all fail, return a graceful error.
Trade-off: different models give subtly different responses, so a fallback changes behavior as well as availability. For critical flows, consistency matters; for best-effort flows, availability wins.
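A fallback chain like the one above can be sketched as an ordered list of callables. The names here (`complete_with_fallback`, `AllModelsFailed`) are hypothetical, and the sketch assumes each callable wraps one model and raises an exception on outage, rate limit, or its own timeout.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: each model is wrapped in a callable that takes a prompt and
# either returns text or raises (including on its own timeout).
ModelCall = Callable[[str], str]


class AllModelsFailed(Exception):
    """Raised when every model in the chain has failed."""

    def __init__(self, errors: List[Tuple[str, Exception]]):
        super().__init__("all models in the fallback chain failed")
        self.errors = errors


def complete_with_fallback(
    prompt: str, chain: Sequence[Tuple[str, ModelCall]]
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (name, text) from the first success."""
    errors: List[Tuple[str, Exception]] = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as err:  # outage, rate limit, or timeout from the callable
            errors.append((name, err))
    raise AllModelsFailed(errors)
```

Returning the model name alongside the text lets the caller log which tier actually served the request, which helps when responses differ across models. The caller catches `AllModelsFailed` and turns it into the graceful error.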
Degraded mode UX
When the AI is unavailable, what does the user see?
- A "try again" button? Usable, not great.
- A static fallback response? Often better than nothing for common queries.
- A "we're experiencing issues" banner with estimated resolution? Honest and calm.
- Silent failure? Worst option — user is confused.
Design this explicitly, per endpoint.
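One way to make the per-endpoint decision explicit is a small policy table. The sketch below is illustrative only: the endpoint paths, messages, and names such as `DEGRADED_POLICIES` are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class DegradedMode(Enum):
    RETRY_PROMPT = "retry_prompt"        # "try again" button
    STATIC_FALLBACK = "static_fallback"  # canned response for common queries
    STATUS_BANNER = "status_banner"      # "we're experiencing issues" banner


@dataclass(frozen=True)
class DegradedResponse:
    mode: DegradedMode
    message: str


# Hypothetical endpoints and copy; each endpoint gets a deliberate choice.
DEGRADED_POLICIES = {
    "/search-summary": DegradedResponse(
        DegradedMode.STATIC_FALLBACK,
        "Summaries are unavailable right now. Showing plain search results.",
    ),
    "/chat": DegradedResponse(
        DegradedMode.STATUS_BANNER,
        "We're experiencing issues with the assistant.",
    ),
}

# Endpoints without an explicit policy still get a visible response,
# so nothing fails silently.
DEFAULT_POLICY = DegradedResponse(
    DegradedMode.RETRY_PROMPT, "Something went wrong. Please try again."
)


def degraded_response(endpoint: str) -> DegradedResponse:
    return DEGRADED_POLICIES.get(endpoint, DEFAULT_POLICY)
```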
Schema validation and re-ask
When the model returns invalid JSON or wrong fields:
- Validate immediately with Zod / Pydantic.
- Log the invalid output.
- Re-ask the model with: "Your previous response didn't match the expected schema. Please return a response matching exactly: [schema]. Your previous response: [response]".
- If the second attempt also fails, return a structured error to the caller.
One retry is worth it. If you need more than one, you are usually chasing a broken prompt.
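The validate, log, re-ask, then fail sequence might look like the sketch below. To stay dependency-free it uses a hand-rolled check where Pydantic or Zod would normally go. The schema, `call_model`, and `generate_validated` are assumptions for illustration, and `call_model` is assumed to be stateless, so the original prompt is resent with the re-ask.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("reask")

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"title": str, "priority": int}
SCHEMA_TEXT = '{"title": string, "priority": integer}'


def validate(raw: str) -> dict:
    """Return the parsed object if it matches the schema, else raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"invalid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return data


def generate_validated(prompt: str, call_model: Callable[[str], str]) -> dict:
    raw = call_model(prompt)
    try:
        return {"ok": True, "data": validate(raw)}
    except ValueError as err:
        logger.warning("invalid model output: %s | raw=%r", err, raw)

    # Exactly one re-ask, using the wording from the lesson.
    reask = (
        f"{prompt}\n\n"
        "Your previous response didn't match the expected schema. "
        f"Please return a response matching exactly: {SCHEMA_TEXT}. "
        f"Your previous response: {raw}"
    )
    raw = call_model(reask)
    try:
        return {"ok": True, "data": validate(raw)}
    except ValueError as err:
        logger.warning("re-ask also failed: %s | raw=%r", err, raw)
        return {"ok": False, "error": "schema_validation_failed", "detail": str(err)}
```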
Bounded agent loops
Every agent needs termination conditions:
- Max steps (e.g., 20).
- Max time (e.g., 5 minutes).
- Max cost (e.g., $1 per run).
- Max consecutive failures (e.g., 3 — the agent can't make progress).
Hit any limit → abort with a partial result. Log the reason.
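A bounded loop with all four limits can be sketched as follows. `StepResult`, `RunResult`, and `run_agent` are hypothetical names, and the sketch assumes each step reports its own cost and whether it succeeded.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, List

logger = logging.getLogger("agent")


@dataclass
class StepResult:
    ok: bool            # did this step make progress?
    cost_usd: float     # assumed to be reported by the step itself
    done: bool = False  # True when the agent considers the task finished
    output: str = ""


@dataclass
class RunResult:
    outputs: List[str]  # partial result if the run was aborted
    stop_reason: str


def run_agent(
    step: Callable[[], StepResult],
    max_steps: int = 20,
    max_seconds: float = 300.0,
    max_cost_usd: float = 1.0,
    max_consecutive_failures: int = 3,
    clock: Callable[[], float] = time.monotonic,
) -> RunResult:
    start = clock()
    spent = 0.0
    failures = 0
    outputs: List[str] = []

    def stop(reason: str) -> RunResult:
        logger.info("agent stopped: %s (steps kept=%d, spent=$%.2f)",
                    reason, len(outputs), spent)
        return RunResult(outputs, reason)

    for _ in range(max_steps):
        # Time and cost are checked before each step, so a single step
        # can still overshoot either limit slightly.
        if clock() - start >= max_seconds:
            return stop("max_time")
        if spent >= max_cost_usd:
            return stop("max_cost")
        result = step()
        spent += result.cost_usd
        if result.ok:
            failures = 0
            outputs.append(result.output)
            if result.done:
                return stop("completed")
        else:
            failures += 1
            if failures >= max_consecutive_failures:
                return stop("max_consecutive_failures")
    return stop("max_steps")
```

Whatever the stop reason, the caller gets back the outputs collected so far, which is the partial result.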
The silent failure
Worst case: the AI endpoint returns a plausible but wrong answer, and users act on it without verification. No error surfaces, no alert fires, and wrong decisions ship.
Mitigations:
- Confidence thresholds — low-confidence results flagged or routed to human.
- Post-hoc quality sampling — weekly review of random outputs.
- User feedback surfaces — easy "this was wrong" button.
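The first two mitigations can be sketched in a few lines. This assumes your pipeline already produces some confidence score between 0 and 1 (for example from a separate scoring step); models do not provide a calibrated one by default. The function names and the 0.7 threshold are illustrative.

```python
import random
from typing import List, Optional, Sequence


def route_by_confidence(answer: str, confidence: float, threshold: float = 0.7) -> dict:
    """Flag low-confidence answers for human review instead of returning them as-is."""
    status = "needs_review" if confidence < threshold else "auto"
    return {"answer": answer, "confidence": confidence, "status": status}


def sample_for_review(outputs: Sequence[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Pick up to k random outputs (without replacement) for a periodic quality review."""
    rng = random.Random(seed)
    return rng.sample(list(outputs), min(k, len(outputs)))
```

Sampling uniformly at random, rather than reviewing only flagged outputs, is what catches the confident but wrong answers that thresholds miss.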
The on-call check
Imagine you get paged at 2am: "AI service is down." Can you:
- See what's failing (observability)?
- Verify the status of upstream providers?
- Cut over to fallback or disable gracefully?
- Understand what users are seeing?
If not, that's tomorrow's work.