Handling failure gracefully: timeouts, fallbacks, degraded modes

Every AI system fails. The question is whether your users feel the failure or not. Graceful degradation is a design discipline, not an afterthought.

The failure taxonomy

Know the modes:

Provider outage. OpenAI / Anthropic down. Not fixable by you; recoverable by you.
Rate limit. You're hitting the provider's limit.
Timeout. Request taking too long.
Validation failure. Output doesn't match your schema.
Content filter. Model refused to answer.
Semantic failure. Output is valid but wrong.
Tool failure. Downstream API broke.
Infinite loop. Agent stuck.

Different modes, different responses.

The three rings of resilience

Ring 1: Prevention. Design to avoid failure.

Validate inputs.
Use structured output constraints.
Cap loops.
Conservative timeouts.

Ring 2: Detection. Notice failure when it happens.

Schema validation on outputs.
Confidence thresholds.
Known-bad pattern detection.

Ring 3: Recovery. Fail gracefully.

Retry with backoff.
Fallback to a cheaper / alternate model.
Degraded UX that still serves value.
Clear error messages that don't leak internals.

Retry strategy

Not all errors retry equally:

Error	Retry?	Notes
Provider 5xx	Yes, exponential backoff	3 attempts typical
Provider 429 rate limit	Yes, honor Retry-After	Don't retry faster
Timeout	Yes, with increased timeout	Exponential
Validation (schema)	Yes, ≤1 retry with stricter prompt	More retries rarely help
Content filter	No, don't retry	Might need reformulating by user
Auth error	No	Fix the key
Tool 4xx	Maybe	Depends on the tool

The cheap-model fallback

For latency- or availability-sensitive endpoints, wire a fallback chain:

Try primary (Claude Sonnet, say).
If unavailable or too slow, try secondary (GPT-4o).
If both fail, try tertiary (gpt-4o-mini).
If all fail, return a graceful error.

Trade-offs: responses across models differ subtly. For critical flows, consistency matters; for best-effort flows, availability wins.

Degraded mode UX

When the AI is unavailable, what does the user see?

A "try again" button? Usable, not great.
A static fallback response? Often better than nothing for common queries.
A "we're experiencing issues" banner with estimated resolution? Honest and calm.
Silent failure? Worst option — user is confused.

Design this explicitly, per endpoint.

Schema validation and re-ask

When the model returns invalid JSON or wrong fields:

Validate immediately with Zod / Pydantic.
Log the invalid output.
Re-ask the model with: "Your previous response didn't match the expected schema. Please return a response matching exactly: [schema]. Your previous response: [response]".
If the second attempt also fails, return a structured error to the caller.

One retry is worth it. Beyond one retry is usually chasing a broken prompt.

Bounded agent loops

Every agent needs termination conditions:

Max steps (e.g., 20).
Max time (e.g., 5 minutes).
Max cost (e.g., $1 per run).
Max consecutive failures (e.g., 3 — the agent can't make progress).

Hit any limit → abort with a partial result. Log the reason.

The silent failure

Worst case: AI endpoint returns a plausible wrong answer and users act on it without verification. No error surface, no alert, just wrong decisions shipping.

Mitigations:

Confidence thresholds — low-confidence results flagged or routed to human.
Post-hoc quality sampling — weekly review of random outputs.
User feedback surfaces — easy "this was wrong" button.

The on-call check

Imagine you get paged at 2am: "AI service is down." Can you:

See what's failing (observability)?
Verify the status of upstream providers?
Cut over to fallback or disable gracefully?
Understand what users are seeing?

If not, that's tomorrow's work.

Every AI system fails. The question is whether your users feel the failure or not. Graceful degradation is a design discipline, not an afterthought.

The failure taxonomy

Know the modes:

Provider outage. OpenAI / Anthropic down. Not fixable by you; recoverable by you.
Rate limit. You're hitting the provider's limit.
Timeout. Request taking too long.
Validation failure. Output doesn't match your schema.
Content filter. Model refused to answer.
Semantic failure. Output is valid but wrong.
Tool failure. Downstream API broke.
Infinite loop. Agent stuck.

Different modes, different responses.

The three rings of resilience

Ring 1: Prevention. Design to avoid failure.

Validate inputs.
Use structured output constraints.
Cap loops.
Conservative timeouts.

Ring 2: Detection. Notice failure when it happens.

Schema validation on outputs.
Confidence thresholds.
Known-bad pattern detection.

Ring 3: Recovery. Fail gracefully.

Retry with backoff.
Fallback to a cheaper / alternate model.
Degraded UX that still serves value.
Clear error messages that don't leak internals.

Retry strategy

Not all errors retry equally:

Error	Retry?	Notes
Provider 5xx	Yes, exponential backoff	3 attempts typical
Provider 429 rate limit	Yes, honor Retry-After	Don't retry faster
Timeout	Yes, with increased timeout	Exponential
Validation (schema)	Yes, ≤1 retry with stricter prompt	More retries rarely help
Content filter	No, don't retry	Might need reformulating by user
Auth error	No	Fix the key
Tool 4xx	Maybe	Depends on the tool

The cheap-model fallback

For latency- or availability-sensitive endpoints, wire a fallback chain:

Try primary (Claude Sonnet, say).
If unavailable or too slow, try secondary (GPT-4o).
If both fail, try tertiary (gpt-4o-mini).
If all fail, return a graceful error.

Trade-offs: responses across models differ subtly. For critical flows, consistency matters; for best-effort flows, availability wins.

Degraded mode UX

When the AI is unavailable, what does the user see?

A "try again" button? Usable, not great.
A static fallback response? Often better than nothing for common queries.
A "we're experiencing issues" banner with estimated resolution? Honest and calm.
Silent failure? Worst option — user is confused.

Design this explicitly, per endpoint.

Schema validation and re-ask

When the model returns invalid JSON or wrong fields:

Validate immediately with Zod / Pydantic.
Log the invalid output.
Re-ask the model with: "Your previous response didn't match the expected schema. Please return a response matching exactly: [schema]. Your previous response: [response]".
If the second attempt also fails, return a structured error to the caller.

One retry is worth it. Beyond one retry is usually chasing a broken prompt.

Bounded agent loops

Every agent needs termination conditions:

Max steps (e.g., 20).
Max time (e.g., 5 minutes).
Max cost (e.g., $1 per run).
Max consecutive failures (e.g., 3 — the agent can't make progress).

Hit any limit → abort with a partial result. Log the reason.

The silent failure

Worst case: AI endpoint returns a plausible wrong answer and users act on it without verification. No error surface, no alert, just wrong decisions shipping.

Mitigations:

Confidence thresholds — low-confidence results flagged or routed to human.
Post-hoc quality sampling — weekly review of random outputs.
User feedback surfaces — easy "this was wrong" button.

The on-call check

Imagine you get paged at 2am: "AI service is down." Can you:

See what's failing (observability)?
Verify the status of upstream providers?
Cut over to fallback or disable gracefully?
Understand what users are seeing?

If not, that's tomorrow's work.

Handling failure gracefully: timeouts, fallbacks, degraded modes

The failure taxonomy

The three rings of resilience

Retry strategy

The cheap-model fallback

Degraded mode UX

Schema validation and re-ask

Bounded agent loops

The silent failure

The on-call check

2-question self-check

Continue in this track

Handling failure gracefully: timeouts, fallbacks, degraded modes

The failure taxonomy

The three rings of resilience

Retry strategy

The cheap-model fallback

Degraded mode UX

Schema validation and re-ask

Bounded agent loops

The silent failure

The on-call check

2-question self-check

Continue in this track