Agent safety and guardrails
Defense in depth for agents that take real actions.
An agent with a
send_emailtool can send email. An agent with arun_sqltool can drop your database. Agent safety is not a product feature — it's a design constraint you build in from turn one.
Defense in depth
No single control is enough. Layer these:
- Capability restriction. The agent only has tools it genuinely needs. Fewer tools = smaller blast radius.
- Parameter constraints. Tools validate inputs server-side — not just trust the model's arguments.
- Authorization gates. For high-impact actions, require human confirmation before execution.
- Output filtering. Scan outputs for patterns you don't want (PII leaks, prompt-injection signals).
- Rate limits. Per-user, per-agent, per-tool. Bounded damage.
- Audit logging. Every tool call, every observation, every human override — logged immutably.
The permissions model
Think of every tool as a permission. Grant sparingly:
- Does the agent need
send_emailorsend_email_to_allowlisted_domains? - Does it need
run_sqlorget_customer_metrics(a specific wrapper)? - Does it need
file.writeorwrite_to_scratch_directory?
Specific wrappers beat general power tools. The model's flexibility is often in the data it handles, not in the operations it performs.
Confirmation gates
For irreversible or high-stakes actions:
- The agent proposes the action.
- The user sees the proposed action (structured, not just natural language).
- The user approves (or the UX makes approval trivially simple).
- Only then does the action execute.
This is the difference between an autonomous agent that can cause incidents and an assistant that can't.
The prompt-injection threat
Covered in the prompt engineering track in detail; for agents specifically:
- Assume every input is hostile. Web pages, documents, tool results — all have been shown to carry injection attacks.
- Don't let inputs change the tool graph. "Oh the system prompt says I can also send emails now" is an injection bypass.
- Isolate untrusted inputs. Summarize long untrusted content before feeding to the reasoning loop.
Irreversible action handling
For any action that's hard to undo:
- Dry-run first. The action produces a preview, not a commit. User (or a different agent) approves. Then commit.
- Idempotency. Make actions idempotent where possible, so retries don't cause double-execution.
- Reversibility via audit. Even if an action can't be auto-reversed, it must be auditable. No silent state changes.
The "what could go wrong" table
Before shipping an agent, make this table:
| Failure mode | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Agent hallucinates a customer ID and emails wrong person | Medium (data leak) | Low | Validate customer_id exists before using |
| Prompt injection via inbound email causes auto-reply to attacker | High | Medium | Isolate email bodies; don't treat them as instructions |
| Agent loops on failing tool and racks up costs | Low | High | Loop cap + cost cap per run |
If you can't fill this in for every tool your agent has, you're not ready to ship.
What responsible production looks like
- Agents run with the narrowest possible tool set.
- High-impact actions gated behind user confirmation.
- Every agent run logged, sampled for review.
- Cost + time limits enforced per run.
- On-call knows what to do when an agent goes sideways.
Check your understanding
2-question self-check
Optional. Your answers feed your knowledge score on the track certificate.
Q1.Which safety principle does the lesson emphasize most?
Q2.What's the rule for irreversible actions?
Continue in this track
More lessons from Building AI Agents.
Lesson 5
Multi-agent systems without the chaos
When multiple agents help, when they don't, and how to coordinate them.
Lesson 6
Evaluating agents (this is hard)
Why agent eval is different from LLM eval, and the harness patterns that work.
Lesson 8
Project: build a research agent end-to-end
Ship a working research agent with tools, memory, and eval.