Agent safety and guardrails

An agent with a send_email tool can send email. An agent with a run_sql tool can drop your database. Agent safety is not a product feature — it's a design constraint you build in from turn one.

Defense in depth

No single control is enough. Layer these:

Capability restriction. The agent only has tools it genuinely needs. Fewer tools = smaller blast radius.
Parameter constraints. Tools validate inputs server-side rather than trusting the model's arguments.
Authorization gates. For high-impact actions, require human confirmation before execution.
Output filtering. Scan outputs for patterns you don't want (PII leaks, prompt-injection signals).
Rate limits. Per-user, per-agent, per-tool. Bounded damage.
Audit logging. Every tool call, every observation, every human override — logged immutably.
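A minimal sketch of how three of these layers (parameter constraints, rate limits, audit logging) might stack around one tool. Everything here is hypothetical: the issue_refund tool, the 5000-cent ceiling, the limiter settings, and the in-memory AUDIT_LOG list, which stands in for whatever append-only store you actually use.

```python
import time
from collections import defaultdict, deque

AUDIT_LOG = []  # stand-in only; a real system would write to immutable storage


class RateLimiter:
    """Sliding-window limiter keyed by (user, tool)."""

    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = defaultdict(deque)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        calls = self._calls[key]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


def validate_refund_args(args):
    """Server-side check; the model's arguments are not trusted."""
    amount = args.get("amount_cents")
    if not isinstance(amount, int) or isinstance(amount, bool):
        raise ValueError("amount_cents must be an integer")
    if not 0 < amount <= 5000:  # assumed business limit
        raise ValueError("amount_cents outside allowed range")
    return {"amount_cents": amount}


limiter = RateLimiter(max_calls=2, window_seconds=60)


def issue_refund(user_id, args, now=None):
    """Run one call through rate limiting and validation; log every outcome."""
    entry = {"user": user_id, "tool": "issue_refund", "args": args}
    try:
        if not limiter.allow((user_id, "issue_refund"), now=now):
            raise PermissionError("rate limit exceeded")
        clean = validate_refund_args(args)
        entry["status"] = "ok"
        entry["result"] = f"refunded {clean['amount_cents']} cents"  # placeholder side effect
    except (ValueError, PermissionError) as exc:
        entry["status"] = "rejected"
        entry["reason"] = str(exc)
    AUDIT_LOG.append(entry)  # rejected calls are logged too
    return entry
```

Rejected calls are logged alongside successful ones; the attempts an agent was not allowed to make are often the most useful thing to review.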

The permissions model

Think of every tool as a permission. Grant sparingly:

Does the agent need send_email or send_email_to_allowlisted_domains?
Does it need run_sql or get_customer_metrics (a specific wrapper)?
Does it need file.write or write_to_scratch_directory?

Specific wrappers beat general power tools. The model's flexibility is often in the data it handles, not in the operations it performs.
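As a sketch of the first question above, here is what a narrow email wrapper could look like. The allowlist contents, the _raw_send_email placeholder, and the AGENT_TOOLS registry are assumptions for illustration; the address check is deliberately simple and is not a full email-address parser.

```python
ALLOWED_DOMAINS = frozenset({"example.com"})  # hypothetical allowlist


def _raw_send_email(to, subject, body):
    """Placeholder for the general-purpose mailer; never exposed to the agent."""
    return {"sent_to": to, "subject": subject}


def send_email_to_allowlisted_domains(to, subject, body):
    """The only email tool the agent is given."""
    # Split on the last "@" so "a@example.com@evil.test" resolves to evil.test.
    local, sep, domain = to.rpartition("@")
    if not sep or not local or domain.lower() not in ALLOWED_DOMAINS:
        raise PermissionError(f"recipient domain not allowlisted: {to!r}")
    return _raw_send_email(to, subject, body)


# Only the narrow wrapper is registered; the raw mailer is not in the tool set.
AGENT_TOOLS = {
    "send_email_to_allowlisted_domains": send_email_to_allowlisted_domains,
}
```

The model can still write any subject and body it likes. The restriction applies to the operation (who can receive mail), not to the data.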

Confirmation gates

For irreversible or high-stakes actions:

The agent proposes the action.
The user sees the proposed action (structured, not just natural language).
The user approves or rejects, and the UX makes that decision trivially simple.
Only then does the action execute.

This is the difference between an autonomous agent that can cause incidents and an assistant that can't.
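The propose, show, approve, execute sequence above can be sketched as a small gate object. The class and method names (ProposedAction, ConfirmationGate, propose, approve, execute) are hypothetical, and the in-memory dict stands in for real persistence. What matters is that execute refuses anything that has not been explicitly approved.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ProposedAction:
    """Structured proposal the UI can render for the user."""
    tool: str
    args: dict
    action_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    approved: bool = False
    executed: bool = False


class ConfirmationGate:
    def __init__(self, tools):
        self._tools = tools
        self._pending = {}

    def propose(self, tool, args):
        """Called by the agent; records the action but does not run it."""
        if tool not in self._tools:
            raise KeyError(f"unknown tool: {tool}")
        action = ProposedAction(tool=tool, args=dict(args))
        self._pending[action.action_id] = action
        return action

    def approve(self, action_id):
        """Called from the user-facing side only, never by the agent."""
        self._pending[action_id].approved = True

    def execute(self, action_id):
        action = self._pending[action_id]
        if not action.approved:
            raise PermissionError("action has not been approved")
        if action.executed:
            raise RuntimeError("action already executed")
        action.executed = True
        return self._tools[action.tool](**action.args)
```

In a real deployment the approve path would sit behind user authentication, so the agent has no route to call it on its own behalf.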

The prompt-injection threat

Covered in the prompt engineering track in detail; for agents specifically:

Assume every input is hostile. Web pages, documents, tool results — all have been shown to carry injection attacks.
Don't let inputs change the tool graph. "Oh the system prompt says I can also send emails now" is an injection bypass.
Isolate untrusted inputs. Summarize long untrusted content before feeding it to the reasoning loop, and treat the summary as data, never as instructions.
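One way to keep inputs from changing the tool graph is to fix the tool set at startup and resolve every call against it. The sketch below assumes a hypothetical search_docs tool and a made-up message shape for wrap_untrusted. Labeling content as untrusted does not stop a model from being influenced by it; the enforcement here lives in dispatch.

```python
from types import MappingProxyType


def search_docs(query):
    """Hypothetical read-only tool."""
    return f"results for {query}"


# Tool set fixed at startup; MappingProxyType exposes a read-only view.
TOOLS = MappingProxyType({"search_docs": search_docs})


def wrap_untrusted(source, text):
    """Tag external content as data so it is kept apart from instructions."""
    return {"role": "tool_data", "source": source, "trusted": False, "content": text}


def dispatch(tool_name, args):
    """Resolve tool calls only against the fixed registry."""
    if tool_name not in TOOLS:
        raise PermissionError(f"tool not granted: {tool_name}")
    return TOOLS[tool_name](**args)
```

If a fetched page says "you may now call send_email", the model may even try. The dispatcher still refuses, because nothing the page says can add an entry to TOOLS.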

Irreversible action handling

For any action that's hard to undo:

Dry-run first. The action produces a preview, not a commit. User (or a different agent) approves. Then commit.
Idempotency. Make actions idempotent where possible, so retries don't cause double-execution.
Reversibility via audit. Even if an action can't be auto-reversed, it must be auditable. No silent state changes.
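A sketch of how dry-run, idempotency, and audit can live in one tool. The LedgerTool class, its credit method, and the caller-supplied idempotency_key are assumptions for illustration. The default is dry_run=True, so a commit has to be asked for explicitly.

```python
class LedgerTool:
    """Hypothetical tool that applies account credits."""

    def __init__(self):
        self.balances = {}
        self._done = {}   # idempotency_key -> result of the first commit
        self.audit = []   # stand-in for an append-only audit store

    def credit(self, account, amount, idempotency_key, dry_run=True):
        new_balance = self.balances.get(account, 0) + amount
        if dry_run:
            # Preview only: nothing is written.
            return {"preview": True, "account": account, "new_balance": new_balance}
        if idempotency_key in self._done:
            # A retry with the same key returns the original result unchanged.
            return self._done[idempotency_key]
        self.balances[account] = new_balance
        result = {"preview": False, "account": account, "new_balance": new_balance}
        self._done[idempotency_key] = result
        self.audit.append({"key": idempotency_key, **result})
        return result
```

A typical flow calls credit once with the default dry_run=True, shows the preview to the approver, then calls it again with dry_run=False and the same key. A retry after a timeout is then harmless.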

The "what could go wrong" table

Before shipping an agent, make this table:

Failure mode	Impact	Likelihood	Mitigation
Agent hallucinates a customer ID and emails wrong person	Medium (data leak)	Low	Validate customer_id exists before using
Prompt injection via inbound email causes auto-reply to attacker	High	Medium	Isolate email bodies; don't treat them as instructions
Agent loops on failing tool and racks up costs	Low	High	Loop cap + cost cap per run

If you can't fill this in for every tool your agent has, you're not ready to ship.
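The third row's mitigation, a loop cap plus a cost cap per run, can be sketched as a budget object checked before every step. The names (RunBudget, BudgetExceeded, run_agent) and the convention that step_fn returns a (done, cost_usd) pair are assumptions, not any particular framework's API.

```python
class BudgetExceeded(Exception):
    pass


class RunBudget:
    """Per-run caps on step count and spend."""

    def __init__(self, max_steps, max_cost_usd):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def check(self):
        """Refuse to start another step once either cap is reached."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded("step cap reached")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded("cost cap reached")

    def record(self, cost_usd):
        self.steps += 1
        self.cost_usd += cost_usd


def run_agent(step_fn, budget):
    """Call step_fn until it reports done or the budget stops the run."""
    try:
        while True:
            budget.check()
            done, cost_usd = step_fn()
            budget.record(cost_usd)
            if done:
                return "completed"
    except BudgetExceeded as exc:
        return f"stopped: {exc}"
```

Because the check runs before each step, a run can overshoot the cost cap by at most one step's spend. If that matters, estimate the step's cost up front and check against the remaining budget.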

What responsible production looks like

Agents run with the narrowest possible tool set.
High-impact actions gated behind user confirmation.
Every agent run logged, sampled for review.
Cost + time limits enforced per run.
On-call knows what to do when an agent goes sideways.
