Project: build a research agent end-to-end

This lesson is the capstone — a working research agent you can study, modify, and ship. It's deliberately small; production agents are built by iterating on a working base, not by architecting a perfect one upfront.

The spec

Build an agent that:

Takes a research question as input.
Plans 3-5 subtopics to investigate.
For each subtopic, searches the web and reads a few sources.
Synthesizes findings into a structured brief.
Cites its sources.

The tool graph

Only three tools needed:

web_search(query) — returns a list of {title, url, snippet}.
fetch_page(url) — returns the cleaned text of a web page.
finish(brief) — ends the agent with the final output.

That's it. Resist the urge to add more.

The loop

system prompt: you are a research assistant. produce a 400-600 word
brief with citations on the question below. use the tools iteratively.
when done, call finish(brief).

user: <research question>

[loop]
  model generates: tool call or finish
  if tool call: execute, append result to context, continue
  if finish: return brief
  if step count > 20 or cost > $1.00: abort with partial result

Key decisions worth getting right

Planning up front vs. as-you-go. Start with a plan-and-execute shape — model produces a list of subtopics first, then works through them. More coherent than a pure ReAct loop on this task.

Source diversity. If the first two results are the same site, encourage the agent to seek different perspectives. Add a rule: "prefer sources from different domains when possible."

Citation format. Decide upfront: inline [1], [2] with a references section? Or hyperlinks? Consistency matters more than format choice.

Token budget per subtopic. If the brief is 500 words, each subtopic gets ~100 words. Tell the model that. Agents tend to over-write without explicit length rules.

Eval set

Even for this small project, build 10-20 test questions. Graded by:

Coverage — did the brief address the question?
Factuality — do the cited sources support the claims? (spot-check.)
Conciseness — is it in the requested length?
Source diversity — do citations span multiple domains?

Run the set weekly as you iterate.

What you'll hit

Dead-end searches. The first search returns garbage. The agent needs a re-query strategy.
Conflicting sources. The agent sometimes picks one and ignores the other. Better: note the conflict in the brief.
Hallucinated citations. A classic failure — the model invents sources. Mitigation: validate every URL exists before including in the brief.
Over-long output. The model ignores the length rule. Enforce with a post-check that trims if over budget.

Where to take it

Once the basic version works:

Memory. Remember what was researched before; don't repeat.
Expert models. Use a stronger model for synthesis, a cheaper one for initial search.
Human review checkpoint. After the plan, pause for user edits before executing.
Interactive. Let the user ask follow-up questions that reuse the research context.

Each of these is 20-100 lines of code on top of the basic version.

The meta-lesson

Your first agent won't be your best. It'll reveal the next five things to improve. Ship the scrappy version first; iterate from real usage, not from imagination.

This lesson is the capstone — a working research agent you can study, modify, and ship. It's deliberately small; production agents are built by iterating on a working base, not by architecting a perfect one upfront.

The spec

Build an agent that:

Takes a research question as input.
Plans 3-5 subtopics to investigate.
For each subtopic, searches the web and reads a few sources.
Synthesizes findings into a structured brief.
Cites its sources.

The tool graph

Only three tools needed:

web_search(query) — returns a list of {title, url, snippet}.
fetch_page(url) — returns the cleaned text of a web page.
finish(brief) — ends the agent with the final output.

That's it. Resist the urge to add more.

The loop

system prompt: you are a research assistant. produce a 400-600 word
brief with citations on the question below. use the tools iteratively.
when done, call finish(brief).

user: <research question>

[loop]
  model generates: tool call or finish
  if tool call: execute, append result to context, continue
  if finish: return brief
  if step count > 20 or cost > $1.00: abort with partial result

Key decisions worth getting right

Planning up front vs. as-you-go. Start with a plan-and-execute shape — model produces a list of subtopics first, then works through them. More coherent than a pure ReAct loop on this task.

Source diversity. If the first two results are the same site, encourage the agent to seek different perspectives. Add a rule: "prefer sources from different domains when possible."

Citation format. Decide upfront: inline [1], [2] with a references section? Or hyperlinks? Consistency matters more than format choice.

Token budget per subtopic. If the brief is 500 words, each subtopic gets ~100 words. Tell the model that. Agents tend to over-write without explicit length rules.

Eval set

Even for this small project, build 10-20 test questions. Graded by:

Coverage — did the brief address the question?
Factuality — do the cited sources support the claims? (spot-check.)
Conciseness — is it in the requested length?
Source diversity — do citations span multiple domains?

Run the set weekly as you iterate.

What you'll hit

Dead-end searches. The first search returns garbage. The agent needs a re-query strategy.
Conflicting sources. The agent sometimes picks one and ignores the other. Better: note the conflict in the brief.
Hallucinated citations. A classic failure — the model invents sources. Mitigation: validate every URL exists before including in the brief.
Over-long output. The model ignores the length rule. Enforce with a post-check that trims if over budget.

Where to take it

Once the basic version works:

Memory. Remember what was researched before; don't repeat.
Expert models. Use a stronger model for synthesis, a cheaper one for initial search.
Human review checkpoint. After the plan, pause for user edits before executing.
Interactive. Let the user ask follow-up questions that reuse the research context.

Each of these is 20-100 lines of code on top of the basic version.

The meta-lesson

Your first agent won't be your best. It'll reveal the next five things to improve. Ship the scrappy version first; iterate from real usage, not from imagination.

Project: build a research agent end-to-end

The spec

The tool graph

The loop

Key decisions worth getting right

Eval set

What you'll hit

Where to take it

The meta-lesson

2-question self-check

Continue in this track

Project: build a research agent end-to-end

The spec

The tool graph

The loop

Key decisions worth getting right

Eval set

What you'll hit

Where to take it

The meta-lesson

2-question self-check

Continue in this track