Tokens, context windows, and why your prompts get cut off

Context windows got bigger, but the mechanics didn't change. Understanding how context works — and where it quietly breaks — is what separates reliable prompts from prayers.

What the context window actually is

Think of the context window as a fixed-size buffer the model reads in full every time you call it. If your system prompt is 2,000 tokens, your chat history is 6,000 tokens, and you paste in a 3,000-token document, the model processes 11,000 input tokens on every generation. Input tokens and output tokens are billed separately, and because the whole buffer is re-read on each call, the input cost adds up fast on long threads.
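To make the arithmetic concrete, here is a minimal sketch of the per-call math. The two prices are made-up placeholders, not any provider's real rates, and the function names are just for illustration.

```python
# Hypothetical prices for illustration only; check your provider's rate card.
INPUT_PRICE_PER_MTOK = 3.00    # dollars per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.00  # dollars per million output tokens (assumed)


def input_tokens_per_call(system: int, history: int, document: int) -> int:
    """Everything in the window is re-read on every generation."""
    return system + history + document


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call, with input and output billed separately."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


tokens = input_tokens_per_call(system=2_000, history=6_000, document=3_000)
print(tokens)  # 11000
print(call_cost(tokens, output_tokens=500))
```

Run this once per turn of a long thread and the history term keeps growing, which is where the input bill comes from.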

The three failure modes

Truncation. If your conversation exceeds the window, something has to give. Many chat applications silently drop the oldest messages, while a raw API call will typically fail with an error instead. When old messages are dropped, the model "forgets" what it agreed to ten messages ago. This is one reason long chat sessions degrade.
Middle forgetting (the "lost in the middle" effect). Even within the window, many models attend less reliably to content near the middle of the context. Content at the very start and very end gets more attention, so that is where critical instructions belong.
Output starvation. If you fill 120k of a 128k window with input, the model has only about 8k tokens left to respond. Budget input and output together.
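One way to handle truncation and output starvation together is to trim history yourself, with room for the response set aside first. The sketch below assumes a crude four-characters-per-token estimate in place of a real tokenizer, and `trim_history` is a hypothetical helper, not part of any SDK.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def trim_history(
    messages: list[str],
    window: int,
    system_tokens: int,
    reserved_output: int,
    count: Callable[[str], int] = rough_token_count,
) -> list[str]:
    """Keep the newest messages that fit after the system prompt and output room are set aside."""
    budget = window - system_tokens - reserved_output
    if budget < 0:
        raise ValueError("system prompt plus reserved output exceeds the window")
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count(message)
        if used + cost > budget:
            break  # this message and everything older is dropped on purpose
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # roughly 10 tokens each
trimmed = trim_history(history, window=50, system_tokens=10, reserved_output=20)
print(len(trimmed))  # 2
```

Dropping oldest-first is only one policy. Summarizing the dropped messages, or pinning key decisions near the start or end of the context, are alternatives that also help with middle forgetting.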

Budgeting context like a resource

Pick one of these shapes and don't drift between them:

Fat prompt, short response: classification, extraction, rating.
Short prompt, fat response: writing, brainstorming, code generation.
Balanced: most chat assistants.

What "long context" models really give you

Claude, Gemini, and GPT models with 1M+ token windows are useful for three things:

Analyzing entire codebases in one shot instead of a RAG pipeline.
Long documents — reading a book-length report and answering questions about it.
Preserved chat state — not having to summarize a long session.

They're not a replacement for retrieval when your knowledge changes or is too large to fit. And they are expensive: at flat per-token pricing, a 500k-token query costs about 500× as much in input as a 1k-token one.
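Both checks can be done before you send anything. This is a rough sketch under two assumptions: pricing is flat per token, and you already know approximate token counts. The function names are illustrative only.

```python
def fits_in_window(corpus_tokens: int, question_tokens: int,
                   reserved_output: int, window: int) -> bool:
    """True if the whole corpus, the question, and room for the answer all fit."""
    return corpus_tokens + question_tokens + reserved_output <= window


def relative_input_cost(tokens: int, baseline_tokens: int = 1_000) -> float:
    """Input cost relative to a baseline query, assuming flat per-token pricing."""
    return tokens / baseline_tokens


print(relative_input_cost(500_000))  # 500.0
print(fits_in_window(900_000, 200, 8_000, window=1_000_000))  # True
print(fits_in_window(1_200_000, 200, 8_000, window=1_000_000))  # False
```

If the fit check fails, or the corpus changes often, that is the point where retrieval makes more sense than stuffing the window.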

In practice

Before every API call, ask yourself: what's in my context, and why? If you can't answer that in one sentence, you've probably got 5,000 tokens of chat noise the model is paying attention to instead of your actual ask.
