Tokens, context windows, and why your prompts get cut off
The mechanics of context — and how to reason about fit, cost, and truncation.
Context windows got bigger, but the mechanics didn't change. Understanding how context works — and where it quietly breaks — is what separates reliable prompts from prayers.
What the context window actually is
Think of the context window as a fixed-size buffer that the model reads in full on every call. If your system prompt is 2,000 tokens, your chat history is 6,000 tokens, and you paste in a 3,000-token document, the model processes 11,000 input tokens on every generation. Input and output tokens are billed separately, and because the whole history is resent with each call, the input cost adds up fast on long threads.
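The arithmetic above can be sketched in a few lines. This is a minimal illustration, not any provider's billing logic: the function name, the ten-turn thread, and the 600-token turn size are all assumptions made for the example.

```python
def cumulative_input_tokens(system_tokens, document_tokens, turn_tokens):
    """Sum the input tokens sent over a whole thread.

    Assumes every call resends the system prompt, the document, and all
    history so far (a hypothetical model of a stateless chat API).
    """
    total = 0
    history = 0
    for tokens in turn_tokens:
        history += tokens  # the history grows by one turn
        total += system_tokens + document_tokens + history
    return total


# Single call from the example above: system prompt + history + document.
single_call = 2_000 + 6_000 + 3_000

# Ten turns of 600 tokens each (assumed sizes) with the same prompt and document.
thread_total = cumulative_input_tokens(2_000, 3_000, [600] * 10)

print(single_call)   # 11000
print(thread_total)  # 83000
```

Because each call resends every earlier turn, total input grows roughly with the square of the number of turns, not linearly.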
The three failure modes
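- Truncation. If your conversation exceeds the window, something has to give. Many chat applications silently drop or summarize the oldest messages (a raw API call that is too long is typically rejected with an error instead). Once the early messages are gone, the model "forgets" what it agreed to ten messages ago. This is one reason long chat sessions degrade.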
- Middle forgetting (the "lost in the middle" effect). Even within the window, many models attend less reliably to content near the middle of the context. Content at the very start and very end tends to get more attention.
- Output starvation. If you fill 120k of a 128k window with input, the model has at most 8k tokens left to respond. Budget input and output together.
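One way to make truncation explicit and avoid output starvation is to trim the history yourself, with output room reserved up front. The following is a minimal sketch, assuming each message already carries a precomputed token count. `fit_history` and the message shape are hypothetical and not part of any SDK.

```python
def fit_history(messages, window, system_tokens, reserve_output):
    """Keep the newest messages that fit the budget; return (kept, dropped).

    The budget is the window minus the system prompt and the space
    reserved for the model's reply.
    """
    budget = window - system_tokens - reserve_output
    if budget < 0:
        raise ValueError("system prompt plus output reserve exceed the window")

    kept = []
    used = 0
    # Walk from newest to oldest so the most recent turns survive.
    for message in reversed(messages):
        if used + message["tokens"] > budget:
            break
        kept.append(message)
        used += message["tokens"]
    kept.reverse()  # restore chronological order

    dropped = messages[: len(messages) - len(kept)]
    return kept, dropped


# Four 3,000-token messages (assumed sizes) against a 16k window.
history = [{"id": i, "tokens": 3_000} for i in range(4)]
kept, dropped = fit_history(history, window=16_000, system_tokens=2_000, reserve_output=4_000)

print([m["id"] for m in kept])     # [1, 2, 3]
print([m["id"] for m in dropped])  # [0]
```

Returning the dropped messages means the caller can log or summarize them, so nothing disappears silently.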
Budgeting context like a resource
Pick one of these shapes and don't drift between them:
- Fat prompt, short response: classification, extraction, rating.
- Short prompt, fat response: writing, brainstorming, code generation.
- Balanced: most chat assistants.
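The three shapes can be written down as an explicit split of the window. This is only a sketch: the shape names and percentages below are illustrative assumptions, not recommendations from any provider, and real models may cap output tokens separately.

```python
# Share of the window given to input for each shape.
# These percentages are illustrative assumptions.
INPUT_SHARE_PCT = {
    "fat_prompt": 90,    # classification, extraction, rating
    "fat_response": 25,  # writing, brainstorming, code generation
    "balanced": 50,      # most chat assistants
}


def split_budget(window, shape):
    """Return (input_budget, output_budget) in tokens for the chosen shape."""
    input_budget = window * INPUT_SHARE_PCT[shape] // 100
    return input_budget, window - input_budget


print(split_budget(128_000, "fat_prompt"))    # (115200, 12800)
print(split_budget(128_000, "fat_response"))  # (32000, 96000)
print(split_budget(128_000, "balanced"))      # (64000, 64000)
```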
What "long context" models really give you
Claude, Gemini, and GPT models with 1M+ token windows are useful for three things:
- Analyzing entire codebases in one shot instead of a RAG pipeline.
- Long documents — reading a book-length report and answering questions about it.
- Preserved chat state — not having to summarize a long session.
They're not a replacement for retrieval when your knowledge changes or is too large to fit. They are also expensive: with per-token pricing, a 500k-token query costs roughly 500× as much in input tokens as a 1k-token one.
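The cost comparison and the "does it fit" question can both be checked with simple arithmetic. This is a rough sketch assuming flat per-token pricing. The price below is a made-up placeholder, and `prefer_long_context` is a hypothetical helper that encodes only the two conditions named above.

```python
def input_cost_usd(tokens, usd_per_million_tokens):
    """Input cost under flat per-token pricing (an assumption; real pricing may be tiered)."""
    return tokens * usd_per_million_tokens / 1_000_000


def prefer_long_context(corpus_tokens, window, reserve_output, changes_often):
    """True only if the corpus fits alongside the output reserve and is stable."""
    fits = corpus_tokens + reserve_output <= window
    return fits and not changes_often


PRICE = 3.0  # hypothetical USD per million input tokens

big = input_cost_usd(500_000, PRICE)
small = input_cost_usd(1_000, PRICE)
print(round(big / small))  # 500

print(prefer_long_context(500_000, 1_000_000, 8_000, changes_often=False))    # True
print(prefer_long_context(2_000_000, 1_000_000, 8_000, changes_often=False))  # False
print(prefer_long_context(500_000, 1_000_000, 8_000, changes_often=True))     # False
```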
In practice
Before every API call, ask yourself: what's in my context, and why? If you can't answer that in one sentence, you've probably got 5,000 tokens of chat noise the model is paying attention to instead of your actual ask.
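A quick audit makes that question concrete. The sketch below assumes you already know the token count of each part of the context. `context_report` and the part names are hypothetical.

```python
def context_report(parts):
    """Return (name, tokens, percent_of_total) rows, largest part first."""
    total = sum(parts.values())
    if total == 0:
        return []
    rows = sorted(parts.items(), key=lambda item: item[1], reverse=True)
    return [(name, tokens, round(100 * tokens / total, 1)) for name, tokens in rows]


report = context_report({"system": 2_000, "history": 6_000, "document": 3_000})
for name, tokens, share in report:
    print(f"{name:<10}{tokens:>7}{share:>7}%")
```

If the largest row is not the thing you are actually asking about, that is the part to trim first.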
Check your understanding
2-question self-check
Q1. What is the "lost in the middle" effect?
Q2. Why does long chat history degrade model quality over time?