What is a large language model, really?

Strip the hype. At the hardware level an LLM is doing one thing: guessing the next token. Everything you find impressive about it is built on top of that single operation.

Tokens, not words

When you send "How do I fine-tune a model?" to a language model, it doesn't see seven words. It sees a sequence of tokens — roughly sub-word units. "fine-tune" might be two tokens. A rare word can be four. Your API bill, your latency, and your context limits are all measured in tokens.

A useful calibration: 1 token ≈ 0.75 English words. A page of dense prose is about 500 tokens. GPT-4-class models handle ~128k tokens of context — that's a 200-page book.

The single operation

Given a sequence of tokens, the model outputs a probability distribution over every token in its vocabulary (typically 100k+ options). It picks one, appends it, and repeats. That's the whole mechanism.

The "intelligence" you see is an emergent property of doing next-token prediction very well, across trillions of training examples. There is no separate "reasoning" module. When you ask it to solve a puzzle, the reasoning happens in the tokens it generates — which is why telling it to "think step by step" actually helps.

Why this mental model matters

Context is everything. The model has zero memory between calls. Anything it "knows" about your task must be in the prompt or the tool results you feed back.
It doesn't know what it doesn't know. It predicts confident-sounding text. Calibration (knowing when it's unsure) is a weak point — one you have to engineer around.
Output shape drives output quality. If you want JSON, ask for JSON with a schema. If you want a list, ask for a list. The prompt tells the model what tokens look "probable" next.

In practice

Next time a model gives you a wrong answer, don't debug by re-phrasing with more superlatives. Ask: did my prompt give it enough context to produce the right next tokens? Nine times out of ten, that's the real question.

Strip the hype. At the hardware level an LLM is doing one thing: guessing the next token. Everything you find impressive about it is built on top of that single operation.

Tokens, not words

A useful calibration: 1 token ≈ 0.75 English words. A page of dense prose is about 500 tokens. GPT-4-class models handle ~128k tokens of context — that's a 200-page book.

The single operation

Why this mental model matters

Context is everything. The model has zero memory between calls. Anything it "knows" about your task must be in the prompt or the tool results you feed back.
It doesn't know what it doesn't know. It predicts confident-sounding text. Calibration (knowing when it's unsure) is a weak point — one you have to engineer around.
Output shape drives output quality. If you want JSON, ask for JSON with a schema. If you want a list, ask for a list. The prompt tells the model what tokens look "probable" next.

What is a large language model, really?

Tokens, not words

The single operation

Why this mental model matters

In practice

3-question self-check

Continue in this track

What is a large language model, really?

Tokens, not words

The single operation

Why this mental model matters

In practice

3-question self-check

Continue in this track