Chain-of-thought: when it helps and when it hurts

Chain-of-thought prompting used to be a superpower. Now it's a default in many models — and in some cases it's actively making things worse. Here's when to use it, when to skip it, and when to trust the model's internal reasoning instead.

What chain-of-thought actually does

"Think step by step before answering" prompts the model to generate reasoning tokens before the answer. That reasoning lives in the context, and the final answer is generated with those reasoning tokens in view. The model effectively uses its own output as scratch space.

For hard problems — multi-step math, complex logical inference, ambiguous classification — this consistently improves quality. For easy problems, it adds latency and can introduce errors by "overthinking."

The 2026 wrinkle: reasoning models

OpenAI's o-series, Claude's extended thinking, Gemini's thinking mode: these models do chain-of-thought internally and either keep it out of the final response or surface it separately. When you're using a reasoning model, adding "think step by step" to your prompt is redundant at best and confusing at worst.

Rule of thumb:

Reasoning model? Don't CoT-prompt. Let the model's built-in reasoning handle it.
Non-reasoning model on a hard problem? Ask for step-by-step reasoning.
Non-reasoning model on an easy problem? Skip it — the latency isn't worth it.
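
The rule of thumb above can be written as a small prompt-building helper. This is a minimal sketch: the function name and the two boolean flags are hypothetical, and deciding whether a task counts as "hard" is left to you.

```python
COT_SUFFIX = "Think step by step before answering."


def build_prompt(task: str, is_reasoning_model: bool, is_hard: bool) -> str:
    """Append a CoT instruction only for a non-reasoning model on a hard task."""
    if is_reasoning_model or not is_hard:
        # Reasoning models already think internally; easy tasks don't need it.
        return task
    return f"{task}\n\n{COT_SUFFIX}"


if __name__ == "__main__":
    print(build_prompt("Is this email spam? Answer yes or no.", False, False))
    print(build_prompt("Solve this multi-step word problem: ...", False, True))
```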

When CoT hurts

Simple classification. "Is this spam?" doesn't need reasoning; it needs a 1-token answer.
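Tight latency budgets. Each reasoning token can cost roughly 50-150ms of decode time, depending on the model and serving stack, and the user waits through all of them before the answer appears. A six-paragraph CoT adds real UX cost.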
Over-cautious outputs. CoT can talk the model into finding edge cases that don't matter, producing hedged, unhelpful answers.
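
To make the latency point concrete, here is a back-of-the-envelope estimate. The 400-token length for a six-paragraph CoT is an assumption made for illustration, and the per-token figures are the article's rough range; measure your own deployment and plug in those numbers instead.

```python
def added_latency_seconds(reasoning_tokens: int, ms_per_token: float) -> float:
    """Extra wait caused by reasoning tokens generated before the answer."""
    return reasoning_tokens * ms_per_token / 1000.0


if __name__ == "__main__":
    assumed_cot_tokens = 400  # hypothetical length of a six-paragraph CoT
    for ms_per_token in (50, 150):
        extra = added_latency_seconds(assumed_cot_tokens, ms_per_token)
        print(f"{ms_per_token} ms/token -> {extra:.0f} s of added latency")
```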

Patterns that sharpen CoT when you do use it

Bound the reasoning length. "Think in 3-5 short steps, then give the answer." Prevents reasoning diarrhea.
Separate reasoning from output. Ask for <reasoning> tags followed by <answer> tags. Makes it easy to parse and optionally hide the reasoning.
Reverse CoT for critique. "State your answer. Then state the strongest argument against it. Revise if the counter-argument is valid." Produces more balanced results than forward CoT.
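
The first two patterns combine naturally: bound the steps in the prompt, then parse the tags out of the response. A minimal sketch, assuming the model follows the tag format; the template wording and the helper name are illustrative, and a real system should handle the case where the tags are missing.

```python
import re
from typing import Optional

PROMPT_TEMPLATE = (
    "{question}\n\n"
    "Think in 3-5 short steps inside <reasoning> tags, "
    "then give only the final answer inside <answer> tags."
)


def extract_tag(text: str, name: str) -> Optional[str]:
    """Return the stripped contents of the first <name>...</name> pair, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    # A hand-written response standing in for real model output.
    response = (
        "<reasoning>1. 17 is odd.\n2. It has no divisor up to 4.</reasoning>\n"
        "<answer>prime</answer>"
    )
    print(extract_tag(response, "answer"))     # show this to the user
    print(extract_tag(response, "reasoning"))  # log it or hide it
```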

Self-consistency

A research-era trick: sample several independent CoT reasoning paths for the same question (at a nonzero temperature, so they differ), then pick the majority answer. Expensive in tokens, but it genuinely helps on reasoning tasks. Some production systems use it for very-high-stakes decisions where latency doesn't matter.
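
The voting step is simple enough to sketch. Here `sample_answer` is a hypothetical callable that runs one CoT generation and returns only the extracted final answer; the canned sampler below stands in for a real model call.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample_answer: Callable[[], str], n_samples: int = 5
) -> Tuple[str, float]:
    """Sample n final answers and return the majority answer with its vote share."""
    answers = [sample_answer().strip().lower() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


if __name__ == "__main__":
    # Canned answers standing in for five sampled CoT runs.
    canned = iter(["42", "42", "41", "42", "40"])
    answer, agreement = self_consistent_answer(lambda: next(canned), n_samples=5)
    print(answer, agreement)
```

A low vote share is itself a useful signal: it suggests the question is ambiguous or the model is unsure, which may be worth routing to a human.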

In practice

Run the A/B test on your own task. Take around 20 representative eval cases with known good answers. Run each one twice, once with CoT and once without. Compare quality, latency, and cost. Many teams discover their CoT wasn't actually helping.
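
A minimal harness for that comparison might look like the sketch below. The `model` callable is a hypothetical wrapper around whatever API you use, and exact-match scoring is an assumption that only suits short answers. Token cost is not measured here; it would come from your provider's usage data.

```python
import time
from typing import Callable, Dict, List, Tuple

COT_SUFFIX = "\n\nThink step by step before answering."


def run_variant(
    model: Callable[[str], str], cases: List[Tuple[str, str]], suffix: str = ""
) -> Dict[str, float]:
    """Run every (question, expected) case and report accuracy and mean latency."""
    correct = 0
    start = time.perf_counter()
    for question, expected in cases:
        output = model(question + suffix)
        correct += int(output.strip() == expected)
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(cases), "avg_latency_s": elapsed / len(cases)}


if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        # Stand-in for a real model call; always answers "4".
        return "4"

    cases = [("What is 2+2?", "4"), ("What is 3+3?", "6")]
    print("without CoT:", run_variant(fake_model, cases))
    print("with CoT:   ", run_variant(fake_model, cases, suffix=COT_SUFFIX))
```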
