Chain of Thought (CoT) is a prompting strategy first published back in 2022, which by LLM standards is ancient history. And yet it has figured prominently in the announcements and discussions of the recent batch of inference-time-compute-scaled reasoning models, like OpenAI’s o1 and o3 models and DeepSeek’s r1. In particular, r1 exposes its CoT to users, to the delight of many.
This makes it a good time to revisit CoT. As we will see, it’s not just an old prompting strategy. It reveals some important features of LLM behavior that are not often talked about, and it helps explain why reasoning models work the way they do.
If you’re interested in the answer to any of these questions, then you should read this post:
why does CoT prompting work?
why do reasoning models like r1 have such weird-looking thoughts?
what makes some LLMs better than others?
what is a reasoning budget?
why do instructions in prompts sometimes just not take?
origins
CoT was originally conceived of as a prompting technique for LLMs. The idea was that including examples of problems with step-by-step solutions in the prompt would induce the LLM to try to solve the given problem step by step, leading to better results. Otherwise, by default, LLMs would attempt to answer questions in a single step, without any explicit “thinking”. It was later shown that even just prompting the model to “think step by step”, without step-by-step examples, was similarly useful.
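To make the two flavors concrete, here is a minimal sketch of what each looks like. The prompt text is invented for this post (using the multiplication examples discussed later), not taken from the CoT paper.

```python
# Illustrative sketch of the two flavors of CoT prompting.
# The prompt text is made up for this post, not taken from the CoT paper.

# Few-shot CoT: include a worked, step-by-step example so the model
# imitates the decomposition when answering the new question.
few_shot_cot_prompt = """\
Q: What is 78 * 34?
A: 78 * 34 = 78 * 30 + 78 * 4 = 2340 + 312 = 2652. The answer is 2652.

Q: What is 789 * 345?
A:"""

# Zero-shot CoT: no worked example, just an instruction to reason step by step.
zero_shot_cot_prompt = """\
What is 789 * 345?
Let's think step by step."""

print(few_shot_cot_prompt)
print(zero_shot_cot_prompt)
```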
Why does this work?
The original CoT paper states that one of the attractive properties of CoT is that it “allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps” (3).
What exactly does this mean?
If we assume a fixed quantity of computation per step (i.e., chunk of tokens), which is approximately true in a technical sense, then taking more steps to solve a problem means using more computation than would be used to solve it in a single step. If we assume that more computation enables better reasoning, then it follows that using multiple steps (i.e., more tokens) to solve a problem will lead to better results1.
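As a back-of-the-envelope sketch of what that means in practice: a common rough approximation for a dense transformer is about 2 × (parameter count) FLOPs per generated token, ignoring attention over the context. The model size and token counts below are made-up, illustrative numbers.

```python
# Back-of-the-envelope sketch of "more tokens => more computation".
# Assumes the common rough approximation of ~2 * (parameter count) FLOPs
# per generated token for a dense transformer, ignoring attention costs.
# The model size and token counts are made-up, illustrative numbers.

params = 70e9                     # hypothetical 70B-parameter model
flops_per_token = 2 * params      # approximate forward-pass cost per generated token

direct_answer_tokens = 10         # e.g. "The answer is 272205."
cot_answer_tokens = 200           # a step-by-step decomposition

print(f"direct: {direct_answer_tokens * flops_per_token:.2e} FLOPs")
print(f"CoT:    {cot_answer_tokens * flops_per_token:.2e} FLOPs")
# The per-token cost is (approximately) fixed; the CoT answer simply spends
# ~20x more total computation because it emits ~20x more tokens.
```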
what is computation?
But I have always found this framing a little bit curious. Instinctively, I have not pictured the benefit of CoT as being that it allows for more computation, although it is true that it does.
Instead, I have always seen the benefit to be that it requires less computation, but of a different sort. I think in a certain sense this is also correct, and reveals some important properties of how LLMs reason.
Part of the trouble here is what exactly is meant by the word “computation”. If we use the term as strictly as possible, which I think is close to how it’s used in the CoT paper, then it means the actual mathematical operations performed in the LLM’s network to generate the token probabilities of the next token. The amount of computation is approximately equal for each step in token generation, so generating more tokens necessarily requires more computation, and so longer answers (more steps) use more computation.
But it can also be useful, and natural, to think of “computation” as a property of a problem or a sequence of tokens, rather than a property of an LLM. We might say that certain problems require more or less computation, or reasoning, to be solved. Or certain sequences of tokens require more or less computation, or reasoning, to generate the correct probabilities for.
For instance, solving the problem 7 * 3 requires less reasoning than solving the problem 78 * 34, which requires less reasoning than solving 789 * 345, and so on when multiplying increasingly large numbers. In order to generate the correct probabilities for the first token in response to each of these questions, the exact same amount of technical computation is required from the LLM, but the reasoning required is greater for each in turn.
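One crude way to make “reasoning required” concrete for these examples is to count the single-digit multiplications that schoolbook long multiplication would need. This is just an illustrative proxy, not a real measure of reasoning.

```python
# A crude proxy for "reasoning required": the number of single-digit
# multiplications in schoolbook long multiplication. Purely illustrative.

def schoolbook_ops(a: int, b: int) -> int:
    """Count the single-digit multiplications needed for a * b."""
    return len(str(a)) * len(str(b))

for a, b in [(7, 3), (78, 34), (789, 345)]:
    print(f"{a} * {b}: {schoolbook_ops(a, b)} digit multiplications")
# 1, 4, and 9 respectively: the reasoning needed to produce the first answer
# token grows, while the LLM's computation per token stays roughly constant.
```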
reasoning budgets
So how does this relate to CoT requiring less computation, or reasoning?
We can make much better sense of how LLMs work, and how CoT helps them, if we introduce the concept of a reasoning budget. A reasoning budget is the amount of reasoning that a given LLM can successfully execute over a sequence of tokens.
Sticking to the example of multiplication problems, one LLM might have a reasoning budget that allows it to reliably solve two-digit multiplications in a single reasoning step, but not three-digit multiplications, whereas another LLM’s reasoning budget might be enough to reliably solve four-digit multiplications in a single step. In this scenario, the first LLM has a smaller reasoning budget, and the second has a larger reasoning budget2.
Equipped with the concept of a reasoning budget, we can understand why CoT works. CoT prompting induces an LLM to decompose a problem into smaller, constituent subproblems. By definition, these subproblems require less reasoning to solve than the original problem.
If the original problem exceeds an LLM’s reasoning budget, then it will not be able to solve it in a single step. But if each subproblem of the CoT decomposition is within the LLM’s reasoning budget, then it will be able to solve each subproblem, and in turn, the original problem.
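Here is a sketch of what that decomposition looks like for the multiplication example. The claim that each subproblem fits within a given model’s budget is, of course, hypothetical; the point is just that each step is a strictly smaller problem.

```python
# Sketch of a CoT decomposition for the multiplication example.
# Each subproblem (a small multiplication plus a running sum) is plausibly
# within a weaker model's per-step reasoning budget even when the full
# 789 * 345 is not. The budget framing here is hypothetical.

def decompose(a: int, b: int) -> list[str]:
    """Break a * b into partial products by the digits of b."""
    steps = []
    for power, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10**power
        steps.append(f"{a} * {digit} * 10^{power} = {partial}")
    steps.append(f"sum of partial products = {a * b}")
    return steps

for step in decompose(789, 345):
    print(step)
```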
That is what I mean by CoT requiring less computation to solve a problem. If we think of computation as the amount of reasoning required to solve a problem, then using CoT allows an LLM to use less computation per step in order to solve a problem.
However, over the entirety of the response, CoT does allow the LLM to do more computation, both in the technical definition and in the reasoning definition. This has already been described for the technical definition. In the reasoning definition, the accumulation of smaller amounts of reasoning leads to more total reasoning than the LLM could accomplish in a single step. That is, by doing less reasoning at a time, the LLM can successfully reason more over the entirety of a response.
more reasoning budgets
If CoT prompting works because it allows LLMs to allocate more computation to a problem by spending more tokens on more reasoning steps, then it raises some interesting questions:
shouldn’t an inferior LLM be able to do anything a superior LLM can do, as long as it breaks the problem down into enough steps?
is the only limit to an LLM’s reasoning capabilities the number of tokens it can output?
I think that, naively, the principle of CoT implies that the answer to both questions above is “yes”. But in practice, it’s obvious that the answer is “no”.
Inferior LLMs clearly have a limit to how much reasoning they can successfully execute over a single response, and they usually hit it well before reaching their maximum output token limit. They usually run into trouble somewhere along the way: losing focus, ignoring instructions, forgetting relevant details, and so on3.
In fact, we can think of this limit as another reasoning budget. We might then say that there are two types of reasoning budgets:
a per-token reasoning budget, which is what was described in the previous section. This is how much reasoning an LLM can accomplish per token. In practice, meaningful reasoning usually plays out over several tokens (a reasoning step, or a premise), so we might also speak of a per-step reasoning budget, but that can always be expressed as a per-token budget, and tokens are easier to count and segment.
a per-prompt reasoning budget, which is the limit just described: how much reasoning an LLM can execute across an entire output sequence in response to a prompt, regardless of how many tokens that sequence contains (see the toy sketch just below).
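To make the interaction between the two budgets concrete, here is a toy formalization. The numbers are hypothetical and “reasoning units” are not actually measurable (see footnote 2); the point is only how the two limits combine.

```python
# Toy formalization of the two reasoning budgets. The numbers are hypothetical
# and "reasoning units" are not actually measurable; this only illustrates how
# the two limits combine.

from dataclasses import dataclass

@dataclass
class ReasoningBudget:
    per_step: float    # max reasoning per step (standing in for per-token)
    per_prompt: float  # max total reasoning across the whole response

def can_solve(step_costs: list[float], budget: ReasoningBudget) -> bool:
    """A plan succeeds only if every step fits the per-step budget
    AND the total fits the per-prompt budget."""
    return (all(c <= budget.per_step for c in step_costs)
            and sum(step_costs) <= budget.per_prompt)

weak_model = ReasoningBudget(per_step=2, per_prompt=10)

print(can_solve([9], weak_model))           # False: one big leap exceeds per-step
print(can_solve([2, 2, 2, 2], weak_model))  # True: CoT keeps each step in budget
print(can_solve([2] * 12, weak_model))      # False: total exceeds per-prompt
```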
The per-prompt reasoning budget explains much of what makes some LLMs better than others. Better LLMs have high per-prompt reasoning budgets, and so they are able to solve more complex tasks and handle more nuance. Worse LLMs aren’t worse just because they need problems to be broken down more; there’s a limit to how much they can do even with smaller steps.
applications
Per-token and per-prompt reasoning budgets help explain certain behaviors of LLMs, and in turn, are very useful concepts for getting better results out of LLMs.
The per-token reasoning budget is useful for understanding why CoT works, as explained above. But it’s also useful for making the best use of CoT.
Merely breaking a problem down into smaller steps is not, by itself, enough for CoT to succeed. The per-token reasoning budget concept predicts that CoT will only work if each step in the chain is within the LLM’s per-token reasoning budget.
Understanding this is of immense value for practical LLM use. In another post I will discuss practical prompting in more detail, but for now I will just say that one of the most important generally applicable prompting strategies is to lay out the specific CoT that you want the LLM to follow in order to solve a problem4. To do this well, you need to understand what the LLM’s per-token reasoning budget is. If you decompose the problem into steps that are too complex, the LLM will still get the answer wrong. Ideally, you identify steps that are as large as possible while still fitting within the LLM’s per-token reasoning budget, so you get the correct response while maximizing token efficiency.
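Here is a hedged sketch of what laying out the CoT yourself can look like. The task, step granularity, and wording are all invented for illustration; the idea is that each numbered step should fit inside the model’s per-token (per-step) reasoning budget, so the model never has to make a large leap on its own.

```python
# Illustrative prompt that lays out the CoT explicitly. The task and steps are
# invented; the step sizes are meant to stay within a (hypothetical) per-step
# reasoning budget.

prompt_template = """\
You are reviewing a customer support ticket. Follow these steps in order,
writing out your work for each step before moving on:

1. Quote the one or two sentences that state the customer's actual problem.
2. List the product features mentioned in those sentences.
3. For each feature, state whether its documented behavior matches what the
   customer describes.
4. Based only on steps 1-3, classify the ticket as BUG, USER_ERROR, or
   FEATURE_REQUEST, with a one-sentence justification.

Ticket:
{ticket_text}
"""

print(prompt_template.format(ticket_text="<ticket goes here>"))
```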
Similarly, the per-prompt reasoning budget is also useful for understanding LLMs and getting more out of them. If you have done much complex prompting, you will have noticed that at some point additional instructions in a prompt just don’t take. You can add more rules, more details, explain edge cases, whatever, but when you read the LLM’s output, it’s as though it’s ignoring them. Or the outputs vary across repeated sampling, sometimes following some subset of the instructions, other times not.
This is readily apparent with smaller models, e.g., 4o-mini, where it’s quite easy to exceed the model’s per-prompt reasoning budget. Naturally, the reasoning budget of a model like 4o is much higher, but it’s not that hard to exceed either, if you are doing the right type of work.
Being able to recognize when an LLM has exceeded its per-prompt reasoning budget is extremely useful. This is the point at which no further instructions can be productively added to a prompt5. If you can recognize this point, then you will know that more prompting won’t get you anywhere and you’ll need to try other solutions, like using a better model if available, or prompt chaining or fine-tuning.
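As an example of the prompt chaining option, here is a minimal sketch of splitting one oversized task into two calls, so that no single response has to exceed its per-prompt budget. `call_llm` is a hypothetical stand-in for whatever client you actually use, not a real API.

```python
# Minimal prompt-chaining sketch. `call_llm` is a hypothetical placeholder,
# not a real API; replace it with your actual client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your actual LLM client call")

def extract_then_classify(ticket_text: str) -> str:
    # Call 1: a narrow extraction task, well within the per-prompt budget.
    facts = call_llm(f"List the factual claims in this support ticket:\n{ticket_text}")
    # Call 2: classification only sees the distilled facts, so the reasoning
    # required in this single response is much smaller.
    return call_llm("Given these claims, classify the ticket as BUG, "
                    f"USER_ERROR, or FEATURE_REQUEST:\n{facts}")
```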
reasoning CoTs
CoT prompting is so obviously effective that it was quickly absorbed into common practice, and alongside that, LLMs were trained to produce CoT outputs even when not prompted to do so. It became so ubiquitous that it was practically imperceptible. So it was curious to see CoT return to the spotlight with the recent advent of reasoning models like OpenAI’s o1 and o3, and DeepSeek’s r1.
The original o1 announcement in September 2024 was rather light on technical details, but did mention “chain of thought” many times. And indeed it was evident that something new was going on with the chains of thought produced by the o1 models.
What immediately struck me at the time was that I had never seen an LLM remain so persistently focused on a problem, trying new solutions, reconsidering its assumptions, and so on, for so many tokens in a row. I hadn’t yet named the concept at the time, but now it’s clear to me that one important property of the CoTs from reasoning models is a greatly increased per-prompt reasoning budget relative to previous frontier LLMs.
But the other distinctive feature of reasoning models’ CoTs is the nature of the reasoning steps themselves. They are clearly quite different from the CoTs of models like 4o: much more organic and human-sounding. The significance of that didn’t quite register with me until I saw this post from an OpenAI researcher, who, certainly not coincidentally, was the author of the original CoT paper:
There is a nuanced but important difference between chain-of-thought before and after o1.
Before the o1 paradigm (i.e., chain-of-thought prompting), there was a mismatch between what chain of thought was and what we wanted it to be. We wanted chain of thought to reflect the thinking process of the model, but what the model was really doing was just imitating reasoning paths that it had seen in pretraining, e.g., math homework solutions. The problem with this type of data is that it is a post-hoc solution summarized after the author did all the work somewhere else, and not really a sequence of thoughts. So the solutions often had poor information density, with an egregious example being things like “The answer is 5 because…”, where the token “5” has a huge amount of new information.
With the o1 paradigm, you can see that the chain of thought looks very different from a textbook math solution (you can view examples in the blog post). These chains of thought are kinda like “inner monologue” or “stream of consciousness”. You can see the model backtracking; it says things like “alternatively, let’s try” or “wait, but”. And I have not measured directly, but I would wager a bet (my psycholinguistics friends would probably be able to confirm) that the information density is *much* more uniform in the chain of thought than average text on the internet.
The key is in the last sentence: “the information density is *much* more uniform in the chain of thought than average text on the internet.” We might put this in other terms: CoTs from reasoning models make much better use of per-token reasoning budgets.
Let’s think about it like this: the researcher is saying that old CoT had unevenly distributed information density. In order to produce a token with high information density, you need to do a lot of reasoning. Producing a token with too much information density relative to the preceding tokens may exceed an LLM’s per-token reasoning budget, and in that case it gets the wrong answer or goes off track.
But if instead the information density is evenly distributed throughout the LLM’s output, this solves that problem. It means the LLM continuously does small amounts of reasoning, taking small steps through reasoning space, discovering small amounts of information at a time, occasionally changing direction. It never attempts to make huge leaps in reasoning, and so it is much less likely to exceed its per-token reasoning budget.
Old CoTs, by contrast, as explained in the post above, were attempting to replicate inauthentic reasoning exemplars from their training data. These exemplars were often not representative of a natural process of gradual discovery, but rather an artificial set of steps designed to lead to a predetermined answer, with the result that certain reasoning steps were irreproducible when attempted from a state of authentic reasoning. Training and prompting models on this certainly worked better than trying to answer complex questions in a single step, but it would still sometimes lead to the LLM trying to do too much reasoning in its smaller intermediate steps.
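To illustrate the contrast, here are two invented transcripts for the same problem. Neither is a real model output; the second is only meant to echo the backtracking, small-step texture described in the quoted post.

```python
# Two invented transcripts for the same problem, illustrating the
# information-density point. Neither is a real model output.

textbook_style = """\
The answer is 272205, because 789 * 345 = 789 * 300 + 789 * 45
= 236700 + 35505 = 272205."""
# Nearly all of the new information lands on "272205", stated up front:
# a single leap that can easily exceed a per-token reasoning budget.

reasoning_style = """\
I need 789 * 345. Let me split 345 into 300 + 40 + 5.
789 * 300 = 236700. 789 * 40 = 31560. Wait, let me check: 789 * 4 = 3156,
so 789 * 40 = 31560, yes. 789 * 5 = 3945.
236700 + 31560 = 268260. 268260 + 3945 = 272205.
So the answer is 272205."""
# The same information arrives in many small increments, none of which
# requires a large leap of reasoning.

print(textbook_style)
print(reasoning_style)
```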
whence reasoning CoTs?
But how did reasoning models learn to reason in the new way? I discussed this at length in my last post about o1 and scaling inference time compute, where I also made some observations that formed the basis for the present post. But since then, a lot more has come out about reasoning models—both the arrival of DeepSeek r1 and its report, as well as lots of chatter from researchers and users.
I can’t possibly treat that topic in its entirety in this post, but the key relevant point is that it appears that this new type of CoT is an emergent property of reinforcement learning. When LLMs are given space to roam with their tokens to answer verifiable questions and are rewarded accordingly, the strategy they converge upon is what we see in the r1 CoTs: many small, exploratory reasoning steps.
I think that reasoning budgets are a good way of accounting for much of what we see. The remarkable per-prompt reasoning budget of reasoning models is a sign of greater reasoning capability. What’s interesting is that we don’t necessarily see a higher per-token reasoning budget. In fact, what we see—smaller reasoning steps—almost implies the opposite: a smaller per-token reasoning budget. But the other way to think of it is that the models learn to reason within their per-token reasoning budgets more consistently: meta-intelligence.
what’s next
To me, it’s clear that reasoning budgets help us think about, and potentially answer, some important questions about LLMs, like why CoT works, why prompts fail, and why reasoning models behave the way they do.
But there is still a lot we don’t know. Certainly, reasoning budgets do not explain everything—at a minimum I don’t think the two types of reasoning budget are completely independent, and the type of reasoning being done probably matters. And for reasoning models, there are a lot of other interesting features in the CoTs that appear to be orthogonal to reasoning budgets (e.g., why the obsession with “wait”?).
In upcoming posts I will apply these ideas in more detail to practical prompting strategies, as it’s one area where I see a lack of useful resources. Most prompting resources are tragically superficial, and much of the commentary I see even from researchers suggests to me that few have actually tried stretching prompting to its limit on reasoning tasks.
And of course, there will always be much more to say on reasoning models and other LLM advances.
Happy o3-mini day!
1. Important: it’s not simply more tokens. In the CoT paper they do an ablation study prompting the LLM to just produce ‘…’ as extra tokens. This shows no improvement over the baseline. So the extra computation needs to be in the form of relevant tokens. In this post, when I say that more tokens/computation lead to better results, I mean tokens/computation used to carry out relevant, intermediate reasoning steps, not just any old tokens/computation thrown at a problem.
2. Of course, it’s difficult to quantify reasoning in this way, but the principle is intuitive. You could run an experiment with multiplication problems and different-sized LLMs and estimate it roughly, though certainly reasoning budgets are not the same across all classes of problems.
3. Actually, if you have any examples of an LLM displaying sustained, good reasoning over tens of thousands of tokens, please show me!
4. Of course, this is only applicable for problems where you know the solution shape ahead of time. This is often the case in industry applications, where the principal benefit of LLMs is the provision of scaled intelligence, not of superhuman, nor even isohuman, intelligence.
5. Important: this assumes your prompt is actually good, e.g., the steps it elicits from the LLM are within its per-token reasoning budget. Big if. I’m not taking long-term clients right now but have some availability for one-off consulting.