On September 12th, OpenAI announced and released two models in the new o1 family: o1-preview and o1-mini.
There’s a lot to say about these models, and I think it’s best structured as an FAQ. So here it is.
what are the o1 models?
It’s explained well in the system card OpenAI published:
The o1 large language model family is trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long chain of thought before responding to the user. OpenAI o1-preview is the early version of this model, while OpenAI o1-mini is a faster version of this model that is particularly effective at coding. Through training, the models learn to refine their thinking process, try different strategies, and recognize their mistakes.
The o1 models are trained to be better at reasoning, specifically by producing extensive chains of thought before producing their final answer to the user. In the chain of thought, the models do things like break down problems into smaller components, consider different possibilities, and so on, much like a human might.
o1-preview is an early checkpoint of a larger model (hence "preview"), and the more advanced checkpoint will be released later. o1-mini is a smaller, but fully trained, model.
does o1 do better on evals than GPT-4o?
From a teleological perspective, the o1 models are built to be better reasoners1. On that measure, they appear to be a success: OpenAI's evals show the o1 models to be substantially better than GPT-4o on various math, science, and reasoning (e.g., LSAT) benchmarks.
Reasoning is a pretty central capability of LLMs, but certain tasks require other forms of LLM-cognition and so do not benefit as much from improved reasoning capacity. This is also borne out by the evals, where o1 shows zero or minimal improvement over 4o on benchmarks like AP English Language/Literature and Public Relations. Evidently success in these domains is not helped much by better reasoning, at least as realized by LLMs in chains of thought.
how does it work?
In OpenAI’s technical research post, they explain it like this:
Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason.
In terms of how it works from the user’s perspective, the o1 models produce two separate sets of tokens. The first is the chain of thought tokens, where the model “thinks” about the problem: that’s where it breaks the problem down into smaller steps and tries different approaches. The second is the actual output tokens, where it gives its final answer based on its chain of thought. Both sets of tokens count towards the cost when using the model through the API, but only the final output tokens are shown to the user; the chain of thought tokens remain hidden (for more on this, see later in this FAQ).
However, there is also a third set of tokens, which are a summary of the chain of thought tokens. The user can elect to see the summary of the chain of thought, though I’m not sure how this will work through the API.
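To make the token accounting concrete, here is a minimal sketch of what a call might look like through the API. It assumes the chat completions endpoint and that hidden reasoning tokens are reported under the usage object; the exact field names may differ from what the current docs specify, so treat this as a sketch rather than gospel:

```python
# Minimal sketch: calling o1-preview and inspecting token usage.
# Assumes the openai Python SDK; completion_tokens_details is how I'd expect
# hidden reasoning tokens to be reported, but check the current API docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)

# Only the final answer comes back as message content; the chain of
# thought itself is never returned.
print(response.choices[0].message.content)

usage = response.usage
print("completion tokens billed:", usage.completion_tokens)

# If reported, this shows how many of the billed tokens were hidden reasoning.
details = getattr(usage, "completion_tokens_details", None)
if details is not None:
    print("reasoning tokens:", details.reasoning_tokens)
```

The point of the sketch is just the asymmetry: you pay for the reasoning tokens, but you only ever see the final answer (and, in ChatGPT, a summary).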
can you explain how it works in your own words?
OpenAI is (unsurprisingly) skimpy on the technical details, but here’s how I would explain it based on what they’ve said and some reasonable inferences of my own.
There are two basic ideas behind the o1 models. The first is chain of thought as a reinforcement learning strategy. Chain of thought is a well known prompting strategy where you instruct the model to think step by step, which reliably yields better results on reasoning problems (more on this later in the FAQ). To some extent, LLMs have already been trained on this strategy, as they will often approach certain tasks, such as simple word problems, with an explicit step by step strategy. What OpenAI appears to have done with the o1 models is train them much more extensively on more complex chains of thought. This should make them better at reasoning through a wider and more complex set of tasks.
The second idea is giving the model time to think. In their announcement, OpenAI writes:
We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).
By giving the model more time (really, by allowing it to generate more chain of thought tokens) before producing its final output, you get better results. This is what OpenAI calls “test-time compute”, i.e., the amount of computation (roughly, the number of tokens generated) at test time, as opposed to training time. The major research finding here is that test-time compute scales similarly to train-time compute: just as training a model for longer produces a better model, allowing the model to think for longer produces better outputs.
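OpenAI controls this thinking budget internally, but the shape of the claim can be probed crudely from the outside. Here is a hedged sketch, assuming max_completion_tokens is the right knob for capping an o1-style model's total generation (hidden reasoning plus visible answer) and using a made-up two-item eval set; it is not OpenAI's methodology:

```python
# Illustrative sketch only: score a tiny eval set while capping total
# completion tokens (hidden reasoning + visible answer) at different budgets.
# Assumes the openai SDK and the max_completion_tokens parameter; small
# budgets can cut the thinking short and leave no visible answer at all.
from openai import OpenAI

client = OpenAI()

eval_set = [  # made-up examples with easily checkable answers
    ("What is 17 * 24?", "408"),
    ("How many prime numbers are there below 30?", "10"),
]

def solve(question: str, budget: int) -> str:
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": question}],
        max_completion_tokens=budget,
    )
    return response.choices[0].message.content or ""  # None if the budget ran out

def accuracy(budget: int) -> float:
    correct = sum(expected in solve(q, budget) for q, expected in eval_set)
    return correct / len(eval_set)

# The reported scaling is roughly log-linear: each doubling of the thinking
# budget buys a similar accuracy increment, up to a point.
for budget in (1_000, 2_000, 4_000, 8_000, 16_000):
    print(budget, accuracy(budget))
```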
what data is o1 trained on?
From the system card:
The two models were pre-trained on diverse datasets, including a mix of publicly available data, proprietary data accessed through partnerships, and custom datasets developed in-house
They say a little bit more about the data in the system card but nothing of any real substance.
I assume that the vast majority of the chain of thought data was synthetically generated, and that synthetic data generation process is, to me, the most interesting unknown detail behind the model. Presumably they prompted GPT-4o to produce chains of thought, probably multiple chains of thought for the same problem using different strategies, and perhaps iteratively expanded those chains as well.
But of course, high quality chains of thought from domain experts would likely be the most valuable data. Unfortunately we don’t know what that looks like or who they would have gotten it from.
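To make the speculation concrete, here is a sketch of the shape such a pipeline might take. This is purely my guess, not anything OpenAI has described: the prompts, the strategy list, and the filtering step are all hypothetical, and a real pipeline would need a far stronger verifier than a substring match.

```python
# Speculative sketch of a synthetic chain-of-thought generation loop;
# my guess at the shape of the pipeline, not OpenAI's actual method.
from openai import OpenAI

client = OpenAI()

STRATEGIES = [  # hypothetical strategy prompts to diversify the chains
    "Work the problem step by step from first principles.",
    "Try a concrete example first, then generalize.",
    "Work backwards from what the answer must look like.",
]

def generate_chains(problem: str, n_per_strategy: int = 2) -> list[str]:
    """Sample several chains of thought for one problem, a few per strategy."""
    chains = []
    for strategy in STRATEGIES:
        for _ in range(n_per_strategy):
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": f"Think out loud. {strategy}"},
                    {"role": "user", "content": problem},
                ],
                temperature=1.0,  # diversity across samples
            )
            chains.append(response.choices[0].message.content)
    return chains

def keep_good_chains(chains: list[str], answer: str) -> list[str]:
    """Keep only chains that reach the known-correct answer.
    A real pipeline would need a much stronger verifier than this."""
    return [c for c in chains if c and answer in c]
```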
why chain of thought?
Chain of thought is a popular prompting strategy (see here for the first paper on it; a quick Google search will yield countless papers that follow it). Prompting a model to work through a problem step by step and make its reasoning explicit gives better results than simply commanding the model to solve the problem. Chain of thought prompting can take the form of example problems worked through step by step, but even the bare instruction to “think step by step”, without any examples, has been shown to be very effective.
One way of thinking about the o1 models is that the chain of thought prompting strategy has been trained into them, so that you can get those better results “for free”, i.e., without needing to prompt the model to do anything special.
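For anyone who hasn’t used the technique, here is what the zero-shot version looks like in practice with a non-o1 model. The word problem is made up, and which model you call is incidental:

```python
# A quick illustration of zero-shot chain-of-thought prompting with a
# non-o1 model; the word problem is just a made-up example.
from openai import OpenAI

client = OpenAI()
problem = (
    "A train leaves at 2:40 pm and arrives at 6:10 pm. "
    "It stops for 25 minutes along the way. How long is it actually moving?"
)

# Plain prompt: the model may jump straight to an answer.
plain = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": problem}],
)

# Chain-of-thought prompt: the extra instruction reliably improves
# accuracy on reasoning problems like this one.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": problem + "\n\nThink step by step."}],
)

print(plain.choices[0].message.content)
print(cot.choices[0].message.content)
```

With o1, the claim is that the second call’s trick (and far more elaborate versions of it) is already baked in, so the plain first call should behave more like the second.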
In this connection it’s probably worth mentioning (if I may boast) that the strategy behind the o1 models is essentially what I predicted in May would happen with future LLMs:
More training data, more parameters, and some other assuredly clever innovations will of course be helpful. But plausibly, better fine-tuning is the only necessary solution. […] It’s not hard to envision a fine-tuning process that would more reliably elicit these [better reasoning] capabilities. […] I expect that the major LLM providers are already working on this […] somehow “folding in” the performance gains from the well-known task-agnostic [prompting] strategies [including chain of thought] […] I expect that their future counterparts will be fine-tuned on data using any analogous reasoning strategies
My point in that post was that prompting strategies like chain of thought are task-agnostic: they generally yield better results regardless of what the task is. Therefore, it makes sense to train models to adopt those strategies without needing the prompting, and that appears to be more or less what the o1 models are the result of.
is o1 just GPT-4o with chain of thought prompts?
If the o1 models are essentially just GPTs trained on GPT outputs prompted with chain of thought instructions, then maybe it’s not really any different than just prompting GPT to use a chain of thought. Is that true?
OpenAI claims it is not. According to this, OpenAI said that GPT-4o cannot match o1’s CoT performance through prompting alone2.
Even setting aside what OpenAI says, the examples provided in the technical post make me believe the claim. I was particularly struck by the chain of thought in the Cipher example: I was stunned at how long it is, not because I can’t imagine why it needed to be so long, but because I have never seen an LLM focus so doggedly on a task, ever. Even when given meticulous step-by-step instructions to follow, in my experience LLMs tend to rush through the steps and ignore details. Clearly, the o1 models are capable of producing far longer and more complex chains of thought than mere prompting could achieve. This is exciting news for anyone working on complex reasoning tasks with LLMs.
is it just chain of thought?
The term “chain of thought” in the context of LLMs is so heavily associated with the prompting strategy that it’s important to make a distinction here. Although the idea is similar, based on OpenAI’s description and the examples they provide, the chains of thought produced by o1 are not instances of a strict “chain of thought” in the prompting sense.
Again, the Cipher example is illuminating, because it shows that the model is not merely breaking down the problem into a couple of steps, it is really thinking about the problem the way a human would: yes, thinking through smaller steps, but also considering different ideas, reflecting on them, noticing when it’s confused, and so on. In other words, the o1 chains of thought appear to be much closer to a “chain of thought” in the conventional, non-technical sense.
But the connection to prompting strategies is still worth considering. My qualitative sense is that nearly all reasoning-specific prompting techniques can be reduced to attempts to get LLMs to produce outputs that optimize for two things: path adherence and multiple path exploration.
I may expand on this more in a later post, but the basic idea is this: one way of viewing the problem LLMs have to solve is as finding (one of) the correct sequence(s) of tokens from the set of all possible sequences of tokens. The possibility space is nearly infinite, and the solution space is vanishingly small. A good prompt is good precisely insofar as it causes the LLM to produce one of the sequences of tokens in the solution space. To maximize the probability of producing those sequences, you obviously have to provide the task definition and relevant context in the prompt; I think of this as making sure the model starts on the right path in its token generation.

The problem is that while there are many possible correct ways to start a path, it’s very easy to diverge from them as you continue generating tokens. That is where path adherence comes in: prompting strategies like chain of thought keep the LLM on a correct path once it has started, by forcing it to make what I’ve been calling “minimal inferences”. By breaking a problem down into smaller problems, the LLM can make smaller inferences at each step, and is thus more likely to execute each step correctly and stay on the correct path.

This is good as far as it goes, but by committing to a single path you greatly reduce the probability of finding one that lands in the solution space. That is where the second part comes in: multiple path exploration (or, simply, search). By prompting the LLM to explore multiple paths (another popular prompting strategy, e.g., tree of thought), you increase the probability of finding a correct path.
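A minimal prompt-level instance of multiple path exploration is self-consistency-style sampling: generate several independent chains of thought and take the majority answer. The sketch below is just that prompting-side analogue, assuming the standard chat completions API; it is not a claim about how o1 explores paths internally.

```python
# Sketch of multiple path exploration at the prompt level (self-consistency
# style): sample several independent chains of thought at high temperature
# and majority-vote the final answers. Not how o1 works internally.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_answer(problem: str) -> str:
    """One path: a chain of thought ending in a parseable final answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": problem
            + "\n\nThink step by step, then give the final answer on the "
              "last line in the form 'ANSWER: <answer>'.",
        }],
        temperature=0.8,  # diversity, so the paths actually differ
    )
    text = (response.choices[0].message.content or "").strip()
    lines = text.splitlines()
    last = lines[-1] if lines else ""
    return last.removeprefix("ANSWER:").strip()

def self_consistent_answer(problem: str, n_paths: int = 5) -> str:
    """Explore several paths and keep the answer most of them agree on."""
    answers = [sample_answer(problem) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```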
Although I usually think of this in terms of prompting strategies, that behavior can also be induced in an LLM by fine-tuning/reinforcement learning. The o1 chains of thought appear to be a more extensive application of these principles, adaptive to the particular problem it’s faced with, as learned from training.
So, to put it succinctly: no, it’s not just chain of thought prompting, it’s more like the application of various reasoning strategies—but trained to do them with greater complexity and persistence across a variety of problems than a single prompt could achieve.
do you prompt it the same way as other LLMs?
I am seeing conflicting things on this.
OpenAI says here:
These models perform best with straightforward prompts. Some prompt engineering techniques, like few-shot prompting or instructing the model to "think step by step," may not enhance performance and can sometimes hinder it.
I can see why “think step by step” (the archetypal chain of thought prompt) may not enhance performance if the model is already fine-tuned to do this, but why would it hinder it? Intuitively, I would expect such instructions to be redundant at worst, not counterproductive. What’s going on here? I don’t know yet.
This is somewhat concerning to me, as in my professional life I have to write prompts that provide extremely detailed reasoning instructions. Although I do instruct the model to think step by step, I also meticulously lay out what those steps are. I’m a little bit concerned that the o1 models may, in a sense, take reasoning into their own hands: ignore the substantive instructions in a prompt, focus solely on the basic question being asked, and use their chain of thought to develop their own strategy for answering it. That is fine, and even good, in cases where you don’t have those steps already figured out; that’s the whole point of the model generating its own chain of thought. But if I know how the problem should be solved, then I want the model to follow those steps, not come up with its own.
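To make the contrast concrete, here is a hypothetical pair of prompts for the same made-up task; the worry is that an o1-style model given the second prompt might quietly fall back to its own version of the first.

```python
# Hypothetical contrast between the two prompting styles discussed above.
# The task and the steps are invented for illustration.

# Style 1: the "straightforward prompt" OpenAI recommends for o1.
straightforward_prompt = (
    "Does this support ticket describe a billing error? "
    "Answer yes or no and explain briefly.\n\n"
    "Ticket: {ticket_text}"
)

# Style 2: the meticulous, step-prescribing prompt I tend to write for 4o.
step_by_step_prompt = (
    "Determine whether this support ticket describes a billing error.\n"
    "Follow these steps exactly:\n"
    "1. Quote every sentence that mentions a charge, invoice, or refund.\n"
    "2. For each quoted sentence, state whether it alleges an incorrect amount.\n"
    "3. If any sentence alleges an incorrect amount, answer yes; otherwise no.\n\n"
    "Ticket: {ticket_text}"
)
```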
On the other hand, the fact that the chain of thought examples shared by OpenAI show the o1 models staying on track for a very long time and recognizing their own mistakes is promising. That is exactly what I would like these models to do in the use cases I’m talking about3.
But then there’s this tweet that summarizes what OpenAI said on their o1 AMA:
- o1 benefits from prompting styles that provide edge cases or reasoning styles
- o1 models are more receptive to reasoning cues in prompts compared to earlier models
This seems to contradict the statement directly from OpenAI above. But it’s also pretty vague—I’m not sure what exactly is meant by “reasoning styles” or “reasoning cues”, for example.
So, for now, I’m not sure what the best prompting strategies will be for o1 models. My intuition is that they won’t be radically different, but that is something we’ll have to learn as we use these new models.
is o1 actually better than 4o?
The evals (see above) show that o1 is better in certain domains but offers little to no improvement in others.
But what are actual users saying? There is a good collection here of reactions to o1.
I haven’t yet used it enough myself to form a strong opinion from my own experience, but my overall sense from OpenAI’s evals and from reading what others have said is something like this: o1 represents a noticeable improvement in reasoning over 4o, especially in math/science/coding, but it’s clearly not an immense leap forward. It feels like 4o executing very well on certain types of problems, not like a categorically better model in the way GPT-4 was compared to GPT-3/3.5.
That being said, o1-preview is an early checkpoint of the model, so we know a better model is coming, though I suspect that will also not be perceived as an immense leap forward.
why isn’t OpenAI revealing the chain of thought tokens?
Here’s what they say in their technical research post:
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
So, the main reason is ostensibly that it’s important to keep the chain of thought in its unaltered form in order to monitor the model, but that means the chain of thought may not be compliant with OpenAI’s policy / safe for consumption, so they don’t want to show it to the user. They then also sneak in “competitive advantage”, which may play a larger role than they’re admitting.
I was initially surprised to see that they were hiding the chain of thought tokens, because those tokens are the whole point of the model. But after reading their reasoning, I think the decision makes sense from a safety/alignment perspective. I don’t like it from a practical user perspective, but I understand why they’re doing it.
The main potential problem I see for the types of things I work on is that my prompts are often structured in response to the kinds of mistakes I see the LLM make. With non-o1 models you see everything the model outputs, so you know more or less what it’s thinking and why, and can adjust your prompt to correct that thinking. But with the o1 models, which think far more than other models, not being able to see the chain of thought means you’re flying blind. It’s not totally clear how reliable or useful the summaries of the chain of thought will be. Maybe they will be close enough, but it’s far from ideal.
what does o1 mean for frontier models?
The most important LLM releases in the past few years have either introduced superior capabilities across the board, or reduced costs for already attained capabilities. The o1 release is unique in that it is definitively not an increase in capabilities across the board, and it’s more expensive.
In this way it’s kind of a lateral move. It is not GPT-5, it’s not Claude 4, etc. It’s a different type of model that is better at one thing: reasoning. To be fair, reasoning is extremely important for LLMs; it’s perhaps their most important capability and helps with many different tasks. So if you’re going to release an LLM that’s an improvement in just one capability, reasoning makes sense.
But it raises the question of what’s coming from other LLM providers like Anthropic. If/when OpenAI releases GPT-5, I assume that Anthropic will have Claude 4 out the door not too long after. But I do not assume that Anthropic is about to release a1. It’s not that they couldn’t build that type of model—they easily could. But it would be, again, a lateral move from the pattern of better or cheaper models. Perhaps we are witnessing the beginning of the proliferation of more specialized models from the major LLM providers. Or perhaps o1 will fizzle out, never to be followed by o2, or be subsumed by GPT.
what does it all mean?
Here are my big picture thoughts:
- o1 is extremely interesting from the research side, as LLM reasoning is to me the most interesting and important aspect of LLMs. I would love to know more about how they generated the data and the RL strategy.
- The test-time compute finding is partly important (it scales with the log of compute, just like train-time compute) but also partly a no-brainer: we already knew that letting LLMs think more gets better results, because that’s what nearly all the good prompting strategies have converged on!
- Practically speaking, o1 represents a clear improvement in reasoning-heavy domains, though not at the level of a categorically superior model like GPT-5 would be. But it remains to be seen exactly how impactful o1 will be. If you have any insights, let me know.
- Will o1 change the competitive landscape, and influence the work other companies are doing?
1. What is the telos of other LLMs? This probably deserves a “history of language models” post to really answer. If you look at what they actually do, which is merely generate tokens one at a time, then in a sense their telos is next-token prediction or traditional language modeling. But if you do this well enough by training on enough data and instruction-tuning and RLHF, then language modeling actually ends up being good at knowledge retrieval and executing text-based tasks and reasoning.
2. Although, the tricky and interesting thing about LLMs is that their “true” capabilities are to a certain extent always unknown, because they’re dependent on the prompt, and prompt optimization is not a solved problem.
3. I haven’t yet tested o1 on those types of problems as I’m on vacation in the wilds of [REDACTED], but I am eager to see how it goes once I’m back to work.