When OpenAI announced the launch of the o1 models in September, the excitement was not so much about the better performance of the models (the gains are impressive in some areas, and modest to non-existent in others), but more about the new scaling paradigm the models represent. Specifically, these graphs in the research post:
The laws governing train-time compute are well-known, but, to my knowledge, most work on eliciting better performance from LLMs at test-time was not happening in a scaling paradigm. There were plenty of things you could do at test-time to get better performance (e.g., prompting strategies, repeated sampling, verification, etc.) but there were not generally accepted empirical scaling laws for test-time compute.
The graphs above indicate that scaling laws for test-time compute have arrived, though the details are, of course, not revealed by OpenAI.
In this post we’ll look at what makes o1's test-time compute scaling significant, what we know about how it works, what questions remain unanswered, and how we might go about answering them.
Note: the first few sections below explain train-time compute and test-time compute. If you know all this and want to skip ahead to the meatier sections on o1, I recommend jumping to the section titled what we know about o1.
what is test-time compute?
It’s easier if we start with train-time compute. Train-time compute is the total amount of computational work used to train a model, usually measured in floating point operations (FLOPs). The paper Scaling Laws for Neural Language Models introduced empirical laws showing how model performance (measured as cross-entropy loss on the test set) scales predictably with the amount of compute used during training, alongside dataset size and model size.
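For reference, the compute law in that paper takes a simple power-law form. The shape below is the one reported in the paper; the exponent is quoted approximately from memory, so treat it as illustrative rather than exact:

```latex
% Approximate compute scaling law from Scaling Laws for Neural Language Models:
% test cross-entropy loss L as a function of (optimally allocated) training
% compute C, for a fitted constant C_c and exponent alpha_C.
L(C) \approx \left( \frac{C_c}{C} \right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
```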
The crucial part is the “predictably”: given some combination of the three factors (compute, dataset size, model size), you can reliably predict the model’s performance. That allows you to do things like optimize your dataset size and model size for a given compute budget. It also allows you to train a model 10x bigger than anything that’s ever been seen before, and know ahead of time approximately how good it will be. It’s not so much that more compute is better—that is intuitively obvious, as a general rule—but rather, now we can predict how much better it will make a model.
So, then, what is test-time compute? It is simply the total amount of computational work used at test-time (or at inference-time, if you don’t like to think of everything as being a test). For a given LLM, this amounts to the number of tokens generated. More tokens spent on answering a question mean using more test-time compute. Generating fewer tokens means less test-time compute.
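To put a rough number on it: the common back-of-the-envelope approximation is that a forward pass of a dense transformer costs about 2 FLOPs per parameter per token (ignoring attention overhead at long context lengths), so a crude estimate of test-time compute looks like the sketch below. The model size and token counts are made-up placeholders:

```python
def inference_flops(num_params: float, prompt_tokens: int, generated_tokens: int) -> float:
    """Crude estimate of test-time compute for one response, using the
    ~2 * N FLOPs-per-token approximation for a forward pass of a dense
    transformer with N parameters (attention cost is ignored)."""
    return 2 * num_params * (prompt_tokens + generated_tokens)

# A hypothetical 70B-parameter model: a short answer vs. a long chain of thought.
print(f"{inference_flops(70e9, prompt_tokens=200, generated_tokens=50):.2e} FLOPs")
print(f"{inference_flops(70e9, prompt_tokens=200, generated_tokens=5_000):.2e} FLOPs")
```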
does test-time compute scale predictably?
No. At least, not in a way analogous to the scaling laws for train-time compute, outside of OpenAI’s claims about o1, the details of which are secret.
The problem with test-time compute is that there is not, at present, some simple relationship between it and model performance. You can’t just prompt an LLM to generate more tokens and somehow get better results. (Again, excepting o1, apparently).
This is fundamentally different from training a model. Generally, assuming appropriate dataset and model sizes, you know that you can train a model for some amount of time and see better performance. It won’t improve forever, but the computational work of training the model is precisely the mechanism by which it improves.
But at test-time, the situation is very different. Certain prompting strategies, like chain of thought, produce longer outputs (i.e., more tokens and therefore more computation) and, as a result, reliably produce better answers. It’s probably true that longer chains of thought, up to a limit, produce better results, but that limit is not very high for most models right now. We are not working with orders of magnitude—a 100-step chain of thought produced by a single LLM call is probably rarely, if ever, better than a 10-step chain of thought, because the model will likely get lost and confused somewhere along the way. And a 1000-step chain of thought elicited from a single prompt is pretty much inconceivable, with current models.
So, although there are ways to scale test-time compute right now (see the next section), there is nothing comparable to how train-time compute is scaled.
how do you scale test-time compute?
Even in the absence of empirical scaling laws for test-time compute, you can to a certain extent “scale” it, in the sense that there are ways you can use more test-time compute to get better results. The critical point, again, is that there are no known scaling laws, and the returns fairly quickly become zero, or even negative.
The simplest strategy is chain of thought (CoT) prompting, where you instruct the model to break down its reasoning step by step before arriving at an answer. This works well and is helpful for pretty much any reasoning task1.
CoT prompting works because it allows the model to use more computation (i.e., generate more tokens) to solve the problem. This seems a little bit counterintuitive, because the whole point of CoT is to break the problem down into smaller steps. A smaller step requires less computation than a larger step to solve correctly. But by taking more smaller steps, you use more compute overall than if you were to solve the problem immediately.
But to be more precise, the critical part of CoT is that it allows you to effectively use more computation at test-time. A longer (more computationally expensive) response is not better per se; rather, it’s when the additional compute is applied in a certain way that you get a better result. CoT prompting works when the problem is beyond the model’s capability to answer in a single step, but the smaller steps are within that capability. By proceeding through smaller steps that it can correctly execute, the model can eventually arrive at the final answer—which it could not have arrived at if it attempted it in one step. If the problem is broken down into 5 steps of roughly the same length as a direct answer, then it uses roughly 5 times as much compute as it would if it tried to answer the question right away.
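A minimal illustration of the point, with `llm()` standing in as a placeholder for a single completion call to whatever model you’re using:

```python
def llm(prompt: str) -> str:
    """Placeholder for a single completion call to your model of choice."""
    return "(model output)"

question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

# One step: the model must jump straight to the answer, spending few tokens.
direct = llm(question)

# Many smaller steps: the same model spends several times as many tokens
# (i.e., several times as much test-time compute) on the same problem.
stepwise = llm(
    question
    + "\n\nBreak the problem into steps and work through each one before giving the final answer."
)
```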
With zero-shot CoT prompting (i.e., merely appending the instruction “think step by step”), the model decides how to break the problem down into steps. But this is something the user can also control, either by giving more specific instructions (or examples) in the prompt for how to break the problem down, or by employing another strategy for scaling test-time compute: prompt chaining.
Prompt chaining is another way to apply more compute to a single problem. In a sense, it is very much like a chain of thought—the key differences are that 1) it’s spread over multiple inference calls, whereas CoT prompting is conventionally a single call; and 2) the user directly specifies the structure of the reasoning path. When you chain prompts, you are deciding exactly what each subtask should be, rather than letting the model break the problem down itself (though of course, you could, and probably would, use CoT prompting within a prompt chain; they’re not mutually exclusive).
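Here’s a sketch of the difference in code, with the same placeholder `llm()` helper (repeated so the snippet stands alone). The chain’s structure—restate, solve, check—is something I made up for illustration; the point is just that the user fixes the subtasks and each one is its own inference call:

```python
def llm(prompt: str) -> str:
    """Placeholder for a single completion call to your model of choice."""
    return "(model output)"

def answer_with_chain(question: str) -> str:
    """A hand-designed prompt chain: the user, not the model, decides the subtasks."""
    # Step 1: restate the problem and identify what is being asked.
    restated = llm(f"Restate the following problem and identify exactly what is being asked:\n{question}")
    # Step 2: solve it, using CoT within this single link of the chain.
    worked = llm(f"Problem:\n{restated}\n\nSolve this step by step.")
    # Step 3: verify the work and produce the final answer.
    return llm(f"Here is a worked solution:\n{worked}\n\nCheck it for errors, then state the final answer.")
```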
Another way to scale test-time compute is repeated sampling: run the same inference multiple times, and use some sort of reranking or voting mechanism to choose among the results, assuming they’re not all the same. For problems where answers are automatically verifiable, e.g., in certain coding tasks, you don’t need to choose the best candidate—you can just attempt all of them and see which one works. But otherwise you do need something else—a simple majority vote is the easiest, but there are other selection strategies. Another option is to use a second verifier model, for example, a process reward model, that is trained to (help) choose the best response.
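The simplest version, repeated sampling with a majority vote (often called self-consistency), might look like the sketch below; `llm()` and `extract_answer()` are placeholders for a sampled completion and for whatever logic pulls the final answer out of it:

```python
from collections import Counter

def llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for one sampled completion from your model."""
    return "(model output ending with an answer)"

def extract_answer(completion: str) -> str:
    """Placeholder: pull the final answer (e.g., a number) out of a completion."""
    return completion.strip().splitlines()[-1]

def majority_vote(question: str, n_samples: int = 16) -> str:
    """Sample the same prompt n times and return the most common answer."""
    answers = [extract_answer(llm(question)) for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```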
There are also various strategies that apply some sort of combination of these things. For example, tree of thought or self-reflection, where the model iteratively generates new reasoning paths and selects which ones to continue or improve.
what’s the problem?
All of the strategies mentioned are helpful in using more compute to elicit better outputs from LLMs. But they don’t seem to scale in the same way that compute at train-time does.
As mentioned, current models can’t productively output longer and longer chains of thought beyond a pretty restrictive limit. Prompt chains can more effectively scale out reasoning chains, but they require someone to design the prompt chain for a specific problem.
Repeated sampling is perhaps more promising. For one, you can generate new samples for as long as you want, and each sample is independent, so it doesn’t require the model to focus indefinitely. And for two, you don’t have to do anything problem-specific, it’s a task-agnostic strategy. Some of the evidence shows that increasing samples scales for a couple orders of magnitude when the goal is coverage—i.e., the model is judged as succeeding as long as at least one of its samples is correct. But this is contingent on the problem being amenable to automatic answer verification, and many important problems are not of this type! Other work shows that when the sampling results are used to select a best candidate through voting (i.e., not a coverage situation) then performance sometimes decreases with more samples. In other words, it doesn’t scale.
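For the coverage setting, results are usually reported as pass@k, and the standard unbiased estimator—given n samples per problem, of which c are correct—is simple to compute:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k (coverage): the probability that at least
    one of k samples, drawn without replacement from n total samples of which
    c are correct, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g., 1000 samples per problem, 12 of which pass the automatic checks:
print(pass_at_k(n=1000, c=12, k=10))   # coverage at k=10
print(pass_at_k(n=1000, c=12, k=100))  # coverage keeps climbing as k grows
```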
What all of this means is that there’s no straightforward way to simply press a “more compute” button at test-time to get predictably better results, in the way that there is (more or less) at train-time. Yes, there are specific strategies that reliably get better results at test-time, but they don’t scale nearly as much as the analogous methods for training.
This is why the o1 model family, and the charts at the top of this post, are so significant. It means that OpenAI has developed a type of model that has this “more compute” button available at test-time. Though they admit in their announcement posts that the “constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them”, it seems to mark the advent of a new paradigm in understanding and eliciting LLM capabilities.
what we know about o1
In my post shortly after the launch of o1, I wrote some commentary about how the models seem to work. Since then, I’ve familiarized myself more with the literature around test-time computation and what other researchers have been saying, so I can say a little bit more from that perspective now.
The fundamental question around o1 is this: how do you get a model to use additional test-time compute not only effectively, but also in a way that scales?
This question naturally decomposes into a number of other questions.
First, how do you get a model to use additional test-time compute effectively? Most of this post up to this point has been an attempt to explain the current ways of doing this. It’s clear that what o1 is doing combines two of the basic methods: chain of thought and repeated sampling.
Regarding chain of thought, OpenAI’s announcements call this out explicitly, and the example reasoning chains they give are rigorous applications of chain of thought. In my earlier post I pointed out how these chains of thought seemed different from what I was used to seeing from LLMs. The o1 CoTs show a persistence and focus on their subjects that will be remarkable to anyone who has struggled to get LLMs to do this. The other remarkable property of these CoTs is that they show the model backtracking, trying new possibilities, reflecting, and so on. I’ll say more about this later.
But the third remarkable characteristic of the o1 CoTs, and one which I did not realize at the time, was brought to my attention a few days ago when I saw this post from one of the researchers who worked on o1 (and who, probably not coincidentally, helped invent CoT in the first place). The basic idea is this: conventional CoT in LLMs is not an authentic “chain of thought” the way humans think. It is a pattern match or reproduction of CoTs seen in training. But the types of data that LLMs are usually trained on are not organic CoTs—they are moderately to extremely refined pieces of writing. A textbook, for instance, is not a naive attempt to arrive at some answer or truth. The answer was already known before the textbook was written, and the purpose of the textbook is to reveal that answer in an instructive way. This means that when an LLM invokes a CoT, because it is pattern-matching its training data, in a certain sense its answer is predetermined, even if it seems to naturally be taking one step at a time towards a not-yet-determined answer. But the CoTs in o1 are, supposedly, authentic: the model is organically discovering small amounts of information step by step, and does not know where it will end up, just like a human does when they are actually thinking—not when they are writing up the results of their thinking. Thus the stopping to consider, the “hmm”s, the backtracking, etc. This is encapsulated in what to me is the most interesting part of the post, its last sentence: “the information density is *much* more uniform in the [o1] chain of thought than average text on the internet.” In other words: o1 mimics real thought; other LLMs do not.
Now, moving to the second component of o1 after CoT: repeated sampling. This part is less immediately obvious. It’s clear from the reasoning examples in the launch announcement that o1 is able to try different solutions to a problem. It’s able to recognize when one approach isn’t working, and pivot to a new one, or go back and correct its mistakes. This is analogous to repeated sampling in that it’s the same model trying different solutions. The major difference, of course, is that the model appears to be doing it in a single inference, rather than multiple, independent inferences.2
What this really is, in fact, is search. Many people recognized this immediately when o1 came out, though I didn’t fully grasp the implications at the time. What o1 shows the ability to do is search multiple paths—in the sense of computational paths of token sequences, which manifests as semantic paths of different chains of thought. The tricky thing about search in this context is that it’s not merely about trying different things—it’s about knowing when to try them. LLMs (notoriously) stick to the path they get started on, but o1 remarkably changes direction on its own, deciding when it needs to consider a different approach, or even pausing to think.
In the next few sections I’m going to go into more detail on how o1 might be doing what it’s doing. This is based on reading some of the relevant literature, reading commentary from researchers, and, of course, thinking about it a lot. So far, the best treatments I’ve seen on how o1 probably works are from Interconnects, specifically this post when o1 came out, and this post from about a year ago, when what would become known as o1 was still being rumored about as Q*. I recommend reading those posts as the knowledge and experience there is far superior to my own. But since we don’t really know what’s going on behind the scenes until OpenAI tells us, I will give my own take.
what we don’t know about o1 — model or system?
o1 was trained to reason with reinforcement learning, and that is evident in how its chains of thought explore different states (i.e., token trajectories, reasoning paths, rollouts, whatever you want to call them). A key question is whether o1 is generating multiple distinct paths at inference time, and then choosing the best to continue with, or whether it is actually a single pass of autoregressive generation changing direction on the fly.
The first Interconnects post linked above seems to argue that the answer is the former. And various other commentary I’ve seen online also suggests that. This makes sense: if you had to envision how to build o1, you would picture something like an exploratory generator model and a separate reward/verifier model choosing which paths to continue on. That would be a fairly straightforward application of reinforcement learning to language modeling. Look around (try new strings of text) and get rewarded or punished for it.
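If that picture is right, the inference loop might look something like the sketch below. This is purely my guess at the shape of such a system, with placeholder `sample_continuations()` and `reward()` functions standing in for the exploratory generator and the reward/verifier model:

```python
def sample_continuations(prefix: str, n: int) -> list[str]:
    """Placeholder: sample n candidate continuations of a partial reasoning path,
    each extending it by one 'step' (some chunk of tokens)."""
    return [f"{prefix} (candidate step {i})" for i in range(n)]

def reward(path: str) -> float:
    """Placeholder: the reward/verifier model scores a partial reasoning path."""
    return 0.0

def generate_with_search(question: str, n_branches: int = 4, n_steps: int = 8) -> str:
    """Guess at a generator-plus-verifier loop: branch, score, keep the best, repeat."""
    path = question
    for _ in range(n_steps):
        candidates = sample_continuations(path, n=n_branches)
        path = max(candidates, key=reward)  # the reward model picks which path to continue
    return path
```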
The reason I’m cautious about fully committing to that assumption is that it seems OpenAI has gone to great lengths to claim that it’s not a system with multiple components. For one, they’ve explicitly claimed that o1 is a model, not a system.
For two, although in practice the reasoning tokens generated by o1 will be hidden from the user, in the o1 announcement they showed what were ostensibly the true reasoning tokens behind the examples. Of course, one could take the different sampled paths and the choices of the reward model and prune that to a single chosen path. But if so, then the presentation of the reasoning tokens in the announcement would have been very misleading. It’s not that I expect OpenAI to be forthcoming, or even that I expect them (or any company) not to be misleading in their press releases, but still, misrepresenting the reasoning tokens in that way would strike me as unnecessarily deceptive.
I am still learning about this area so I won’t say too much more on this front before I embarrass myself, but: naively, I don’t see why you couldn’t do something like the following (a rough sketch follows this list):
run the state exploration + reward in the data prep phase
take the reward maximizing paths and prune them to a single path, including the changes of direction, backtracking, etc.
use the pruned path as the actual training data
yield a model that can do its “search” extemporaneously at inference time, in a single generation
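To make the first three items concrete, here is a toy sketch: explore offline with some reward signal, pick the best trajectory, and flatten it—dead ends included—into a single chain of thought to use as training text. All of the names and structure here are hypothetical scaffolding, not a claim about OpenAI’s actual pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One reasoning step in an offline exploration tree, scored by a reward model."""
    text: str
    reward: float
    children: list["Node"] = field(default_factory=list)

def best_trajectory(root: Node) -> list[Node]:
    """Toy search: greedily follow the highest-reward child at each step."""
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda c: c.reward)
        path.append(node)
    return path

def linearize(path: list[Node]) -> str:
    """Flatten the chosen trajectory into a single chain of thought, keeping the
    abandoned siblings visible as explicit backtracking, for use as training text."""
    parts = [path[0].text]
    for parent, chosen in zip(path, path[1:]):
        for sibling in parent.children:
            if sibling is not chosen:  # a branch that was explored and rejected
                parts.append(f"Hmm, {sibling.text}... that doesn't seem to work. Let me try something else.")
        parts.append(chosen.text)
    return "\n".join(parts)
```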
I don’t know enough about the nuances of the behaviors RL can instill in LLMs so I won’t speculate too much more here, but I would at least be interested to try something like the above.
At any rate, it will be interesting to see how open source LLM providers implement test-time compute scaling when they inevitably manage to do so. Or, perhaps Anthropic or Google will actually explain how they do it when their turn comes.
what we don’t know about o1 — reward?
Regardless of whether the reward model was used only in training or is used at inference time, what is the reward? That is the most important question to ask if we want to understand o1. We know that o1 is trying to maximize reward, so what is the reward for reasoning?
Reward models for simple problems—say, evals LLMs were struggling with a year and a half ago—can rely on labeled outcomes for their training data. You can spend some (or a lot of) money to get people to create or evaluate reasoning paths to solve your middle school word problems.
But as you move to more complex problems, getting those labels is increasingly expensive—generating sufficiently complex problems is itself difficult and requires domain expertise, not to mention solving them, generating explanations, and correcting the model’s attempts to solve them. And on top of that, the major AI labs are (hoping to be on) a trajectory where their models ultimately discover new knowledge—that is, they have to be able to solve problems of a type for which there are no known solutions. So where do you get a reward signal that can take you there?
If the desired outcome is not known ahead of time, then naturally you want a process reward model (PRM). PRMs assign a reward at each step, rather than only once a solution is reached. This makes sense for the way o1 works: it appears to be evaluating its reasoning at each step; that’s why it sometimes changes direction or stops to think. Again, whether that is actually the result of a PRM being run at inference time against multiple candidate generations from the LLM, or a learned behavior from extensive training under a PRM, we don’t know, but perhaps either could explain what we’re seeing.
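Abstractly, a PRM is just a function from (problem, steps so far, candidate next step) to a score. A minimal sketch of how such a model might gate each step, with `prm_score()` and `propose_step()` as placeholders for whatever models actually play those roles:

```python
def propose_step(problem: str, steps_so_far: list[str]) -> str:
    """Placeholder generator: propose the next reasoning step."""
    return "(proposed next step)"

def prm_score(problem: str, steps_so_far: list[str], candidate: str) -> float:
    """Placeholder process reward model: score one candidate step in context,
    rather than waiting for a final answer."""
    return 1.0

def solve(problem: str, max_steps: int = 20, threshold: float = 0.5) -> list[str]:
    """Keep a step only if the PRM thinks it's promising; otherwise discard it and
    re-propose -- one way the 'changes direction or stops to think' behavior could arise."""
    steps: list[str] = []
    for _ in range(max_steps):
        candidate = propose_step(problem, steps)
        if prm_score(problem, steps, candidate) >= threshold:
            steps.append(candidate)
        # else: drop the candidate and let the generator try a different direction
    return steps
```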
So then how do you develop a PRM that can do this for increasingly complex reasoning problems? This is where things get interesting, and at least for me, speculative. Here are a couple thoughts.
One, if you are a tremendously well financed company, then you are able to pay real experts in many fields to create the highest quality data to train your PRM. Even if this data consists of problems with known answers, you’d still expect it to generalize to other difficult problems, which may or may not have known answers. This can probably get you quite far compared to non-expert data, but to what extent can it surpass the knowledge intrinsic in that data? I’m not sure, but if this is all that’s going on, then the secret behind o1 is just “we paid more money for more and better data”—which is probably part of the answer, but unlikely to be all of it.
Two, there are probably some universal, or pretty general, principles behind what constitutes “good reasoning”. The most obvious, or at least nameable, examples are things like avoiding formal and informal fallacies, but in the context of LLMs specifically, things like “doing math correctly” or “double checking the context for relevant information” and so on are also important, and you can think of increasingly specific ones as you go further into a particular domain, culminating in a taxonomy of reasoning errors (or best practices). One might hypothesize that the path to any new knowledge will follow these same principles—the difference will primarily be the content of the reasoning steps along that path.
So, rather than merely collecting a huge amount of high quality data and training a PRM on it, you might architect your PRM to verify whether principles of good reasoning are being followed. In a way, this would go against the trend of deep learning using entirely learned representations, but I think it’s an interesting approach to consider. For instance, you might train multiple PRMs, each functioning as a verifier for a specific type of reasoning error. For example, a verifier that checks that any arithmetic is done correctly (i.e., a calculator), a verifier that checks whether an assumption is consistent with the problem statement, a verifier that checks for modus tollens, a verifier that checks for consistency with previous claims in the reasoning steps, etc. And perhaps at the time of assigning reward, there’s first a classification step that determines the most relevant PRM(s) to invoke.
A benefit of this approach is that it instills a form of interpretability and/or steerability into the model. You are choosing what constitutes good and bad reasoning for the model in principle.
The interesting thing about this approach is that many of the reasoning errors I mentioned can be verified easily by the LLM itself. These individual verifiers could be implemented as prompts for the same model that serves as the generator—in other words, by using more test-time compute3. The more powerful the generator LLM in the first place, the more general each verifier can be. And so, it might in the end make sense to just have a single PRM, but one that is checking for every reasoning error in the taxonomy.
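As a toy version of that idea: each verifier in the taxonomy is just a prompt to the same (placeholder) model that does the generating, a router decides which checks apply, and the step’s reward is the fraction of applicable checks it passes. All of the prompts, names, and routing logic below are made up for illustration:

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to the same model that serves as the generator."""
    return "yes"

# A (very small) taxonomy of reasoning checks, each phrased as a yes/no prompt.
VERIFIER_PROMPTS = {
    "arithmetic": "Is every arithmetic operation in this step computed correctly? Answer yes or no.\n\nStep:\n{step}",
    "consistency": "Is this step consistent with the problem statement and all previous steps? Answer yes or no.\n\nContext:\n{context}\n\nStep:\n{step}",
}

def route(step: str) -> list[str]:
    """Placeholder classifier: decide which checks are relevant to this step."""
    return list(VERIFIER_PROMPTS)

def step_reward(context: str, step: str) -> float:
    """Reward for one reasoning step = fraction of applicable checks it passes."""
    checks = route(step)
    if not checks:
        return 1.0
    passed = sum(
        llm(VERIFIER_PROMPTS[name].format(context=context, step=step)).strip().lower().startswith("yes")
        for name in checks
    )
    return passed / len(checks)
```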
Something like this must be trivially true. If o1 is successfully optimized to perform good reasoning, then its reward model is rewarding whatever is intrinsic in reasoning steps that constitutes “good reasoning”—we just don’t know exactly what that thing is or how OpenAI has modeled it.
what we don’t know about o1 — scale?
Even if we understood the reward behind o1, we still wouldn’t necessarily understand how it’s scaling—or at least how that leads to the discovery of scaling laws.
The question returns to this: how do you press the “more compute” button at test-time? The answer is: you have a reward model. The longer you generate, the more reward you accumulate, and so the better the reasoning in your answers. But how do you actually implement that in practice? How do you make it so that your LLM generates better answers the longer you generate?
For a fixed compute budget at test-time, with a generator + reward setup, you’d have several parameters. First would be breadth-search: trying different paths. The more discrete paths your generator generates, the better the chance of finding the right path. Second would be depth-search. Each discrete path can be generated for longer before deciding which one to pick. The longer each path, the more accurate the reward model’s assessment of whether it is likely to be a promising path to follow. You could also allocate some of the compute budget to the reward model—in fact, you could imagine that the reward model also generates its rewards in the same way as the generator, and can allocate its own compute budget to breadth and depth—search all the way down.
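In pseudocode, those knobs might look like the sketch below: a fixed token budget split between how many paths you keep alive (breadth) and how far you extend each one before pruning (depth), with the reward model doing the pruning. This is a guess at the shape of the problem, not a description of o1; `extend()` and `reward()` are placeholders:

```python
def extend(path: str, tokens: int) -> list[str]:
    """Placeholder: sample a few continuations of one path, each roughly `tokens` long."""
    return [path + " (continuation A)", path + " (continuation B)"]

def reward(path: str) -> float:
    """Placeholder reward model score for a (partial) reasoning path."""
    return 0.0

def budgeted_search(question: str, total_tokens: int, breadth: int, depth_tokens: int) -> str:
    """Spend a fixed test-time budget: each round, extend the surviving paths by
    `depth_tokens` tokens, then keep only the `breadth` best according to the reward model."""
    paths = [question]
    spent = 0
    while spent + len(paths) * depth_tokens <= total_tokens:
        candidates = [c for p in paths for c in extend(p, depth_tokens)]
        spent += len(paths) * depth_tokens
        paths = sorted(candidates, key=reward, reverse=True)[:breadth]
    return max(paths, key=reward)
```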
In this picture the scaling laws emerge as you discover the optimal allocation of compute budget to these different parameters. For a given amount of compute, how much generator-breadth and how much generator-depth? How much reward-breadth and how much reward-depth?
If it’s just a single autoregressive generation, then under the simplest assumptions there are no knobs to turn. Your compute budget is allocated entirely to generating more tokens in the single path you are on. I’d speculate that to do this effectively you’d vary the length penalty during training. You’d need some kind of length penalty in the RL phase anyway, so that the model doesn’t just keep generating tokens to accumulate infinite reward from the PRM. The generator has to give an answer eventually, or else it’s not actually helpful, and the length penalty enforces that. So you might train the model under different length penalties, such that it learns the optimal reasoning strategy for each length, which means learning the optimal reasoning strategy for different amounts of compute. In practice this would amount to something like the parameters in the first setup—how often should it consider a different path, how long should it go on a single path? Perhaps at test-time there is even a step where the model estimates the optimal amount of compute time and generates accordingly. Or in the future, the user will be able to choose how much compute they want to spend on the generation.
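Written out, the kind of objective I’m imagining (entirely speculative) is the accumulated per-step process reward minus a cost proportional to the number of tokens generated, with the penalty weight varied across training runs:

```latex
% Speculative length-penalized objective for the reasoning policy \pi:
% r_PRM is the process reward for step s_t, \lambda prices each generated token,
% and |\tau| is the number of tokens in the trajectory \tau.
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\, \sum_{t=1}^{T} r_{\mathrm{PRM}}(s_t) \;-\; \lambda \,\lvert \tau \rvert \,\right]
```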
The key difference is that in the first setup, where you have a system consisting of a generator and a reward model, the user of the LLM can choose how to allocate compute to different components of the system, like breadth and depth. The optimal strategy is discoverable by testing differently allocated compute budgets and seeing how they scale empirically over larger total compute budgets. In the second setup, where it is just a single model generating a single stream of tokens, then it’s the model that has to learn the optimal strategy in training. You learn how it scales empirically by seeing how its performance improves over longer generations, with no strings attached.
This is how I’m thinking about it, but I’m sure there are many clever things the big players have up their sleeves I’ve not thought of.
self-improvement and paradox
Until OpenAI announces what they’re doing or other labs follow up and do the same, there will remain a lot of unanswered questions. I’ll close this post on a few bigger picture notes.
The most important thing about being able to scale test-time compute is that it gives you another lever for better model performance. We can scale train-time compute, and now we can scale test-time compute on top of that.
But this also suggests an obvious question: if we can use more compute at test-time to get better results, can we use that data to train an even better model? Can a model like o1 recursively improve itself?
It’s interesting to think about what this means, and to me it sometimes feels paradoxical. Whatever data was used to train o1 resulted in its current capabilities. Whatever outputs we elicit from o1 are, by definition, outputs it is already capable of producing. So how can feeding these outputs back into it make it even better? How can it learn something from itself? This feels puzzling in a philosophical way.
And yet, this question is actually not unique to o1. Prompting any other LLM already elicits improved performance. But again—by definition, that performance was already in the model. We are not updating the weights when we prompt a model. It can already do the thing you’re asking it to do. You might not have known that until you did it, but your doing it didn’t change anything. This is part of why prompting is so interesting: there is a world of unexplored capabilities already within models that you can elicit with the right combinations of words. In an earlier post on prompting I predicted that LLM providers would feed the well-known prompting strategies back into models to improve them, and in fact that was already clearly happening at the time, as many LLMs were giving step-by-step answers even when not prompted to do so.
The difference with o1 is that we can, apparently, scale those improved capabilities over multiple orders of magnitude. A better prompt might in many cases give you, say, a 20% performance improvement in a 100-token response. But let o1 generate 10 million tokens to answer that same question—five orders of magnitude more—with a similar improvement compounding for every 10x tokens, and you’re looking at roughly 2.5x the baseline (1.2^5 ≈ 2.5). Numbers like these are cute when the frontier-level problems are things like grade school math, exciting when the frontier-level problems are things like writing code, and difficult to comprehend when the frontier-level problems are things like doing novel research.
We will see what unfolds as o1-preview gives way to o1-full, and other labs release their own reasoning-optimized models. For now we peer into a realm dark like the night sky, with only a few points of dazzling light in the distance, stars coalescing into something brighter, a constellation…
🍓
resources
Here are some of the papers and blog posts linked above, as well as some other ones that are relevant to scaling test-time compute.
Scaling Laws for Neural Language Models
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems (I’ve only glanced at this one)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Reverse Engineering OpenAI’s o1 from Interconnects
I wrote about this and similar strategies in an earlier post, though at the time, the concept of test-time compute was not yet on my radar.
More on this in the next section.
I.e., using more test-time compute, but at train-time. Hmm.