When DeepSeek released r1, there was a flurry of reproductions doing RL on smaller models. Since then, we’ve also seen papers that report incredible results on math datasets with much smaller models, trained with tiny amounts of data.
The curious thing is that some of these studies don’t even use RL. Two of the papers that made the rounds recently, s1 and LIMO, exclusively use SFT. And yet both report results that beat o1-preview on certain reasoning tasks, using a 32B base model. How is that possible, when we’ve been made to believe that RL is the key to better LLM reasoning?
Alongside that, there has been a lot of discussion of the chains of thought (CoTs) produced by reasoning models, and how they are the key to reasoning with RL (I’ve written about this in detail here). But we know that LLMs have long been trained with SFT on datasets with CoTs—before the advent of reasoning models, many LLMs would produce CoTs to answer questions, even when not explicitly asked to. So why does RL work so much better?
In other words, there are some apparent contradictions:
RL leads to better reasoning capabilities than SFT does, but we see SFT outperforming RL on some reasoning problems with much less training data.
RL supposedly works so well because it teaches models to use CoTs, but models have long been trained with reasoning CoTs and been able to produce them.
In this post I will attempt to resolve these contradictions.
why does RL work better than SFT?
Let’s start with the second contradiction. Why does RL work so much better for reasoning than SFT, even though models trained with both methods are trained to solve problems with CoTs?
We should start by looking at the CoTs these models produce. Again, I have written about the peculiar nature of reasoning model CoTs, and I assume that anyone reading this has seen the output of models like r1, but to put it briefly: reasoning model CoTs are characterized by reasoning steps that are small, organic, and numerous. Base models1, on the other hand, produce CoTs that read like textbook demonstrations.
The different character of the CoTs produced by reasoning models and base models must be causally related to their reasoning ability. In other words: reasoning models are better at reasoning because of the specific type of CoTs that they produce. Or better yet: the CoTs are the reasoning. The CoTs from reasoning models are better instantiations of LLM reasoning than the CoTs from base models. I think we must accept something like this; otherwise it’s not clear what to attribute the better reasoning performance to. If the CoTs are not causal with respect to the reasoning performance, then there must be some mysterious phenomenon that happens at the end of a reasoning model output that is more likely to give a correct answer than whatever happens at the end of a base model output, and this mysterious phenomenon would somehow have to be unrelated to the tokens generated beforehand. That is obviously untenable.
Does this tell us anything interesting? I think so. It tells us that until reasoning models trained with RL were developed, labs were training their base models on bad data. Regardless of whether it’s optimized for reward, loss, or something else, a model will produce outputs that look like the data it’s trained on. If reasoning models produce a different type of output than base models, and that output causes—or simply is—better reasoning, then: reasoning models are trained on better reasoning data than base models are.
This makes even more sense if we consider the source of the data used for RL on a reasoning model vs. SFT on a base model.
When a base model is trained on reasoning problems with SFT, the CoTs it learns from come from the dataset the researchers provide. The CoTs might have been annotated by a human, distilled from another model, etc., depending on the dataset. But ultimately it means that the CoTs are ones that a human thought would be useful for the LLM to learn how to reason from.
But with RL, the source of the CoTs is the model itself, through the exploration of the RL process. And the model is optimized precisely for those CoTs which produce the best results. In other words: in RL, the CoTs are by definition the ones that help the model learn best. And this is no surprise, because the whole point of RL is to iteratively discover and refine the optimal strategies, which for an LLM are its CoTs.
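To make that concrete, here is a toy sketch of the exploration step in code. The generate and reward callables are hypothetical stand-ins, and simply keeping the rewarded CoTs is a deliberate simplification of the weighted policy-gradient update (GRPO, PPO, etc.) that real pipelines perform. The only point it illustrates is that the training CoTs come from the model's own sampling, filtered by reward.

```python
# Toy sketch of RL exploration for reasoning: the model samples its own CoTs,
# and only the ones that earn reward are kept. The callables are hypothetical
# stand-ins; real pipelines do a reward-weighted update (e.g. GRPO/PPO) rather
# than simple filtering, but the source of the data is the same.
from typing import Callable, List

def rl_exploration_step(
    generate: Callable[[str], str],       # samples one CoT + answer for a problem
    reward: Callable[[str, str], float],  # e.g. 1.0 if the final answer is correct
    problems: List[str],
    samples_per_problem: int = 8,
) -> List[str]:
    """Return the CoTs the model itself discovered that earned reward.
    These are the CoTs the model gets optimized toward."""
    discovered = []
    for problem in problems:
        candidates = [generate(problem) for _ in range(samples_per_problem)]
        discovered.extend(c for c in candidates if reward(problem, c) > 0)
    return discovered
```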
To put this succinctly: RL discovers LLM-optimal CoTs. LLM-optimal CoTs are the CoTs that best teach an LLM to reason, and by extension, the same CoTs that, when generated by an LLM, yield the best reasoning results. CoTs from other sources, such as human annotators, may or may not be LLM-optimal. In practice, it seems that the reasoning datasets traditionally used for SFT had suboptimal CoTs, given that the CoTs from RL work so much better.
I find this to be a simpler and more coherent view of why RL has been so successful for reasoning than what I have seen elsewhere. I am more or less suggesting that RL is a very good (to put it mildly) data curation pipeline. Merely training on CoTs that end in correct answers is not enough. The CoTs need to have certain qualities that induce better reasoning behavior in a model. RL is able to discover CoTs with these qualities better than human annotators or data curators generally have.
In a sense this is trivially true: it’s not a secret that the point of RL is to learn good strategies, which in the case of LLMs means which tokens to generate. But I can’t recall seeing a clear statement that what this amounts to is finding better training data for models—the exact same type of data that could be used with SFT—and that we can actually characterize this data very well and contrast it with other types of data2.
To be clear: this is not a complete explanation of why RL works, nor of why it started working in the past several months. Things like verifiable rewards, the effect of optimizing for reward vs. loss, the discovery of better algorithms like GRPO, and engineering work to actually make the training work are all very important. I am not claiming to give the full story of RL for reasoning, but I am claiming to answer the question of why RL works better than SFT for reasoning, even though both rely on CoTs. And I think that tells us a lot about why RL works.
why does SFT work better than RL?
Now that we’ve established why RL works better than SFT for reasoning, we can turn to the first contradiction stated at the beginning of this post: if RL is better than SFT for reasoning, why does SFT sometimes work better than RL?
Two papers from earlier this year, s1: Simple test-time scaling and LIMO: Less is More for Reasoning, both use SFT to achieve better results on certain reasoning problems than o1-preview. The two studies are remarkably similar:
both SFT a Qwen-32B model on math problems
both report results that beat o1-preview on math benchmarks
both use a carefully curated dataset of ~1k high-quality reasoning examples as their training data (a rough sketch of this kind of run follows below)
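For concreteness, a run in this style might look roughly like the sketch below, using Hugging Face's TRL. The model name, dataset path, and hyperparameters are illustrative placeholders rather than either paper's actual configuration, and the exact SFTTrainer API varies across TRL versions.

```python
# Rough sketch of an s1/LIMO-style run: SFT on ~1k curated reasoning traces.
# Model name, dataset path, and hyperparameters are placeholders, not the
# papers' actual setup; the SFTTrainer API differs somewhat between TRL versions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed format: each example has a "text" field containing the question
# followed by a reasoning-model-generated CoT and the final answer.
dataset = load_dataset("json", data_files="curated_1k_reasoning.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",   # both papers fine-tune a Qwen 32B model
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="sft-reasoning-1k",
        num_train_epochs=5,              # tiny datasets get multiple epochs
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```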
These results are difficult to account for if RL is better than SFT for reasoning. To be sure, no one would claim that any implementation of RL is better than every implementation of SFT. But in these studies we have a presumably well-implemented RL reasoning model (o1-preview) pitted against a much smaller model fine-tuned on a tiny dataset. So how is it possible that the SFT model wins?
If we consider the conclusions from the previous section, we can make sense of this. What I suggested there is that the primary reason RL works is because it discovers higher quality CoTs than the ones found in the datasets traditionally used for SFT.
With that in mind, what stands out about the two studies above? They both use highly curated datasets. Among other things, they use various techniques to select high quality and more difficult questions as training examples.
But critically, they both generate the CoTs for the training examples using reasoning models! The s1 paper uses Gemini Flash Thinking, and the LIMO paper uses DeepSeek r1. In other words, they use CoTs that are generated by models trained to produce LLM-optimal CoTs!3 But instead of using RL to train their model, they use SFT, and find that it works very well.
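As an illustration, assembling such a distillation dataset might look like the sketch below. This is not the papers' actual pipeline: the OpenAI-compatible client, model name, and file paths are assumptions and placeholders, and as far as I know DeepSeek's API returns the CoT in a separate reasoning_content field, so the code falls back gracefully if that field is absent.

```python
# Sketch of building an SFT dataset whose CoTs come from a reasoning model.
# Assumes an OpenAI-compatible endpoint; the base_url, model name, and file
# paths are placeholders, and this is not the actual s1 or LIMO pipeline.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")  # placeholder key

def distill_cot(question: str) -> str:
    """Ask a reasoning model for its chain of thought plus final answer."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",  # r1 behind DeepSeek's API
        messages=[{"role": "user", "content": question}],
    )
    msg = response.choices[0].message
    # The CoT may come back in a separate field; fall back to empty if absent.
    cot = getattr(msg, "reasoning_content", None) or ""
    return cot + "\n" + msg.content

# Pair curated questions with reasoning-model CoTs, in the format the SFT sketch above expects.
with open("curated_questions.jsonl") as f_in, open("curated_1k_reasoning.jsonl", "w") as f_out:
    for line in f_in:
        question = json.loads(line)["question"]
        f_out.write(json.dumps({"text": question + "\n" + distill_cot(question)}) + "\n")
```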
This supports my claim that one of the main advantages of RL is simply the discovery of better data. These papers show that when that same type of data is used to train a model with SFT, it is extremely effective. The problem has always been that SFT was used with lower quality data. We can get excellent reasoning results from SFT, as long as the data are good, and RL is an excellent way to find that good data.
a summary and some observations
This post is an attempt to explain my intuitions around why RL works and some of the things we are seeing in the research, and ground that in the actual behavior we see from reasoning LLMs. Everyone notices that reasoning models produce quirky CoTs, but I haven’t seen a lot of formal characterizations of those CoTs that connect them to the RL process itself in a satisfying way.
If I had to summarize my thesis, it is this: RL works largely because it produces LLM-optimal CoTs, which are characterized by short, tentative, repetitive, and numerous reasoning steps. Traditional SFT reasoning datasets had CoTs that were not LLM-optimal: they were contrived and artificial. Not good teaching examples for the models! RL works because through its exploration, it learns how to produce the best CoTs and optimizes the model on them. But those same CoTs can also elicit excellent reasoning capabilities through SFT, and that is what some studies have shown. This is further evidence that the data itself is the most important thing, not the objective function—which is not to say that reward doesn’t have certain advantages over loss, just that it’s not everything.
And now a few observations:
If you have heard people say things like, “you’ll know you’re doing RL right if the model’s thoughts look like gibberish”4, then this is another way of talking about LLM-optimal CoTs. Right now the best CoTs for LLMs that we know of look pretty similar to a human’s internal chain of thought, but that doesn’t mean they are the best possible ones. LLMs may think best by generating token sequences that are inscrutable to a human, but maybe our RL isn’t good enough to discover that yet.
Even if careful data curation can give good results in some studies, it will never beat RL in the long run. The advantage of RL is that with the right reward function, you can pretty directly turn dollars into compute, and compute into high quality data. It’s much less scalable to do this by replacing compute with humans.
Reasoning capabilities in LLMs are sometimes framed as elicitation of capabilities that are mostly already present in the model. See here, as well as the LIMO paper cited above. But what is the purest form of capabilities elicitation in LLMs? Prompting. A very interesting and easy experiment would be to redo the original CoT paper, but with the CoT examples in the prompts sourced from r1.
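Concretely, the experiment would amount to something like the sketch below: standard few-shot CoT prompting, but with the exemplar reasoning traces pasted in from r1 outputs instead of written by hand. The exemplars here are placeholders to be filled with real r1 traces.

```python
# Sketch of the suggested experiment: classic few-shot CoT prompting, with the
# exemplar chains of thought generated by r1 rather than written by humans.
# The exemplar entries are placeholders; a real run would paste in actual r1
# outputs for a handful of benchmark questions.

R1_EXEMPLARS = [
    {"question": "...", "r1_cot_and_answer": "..."},  # fill with real r1 traces
    {"question": "...", "r1_cot_and_answer": "..."},
]

def build_cot_prompt(new_question: str) -> str:
    """Few-shot CoT prompt in the style of the original CoT paper, with r1-sourced demonstrations."""
    parts = []
    for ex in R1_EXEMPLARS:
        parts.append(f"Q: {ex['question']}\nA: {ex['r1_cot_and_answer']}")
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)
```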
Currently, RL for reasoning is dependent on verifiable rewards. That’s why reasoning in the context of LLMs has been very conveniently defined as more or less only those domains that have verifiable rewards, like coding and math. The open question is whether something like verifiable rewards can be found for other domains. For what it’s worth, I think many people, myself included, find that reasoning models still seem to be better for most tasks, even those that don’t fall into the reasoning bucket for LLMs.
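For reference, a verifiable reward for math can be as simple as the check below. This exact-match version is an illustration rather than any particular lab's verifier, and it is also the kind of reward callable the RL sketch earlier assumes.

```python
# Minimal example of a verifiable reward for math: reward 1.0 only if the
# \boxed{...} answer in the completion matches the reference answer.
# Real verifiers normalize expressions (e.g. with a symbolic math library).
import re

def math_reward(completion: str, reference_answer: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```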
1. This is what I will call models that aren’t trained with RL for reasoning. Models like GPT-4o and Claude 3.5 Sonnet.
2. Might do a post on this.
3. r1 is of course trained with RL, but technically I have not been able to find an explicit statement from Google that the Gemini Flash Thinking model was trained with RL. However, they describe it as a “reasoning model”, and they do say that their Gemini 2.0 models were trained with RL, so I think it is a safe bet.
4. I believe Karpathy had a tweet like this but I can’t find it.