Futures Part 2 - LLM-syncratic Strategies
Introduction
In the first post in this series I laid out a taxonomy of prompting strategies for LLMs. The purpose of this taxonomy is to serve as the starting point for answering the question of whether prompt quality will remain important as LLMs become more capable.
In this post I’m going to consider that question specifically in relation to one of the types of prompting strategies I discussed in the first post, which I then described as “Strategies that insert (mostly) task-agnostic content into a prompt.”
I’ll repeat what I wrote in the previous post about these strategies, since it’s not very long:
Examples:
role-playing
arrange the components of a prompt in a specific order
There are a couple of ways of characterizing this group.
First, the strategies in this group can each be applied to a prompt without knowing much, if anything, about the specific task. To be sure, many of these strategies are specific to task types, but not to individual tasks. Telling the LLM to think step by step or compare possible responses will generally be useful for most reasoning tasks, regardless of what the specific problem is. Using XML tags around examples will generally be useful for prompts for Claude models, independent of the substance of the examples.
Second, they are all strategies that seem like they wouldn’t matter very much if you were giving instructions to a human—assuming the human is intelligent, knowledgeable in the relevant domain, focused, and willing to execute the task. You would not expect a mathematician to write a better proof merely because you told (reminded) them “you are a mathematician” at the start.
One might think of these strategies as something like the “one weird tricks” of prompting. They function like a secret button that, when pressed, gets you a better-quality response from the LLM without meaningfully modifying the task instruction itself. They unlock LLM capabilities that are readily available, but (perhaps puzzlingly) are not used unless elicited through these strategies.
As the description suggests, I think we can understand these strategies as being idiosyncratic to LLMs. Or rather, they are ways to compensate for the idiosyncrasies and deficiencies of current LLMs, as reasoners. And thus, what will surely become the preferred term of art: the name “LLM-syncratic strategies”.
The effectiveness of LLM-syncratic strategies is a direct consequence of the peculiarities of a particular model’s training data, training strategies, and architecture. That is why many of these strategies are things that shouldn’t really work on an entity that reasons the way most humans intuitively think of reasoning, but they do work, because LLMs are imperfect reasoners and these strategies address their idiosyncratic imperfections[1].
Future
So, what will happen to these prompting strategies as LLMs become more capable?
I predict that these strategies will become less important—and many of them likely irrelevant—as LLMs become more capable. For the most part, the idiosyncrasies of LLMs that these strategies target are precisely the sort of problems that I expect the next generation of LLMs to have solved, or at least made considerable progress on.
More training data, more parameters, and other assuredly clever innovations will of course help. But plausibly, better fine-tuning is the only necessary solution. A prompt, by definition, can only elicit capabilities that an LLM already has. When considering the future of LLM-syncratic strategies, the question is not, then, whether future LLMs will be able to solve more advanced math problems, for example. Rather, it’s about getting the LLM to consistently solve math problems that we already know it has the ability to solve. The problem is that right now, LLMs sometimes just don’t do what they’re capable of unless the prompt is just right.
It’s not hard to envision a fine-tuning process that would more reliably elicit these capabilities. Because the strategies under discussion are task-agnostic yet applicable to broad classes of problems, you could write a kind of meta-prompt—or, more likely, a set of such prompts—to wrap around ordinary user prompts. The meta-prompt would describe the best-known prompting strategies for different problem types and rewrite the user prompt to apply the relevant ones.
As a highly compressed example, such a meta-prompt might look something like this:
If the question involves reasoning beyond mere knowledge retrieval, explain your thinking step-by-step. If the question involves expertise about a domain of knowledge, adopt the persona of an expert in that domain. If the question has illustrative examples, rewrite the examples with XML tags around them. If the question has a reference text and instructions, place the instructions after the reference text…
Then the outputs of the LLM on the rewritten prompts could be used as training data for whatever fine-tuning procedure is chosen. Whatever the exact setup, the important point is that, given a set of task-agnostic prompt strategies, much of the performance gain from those strategies should be possible to fine-tune into the model.
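To make that concrete, here is a minimal sketch of such a data-generation loop. Everything in it is hypothetical: `call_llm` is a stand-in for whichever completion API is actually used, and the meta-prompt is abbreviated to the same compressed form as above.

```python
# Hypothetical sketch of generating fine-tuning pairs from a rewriting meta-prompt.
# `call_llm` is a placeholder, not a real API.

META_PROMPT = """Rewrite the user prompt below, applying whichever of these apply:
- If it involves reasoning beyond knowledge retrieval, instruct the model to think step by step.
- If it involves domain expertise, instruct the model to adopt an expert persona.
- If it contains illustrative examples, wrap each example in XML tags.
- If it contains a reference text and instructions, place the instructions after the reference text.

User prompt:
{user_prompt}
"""


def call_llm(prompt: str) -> str:
    """Placeholder for a call to the chosen model's completion API."""
    raise NotImplementedError


def build_finetuning_pair(user_prompt: str) -> dict:
    # 1. Rewrite the raw prompt so that it applies the relevant task-agnostic strategies.
    rewritten_prompt = call_llm(META_PROMPT.format(user_prompt=user_prompt))
    # 2. Generate a response to the rewritten, strategy-enhanced prompt.
    response = call_llm(rewritten_prompt)
    # 3. Pair the *original* prompt with that response, so that fine-tuning on these
    #    pairs teaches the model to behave as if the strategies had been applied.
    return {"prompt": user_prompt, "completion": response}
```

The important design choice is step 3: the training pair uses the original prompt rather than the rewritten one, which is what lets the performance gain from the strategies be folded into the model itself.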
I expect that the major LLM providers are already working on this—not necessarily in the exact manner I proposed above, but at least in the sense of somehow “folding in” the performance gains from the well-known task-agnostic strategies. In fact, it should be obvious that they already are: even today (in May 2024) the best LLMs will respond to certain types of problems (e.g., word problems) with step-by-step reasoning, without explicit prompting to use that strategy. GPT-3.5, GPT-4, and Claude 3 Sonnet all do this, and I’m fairly sure plenty of <10B-parameter models do too, though I haven’t tested them as rigorously. These models have of course been fine-tuned on step-by-step reasoning exemplars, and I expect that their future counterparts will be fine-tuned on data using whatever analogous reasoning strategies are reported in the literature (or discovered internally).
Other Considerations
I think the above will be true of many of the well-known prompting strategies, the ones that, for some reason, get a fair amount of attention even outside the research literature. Identifying these strategies is an important exercise, but that very identification will probably make them obsolete as prompting strategies within a few months or years, as the best LLMs are continually fine-tuned to reproduce their results without needing explicit prompting to do so.
A few of the other task-agnostic strategies might behave a little differently, though. One I have in mind is arranging the components of a prompt in a specific order. In some cases—especially when the prompt is very long—making this strategy unnecessary will probably require more capable models, not just better fine-tuning.
As an example, here is a scenario I’ve seen repeatedly in my own experience: when a prompt contains a moderately long reference text (say, more than a few thousand tokens) as well as moderately long and complex instructions about what to do with that reference text, the output will be of higher quality if the reference text is placed before the instructions than if it is placed after them[2]. This is especially true for less capable models—the difference in output quality can be severe for GPT-3.5, but it is still clearly observable at times with GPT-4.
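For concreteness, here is a minimal sketch of that ordering. The function and the wrapper wording are purely illustrative, not a prescribed template:

```python
def build_prompt(reference_text: str, instructions: str) -> str:
    # Reference text first, instructions last: in my experience this ordering
    # produces noticeably better outputs than the reverse, especially with
    # less capable models and longer reference texts.
    return (
        "Here is a reference text:\n\n"
        f"{reference_text}\n\n"
        "Using only the reference text above, carry out the following instructions:\n\n"
        f"{instructions}"
    )
```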
This is, I think, a model capability/architecture problem more than one solvable through fine-tuning. It seems to be an issue of how much the model can effectively “focus” on at once, combined with a bias of that “focus” towards the most recent part of the input—a function of the model’s base capability rather than of setting the right context through a prompt[3]. Related to this, there are also some more clearly idiosyncratic behaviors where changing the order of seemingly arbitrary prompt elements results in large output differences. These I also expect to improve with better models.
Explicit Reasoning
There is a subgroup of related LLM-syncratic strategies that deserves its own comment: strategies like thinking step by step, tree of thought, ranking, and so on. These are all still firmly in the task-agnostic group, but what this subgroup has in common is that its members all entail prompting the LLM to make its reasoning explicit in some way—e.g., by spelling out its reasoning steps, proposing multiple options, evaluating those options, or some combination of the above.
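As a rough illustration of the shape these strategies share, here is a hypothetical sketch that combines step-by-step prompting with a propose-then-evaluate pass. It reuses the `call_llm` placeholder from the earlier sketch and is not meant to reproduce any particular published method:

```python
def answer_with_explicit_reasoning(question: str, n_candidates: int = 3) -> str:
    # Ask for several independent step-by-step solutions (explicit reasoning).
    candidates = [
        call_llm(f"{question}\n\nThink step by step, then state your final answer.")
        for _ in range(n_candidates)
    ]
    # Ask the model to compare the candidates and pick the best one (explicit evaluation).
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{text}" for i, text in enumerate(candidates)
    )
    return call_llm(
        f"Question: {question}\n\n{numbered}\n\n"
        "Evaluate the reasoning in each candidate and state which final answer is best, and why."
    )
```

Note that both calls return mostly reasoning rather than a bare answer, which is exactly the change of form discussed below.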
As I mentioned above, something like the simple instruction to think step by step is so fundamental at this point that I don’t think it will remain very important as a prompting strategy for long. The other explicit reasoning strategies could also possibly be fine-tuned into LLMs. But what’s interesting about these strategies is that they don’t just impact the output quality, but also its form. Because they force the LLM to carry out a reasoning process, they make for longer outputs that contain a lot of information, or at least material, beyond just the answer to a question.
For certain problems, I predict that more capable LLMs won’t even need to use these explicit reasoning strategies to arrive at the correct answer. I concede this is not a very bold prediction, because it’s a simple extrapolation of a pattern already evident in existing models. Taking word problems as an example again, both the biggest models like GPT-4 and smaller models (<10B parameters) can solve them correctly with step-by-step reasoning. But even without step-by-step reasoning, GPT-4 can still get many of these same problems right, while smaller models cannot[4]. So what I expect is that, wherever the current complexity threshold sits beyond which a model like GPT-4 requires step-by-step thinking to get the right answer—regardless of whether it actually uses that strategy—that bar will be higher for a model like GPT-5.
The more interesting question is how user needs and expectations will change. The complexity threshold will move up over time, but the threshold itself, wherever it sits, will presumably still exist (for the near-term future, at least). And as LLMs become more capable, people will use them for increasingly complex tasks. More complex tasks will tend to be more open-ended. In those cases, I predict that much of the value of LLMs will lie not so much in providing a specific, quantified answer, but in providing the explicit reasoning itself as the desired output.
If we picture using LLMs to work on tasks that we might otherwise assign as a project to a person, or to a team of people, or to an entire organization, then it should be obvious that those are tasks where we usually aren’t looking for a short answer. For things like writing a business strategy, policy analysis, research programs, etc., we will (hopefully) not be treating LLMs as oracles that provide a terse, unsupported answer. Instead we’ll be looking for LLMs to explicitly reason: identify important factors, apply relevant knowledge, and think through those elements logically—just like when individuals or organizations work on these problems, they typically produce a lengthy report supporting a recommendation, not simply an edict. The value of LLMs on these problems will come largely from humans being able to observe and interact with that reasoning, not just from taking an answer as a given.
Wrap-up
In general, LLM-syncratic prompting strategies will become less important: partly because better fine-tuned models will employ these strategies without explicit prompting, and partly because more capable models just won’t need them in the first place.
As LLMs take on more complex and open-ended tasks, explicit reasoning as the output of an LLM will become more valuable, but explicit reasoning as a prompting strategy will become less important, because the LLMs will be able to do it without explicit prompting.
This covers my predictions for this group of prompting strategies. Two more groups remain, as defined in the taxonomy I proposed in the first post in this series. In upcoming posts I’ll talk about what I expect to happen with those prompting strategies, as well as other thoughts on prompting and LLMs, based on research and industry experience.
[1] I am not trying to make a claim here as to whether LLMs truly “reason”. Only that they clearly at times appear to reason. Whatever is actually going on there, in this post I’ll call that “reasoning” out of convenience.
[2] I believe I saw something like this recommended in an old Anthropic prompting guidelines resource, but was never able to find it again. At any rate, whether I imagined the tip or not, it has proven to be useful.
[3] There is an interesting connection to discourse structure here, in that new information is typically placed later in a text. To the extent that LLMs bias their apparent “focus” towards the later elements of a prompt, this may be mirroring the structure of natural language, and so could be seen as either a desirable feature—in that it means LLMs are faithful representatives of language, i.e., language models—or an undesirable feature—in that it means LLMs are mere language models rather than reasoning models, as many would like them to be.
[4] The claims here are, of course, generalizations. Not necessarily true of all word problems for GPT-4 nor of all small models, etc.