If you work with LLMs, at some point you’ve likely come across the claim that in the future, as LLMs become more advanced, prompt engineering will become less important, or even completely irrelevant to successfully working with LLMs. There are usually two types of reasons given for this sort of claim.
The first is that future LLMs will be so intelligent, and therefore so good at understanding language, that they’ll be able to produce good outputs even with a low-quality prompt as input. In other words, even if a prompt is not well formulated, the LLM will be able to discern the user’s intent and produce the desired output anyway.
The second is that future LLMs will be able to write their own prompts that are superior to human-written prompts. There is ongoing work on automatic prompting (e.g., here and here), so this may already be true to a certain extent.
If either (or both) of these things materializes, then it follows that the value of high quality prompts written by humans will decrease, and so will the value of prompt engineering as a skill.
This is the first in a series of posts in which I’ll consider this question. In the present post I’m going to develop the question a little more, and put in some of the groundwork necessary to answer it. In subsequent posts I will more directly evaluate what current trends in research and practice mean for prompts.
Throughout these posts I’m going to mostly focus on prompt quality—by which I mean the distinguishing features that make one prompt better than another for a given task, or the strategies to achieve that—rather than prompt engineering—by which I mean a skill of a particular person, or a job function, or even a job itself. For reasons which I will leave in a footnote[1], I think the concept of prompt engineering is distracting, so I don’t want to mention it any more than I have to.
Anyway, there are a few preliminary points worth mentioning:
First, it seems prima facie equally plausible that prompt quality will matter more as LLMs become more capable. Better LLMs will have a higher ceiling on output quality. If prompt quality continues to matter at all, then the absolute value of prompt quality will become higher, even if the relative value is lower. All this is to say that I don’t think it should be a given that more capable LLMs will mean prompts become less important. Or at least, not in all cases.
Second, the importance of a prompt depends partly on the way an LLM is being used. Broadly speaking there are two ways of using LLMs: 1) interactively and 2) programmatically. Interactive usage is using an LLM through a traditional chat interface, like with ChatGPT. Programmatic usage is using an LLM by sending structured requests to the model provider’s API.
In interactive usage the user has nearly unlimited opportunity to iteratively refine their prompt in conversation with the LLM. Effectively, the entire conversation is the prompt (with some allowances for limited context windows). For many interactive use cases, the LLM’s output doesn’t need to be perfect; it just needs to be good enough, for example: generating a text (e.g., a code snippet) that you can further adapt to your own needs, or creating a list of options (e.g., recipes, vacation destinations) to choose from. In programmatic usage, a more typical use case is for a single prompt to handle thousands or millions of requests (or more), paired with a novel input every time. The output of the prompt can often be evaluated by metrics (e.g., accuracy, recall, user satisfaction), and is often integrated into a software system, which means that not only does the output need to be high quality according to those metrics, but the form of the output must also be structured in a specific way, or else it will be useless. My experience is that achieving a high-quality prompt is significantly harder, and more important, in programmatic usage[2]. Many of the strategies I’ll talk about below are especially applicable to programmatic usage, though not exclusively.
As a very rough analogy, you might think of writing prompts in interactive mode as writing some functions in a Jupyter notebook, and writing prompts in programmatic mode as building and deploying a software service.
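To make the programmatic picture a bit more concrete, here is a minimal sketch (my own invention, not from any real project) of a single prompt template being reused across many novel inputs and required to return a machine-parseable output. It assumes OpenAI’s Python SDK, and the template, the intent labels, and the model name are all placeholder choices:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One fixed prompt template, reused for every incoming message.
PROMPT_TEMPLATE = (
    'Classify the customer message below as one of: billing, bug_report, other. '
    'Respond with JSON only, in the form {{"label": "..."}}.\n\n'
    'Message: {message}'
)

def classify(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any chat model would do for the sketch
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(message=message)}],
    )
    raw = response.choices[0].message.content
    try:
        return json.loads(raw)["label"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # In programmatic usage a malformed output is not a minor annoyance;
        # it breaks whatever consumes it downstream, so it has to be handled explicitly.
        return "parse_error"

The specific API is beside the point; what matters is the shape of the problem: one prompt, many inputs its author never sees, and an output whose format matters as much as its content.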
Third, when I talk about future LLMs and what prompt quality will mean for them, I mean the next couple of generations of LLMs, as of April 2024. The next wave is, apparently, imminent, and I am not one to speculate on what exactly those models will look like, but I assume they will in many ways be similar to the current best models like GPT-4 or Claude-3, only “smarter”. That means they will take text or visual inputs and return outputs in kind. They may be able to orchestrate more external tools—and even other LLMs—and execute more complex real-world actions, but they won’t be intrinsically agentic and autonomous. In other words, you will still have to ask these models to do something; they won’t just be out in the world doing whatever they want. Of course, we will have to wait and see, and maybe that will make what I say here irrelevant.
Fourth, throughout this series I’m going to assume the reader is moderately familiar with LLMs and prompts. I’m not generally going to explain well-known strategies in much detail, or give a lot of examples of prompts, because I’ll assume that the reader knows what these types of things look like and can discern how they would be applied.
The Question
Now turning to the actual question: will prompt quality become less important? Again, in this post I’m not going to directly answer this question. I will turn to that in later posts. Instead, in this post I’m going to lay the foundation for answering it.
To start, I find it more intuitive to think about prompt quality in terms of the types of strategies that can be used to make prompts better. If you start with a Prompt A that gives you 75% accuracy on some relevant metric, what strategies can you employ to write a Prompt B that gives you 95%?
With this framing, the question then becomes this: will these strategies remain necessary or useful as LLMs become more capable?
To answer this, we need to have a clear idea of what these strategies are. I’ll spend most of the remainder of this post working toward that clarity by outlining a taxonomy of prompting strategies.
Prompt Strategy Taxonomy
Very generally, I think we can say that there are 3 types of strategies for making prompts better:
Group A: Strategies that insert (mostly) task-agnostic content into a prompt.
Group B: Strategies that parameterize the task.
Group C: Strategies that add original, task-specific information: procedures, concepts, definitions, and context.
Let’s take these one by one.
Group A - Strategies that insert (mostly) task-agnostic content into a prompt.
Examples:
role-playing
arranging the components of a prompt in a specific order
There are a couple of ways of characterizing this group.
First, the strategies in this group can each be applied to a prompt without knowing much, if anything, about the specific task. To be sure, many of these strategies are specific to task types, but not to individual tasks. Telling the LLM to think step by step or compare possible responses will generally be useful for most reasoning tasks, regardless of what the specific problem is. Using XML tags around examples will generally be useful for prompts for Claude models, independent of the substance of the examples.
Second, they are all strategies that seem like they wouldn’t matter very much if you were giving instructions to a human—assuming the human is intelligent, knowledgeable in the relevant domain, focused, and willing to execute the task[3]. You would not expect a mathematician to write a better proof merely because you told (reminded) them “you are a mathematician” at the start.
One might think of these strategies as something like the “one weird trick”s of prompting. They function like a secret button that, when pressed, will get you a better quality response from the LLM, without meaningfully modifying the task instruction itself. They unlock LLM capabilities that are readily available, but (perhaps puzzlingly) are not used unless elicited through these strategies.
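As a toy illustration (my own, with placeholder phrasings), mechanically bolting a few group A strategies onto an arbitrary task instruction might look something like this:

def apply_group_a(task_instruction: str, examples: list[str]) -> str:
    """Wrap any task instruction in task-agnostic boilerplate:
    a role, XML-tagged examples, and a step-by-step nudge."""
    example_block = "\n".join(f"<example>{e}</example>" for e in examples)
    return (
        "You are an expert assistant.\n"          # role-playing
        + example_block + "\n"                    # examples wrapped in XML tags
        + task_instruction + "\n"                 # the task itself, untouched
        + "Think step by step before answering."  # reasoning nudge
    )

Nothing about the task itself changes; the same wrapper could be bolted onto a classification prompt, a math problem, or a summarization request.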
Group B - Strategies that parameterize the task.
Examples:
define the length of the output (e.g., 500 words)
say which programming language to write the code in (e.g., Javascript)
specify output format (e.g., bullet points)
specify the tone (e.g., professional)
By “parameterize”, I mean something analogous to parameters in a function in code: concrete details that specify the inputs and/or outputs.
To illustrate, let’s look at OpenAI’s guide to prompting, which has an instructive example of a “worse” prompt and its “better” counterpart:
Worse: How do I add numbers in Excel?
Better: How do I add up a row of dollar amounts in Excel? I want to do this automatically for a whole sheet of rows with all the totals ending up on the right in a column called "Total".
Hopefully it’s clear that what makes the “better” prompt better is that it contains more details about the task. It explains the type of numbers to be added (dollars), what form they will be in (a row), the output form (a column called “Total”), etc.
Sticking to the function analogy, the first prompt is like writing a function header like this:
def add(numbers):
Whereas the second prompt is like writing this:
def add(row, output_column_name, whole_sheet=True):
If you had to fill in the function logic for each, the second header is much more likely to lead to the most useful function for this particular task, because it’s parameterized with the relevant information.
As another example, let’s take a prompt:
“Summarize this”
It’s easy to come up with parameterized versions of this prompt which are better:
“Summarize this in 300 words”
“Summarize this research paper and cite the source in MLA”
“Write a 1-2 page summary of this textbook in a style suitable for a high school history class reading assignment.”
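Sticking with the function analogy, here is a rough sketch (the parameter names are mine, not any standard) of what parameterizing the summarization prompt might look like:

def build_summary_prompt(text: str,
                         length: str = "300 words",
                         audience: str = "a general reader",
                         extra: str = "") -> str:
    """Turn the bare 'Summarize this' into a parameterized template."""
    prompt = f"Summarize the following text in {length}, in a style suitable for {audience}."
    if extra:
        prompt += f" {extra}"
    return prompt + "\n\n" + text

# Roughly the third example above:
# build_summary_prompt(chapter_text, length="1-2 pages",
#                      audience="a high school history class reading assignment")

Each argument only helps if it reflects something you actually know about the task, which is exactly what distinguishes group B from group A.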
In contrast with the strategies in group A, group B strategies cannot be applied to just any prompt. You cannot make a whole class of prompts better by simply adding the string “in 300 words” to the end; it only works if you actually want the output to be 300 words. A group A strategy like “think step by step”, on the other hand, will work for nearly any reasoning task.
Group B strategies are task-specific, because they require additional knowledge about the task that is not discernible from the prompt. The “better” prompts in the examples above are not mere reformulations of the “worse” prompts with some special sentences thrown in that predictably yield better results—the difference in the prompts is that the “better” ones contain additional information about the task.
Group C - Strategies that add original, task-specific information: procedures, concepts, definitions, and context.
Examples:
define an ontology for intent classification
provide context and guidance to generate an interesting research question
construct representative and diverse examples
decompose a complex task into discrete steps
Group C strategies, like group B, entail adding more information to a prompt. The distinguishing characteristics of the type of information I have in mind with group C are most easily expressed in negative terms: it’s information that is not merely a collection of facts; it’s information that is not quantifiable; it’s information that is not succinctly statable. In more positive terms, the information in group C strategies requires some creativity and deep understanding of the task to formulate.
As a first example, suppose you are using an LLM to label texts for an intent classification problem. To do this, your prompt will need to define an ontology of intents. In some projects an ontology might be as simple as a 1 sentence definition of each class. But in other cases ontologies can be a lot more: think hundreds or even thousands of words of definition, along with dozens of carefully chosen examples to illustrate the core features of the class and its edge cases, as well as complex hierarchical relations between classes. Constructing a good definition of a class requires a fine understanding of semantic ambiguity, as well as sensitivity to other constraints like the external motivation for the project or the other technological components the classifier will be integrated with.
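As a deliberately simplified sketch (the classes and definitions below are invented, and real ontologies are far larger), the skeleton of such a prompt might look like this:

# Toy skeleton of an ontology-bearing classification prompt. In practice the
# class definitions, edge cases, and examples are where most of the work goes.
ONTOLOGY = {
    "cancel_subscription": (
        "The customer wants to stop the service entirely. Asking how to cancel, "
        "without committing to it, still counts. Does NOT include pausing or downgrading."
    ),
    "downgrade_plan": (
        "The customer wants to keep the service but on a cheaper tier. "
        "'Remove a seat' counts; 'delete my account' does not."
    ),
}

def build_classification_prompt(message: str) -> str:
    definitions = "\n".join(f"- {name}: {text}" for name, text in ONTOLOGY.items())
    return (
        "Classify the customer message into exactly one of the intents defined below.\n"
        f"{definitions}\n\n"
        f"Message: {message}\nIntent:"
    )

Even this toy version hints at where the difficulty lies: deciding what belongs in each definition, and where one class ends and another begins, is domain work that no generic prompting trick supplies.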
Another example is prompting an LLM to come up with a good[4] domain-specific research question. In order to do this you might need to supply a relevant literature review, choose a theoretical framework, identify gaps in the field, and characterize what a useful research question in the domain looks like.
The information needed to write good prompts of these types often requires synthesis of different sources of information, experience in the field, and the ability to make usually implicit principles explicit. It’s often information that is hard to come by and clearly state, even for the user. It often requires a lot of work and thought and experimentation to come up with, and then put into a clear and understandable form in a prompt.
The examples given for this group are necessarily left in the abstract, because it is very hard to come up with a concrete example without coming up with an actual complex task in the first place, which is itself very hard to do[5]. The strategies in this group are not so much a collection of specific strategies like in groups A and B, but more the application of thought and subject-matter expertise to complex and original problems. This group is admittedly the most difficult to define concretely, and very likely finer distinctions could (and perhaps will) be made, but this will work for the present purposes.
Wrap-up
The taxonomy above is, again, tentative. There are probably strategies that straddle more than one of these groups, and there might be some strategies that are hard to categorize. Nonetheless I think it is a useful starting point for thinking about whether certain prompting strategies will become less important as LLMs become more capable.
In the coming posts I’ll talk about these groups of strategies in more detail, and try to determine whether they will remain important, in view of current research and related trends. I’ll also talk about things like automatic prompting and new ideas for ways to work more effectively with LLMs.
[1] It seems that for some people, the notion of there existing a job title of “prompt engineer” is extremely offensive. And even more so if the position is highly paid. I can, but will not here, speculate as to why this is. At any rate, my job title is not “prompt engineer” so I have no skin in this game.
[2] My impression is that much of the internet commentary that prompt engineering is trivial and/or overrated comes from people who have never had to write a prompt for a difficult programmatic use case.
[3] Big assumptions, to be sure. It actually would be great to do research on all the different prompt strategies that have been studied in the literature, but instead of testing them on LLMs, test them on humans. Probably there are many cases where telling a person to “think step by step” will yield better results.
[4] It’s important to emphasize “good” here, because LLMs can trivially come up with research questions about a lot of things. But coming up with good ones is another matter.
[5] But I am not just speculating that these types of tasks and prompts exist—I know they do and I have worked on them, and I assume others in the field have too, though it’s rare to find real examples out in the wild. I can’t share the exact things I’ve worked on because they are my employer’s IP, but in later posts if/where it becomes necessary, I will provide artificial examples.