State of the Prompt
Authors: Strategic Machines
Is This Really Scalable?
With all the prototypes and custom apps we've built on the OpenAI platform, we believe we have some insights into the newly minted discipline of 'prompt engineering'. We've seen the press hyping salaries for 'prompt engineers.' We've read (and tested) plenty of academic articles on prompt design. We think there is real value in a well-designed set of instructions for an LLM, but there is a limit to what a prompt – even a heavily engineered prompt – can achieve.
We recently worked through an interesting idea in the prompt field called 'Tree of Thought'. It nets out to giving the LLM instructions that encourage deliberate, step-by-step reasoning when solving a problem. We ran the tests multiple times on the same problems, then changed up both the problems and the prompts to see whether we could improve the quality of the outcomes. Our conclusion was that there is little hope this approach would be scalable in any large (or even mid-size) organization; results were difficult to replicate.
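For readers who want a concrete picture of the kind of deliberation prompt we mean, here is a minimal sketch assuming the current OpenAI Python SDK. The model name and instruction wording are illustrative only, not the exact prompts we tested.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative "deliberate reasoning" instruction, not our exact test prompt.
SYSTEM = (
    "Solve the problem by reasoning step by step. "
    "Write out each intermediate step before stating a final answer."
)

def ask(problem: str) -> str:
    # Single forward pass: one prompt in, one reasoned answer out.
    response = client.chat.completions.create(
        model="gpt-4",  # any chat-capable model works here
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": problem},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```

In our runs, small changes to wording like this produced noticeably different results across repetitions, which is the replication problem we are describing.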
But we want to provide an excerpt from the article written by researchers from Princeton and Google. It is worth thinking about – but we believe there are better ways to invest time and energy in harnessing the great capabilities of LLMs. Stay tuned for 2024!
It is more or less commonly admitted that bare LLMs fail on involved, multi-step reasoning problems. And because of lack of access to large models and the cost-driven model inertia (it's too expensive to retrain giant models), one has to use a fixed model and find better ways to prompt it. The starting point of this paper is to notice that, as elaborate as they can be, all common prompting strategies for LLMs are "forward-only": input-output prompting starts with one or a few examples of input-output texts and generates a single output; Chain-of-Thought prompting goes a step further to deal with more complex input-output relations by prompting the model not only with input-output examples but with input-step1-step2-...-step#-output examples, which leads the model to reproduce a similar reasoning; Self-Consistency with CoT, which is tailored for tasks where a ground-truth answer is expected, generates multiple CoT outputs and then takes a majority vote to define its final output.

In contrast, when we want to solve a problem, we reason in multiple steps, exploring at each step the multiple options at hand, choosing the one that seems most promising at that point but potentially abandoning it later if it leads to a dead end, then exploring the second most promising one, and so on. In essence, human reasoning follows a tree structure, where we combine a mix of breadth-first search and depth-first search, depending on our intuition, to search for the solution. This paper emulates this process, for tasks where it is tractable, via a prompting technique the authors call Tree of Thoughts – ToT (Section 3 contains a more rigorous presentation of the implementation of their method).

On Game of 24, a game where the goal is to use 4 given numbers to obtain 24 using basic arithmetic operations: "while GPT-4 with chain-of-thought prompting only solved 4% of tasks, [their] method achieved a success rate of 74%." They used ToT with BFS and a breadth of 5. The second-best (non-ToT) prompting strategy reached 49%.
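For concreteness, here is a minimal sketch of what ToT-style breadth-first search looks like in code. The `propose` and `evaluate` callables, which would each wrap LLM calls, are our own illustrative interface, not the authors' implementation.

```python
def tree_of_thoughts_bfs(problem, propose, evaluate, breadth=5, depth=3):
    """Breadth-first search over partial 'thoughts', in the spirit of ToT.

    propose(problem, state)  -> list of candidate next steps (strings)
    evaluate(problem, state) -> heuristic score from the LLM (higher is better)
    Both callables are assumed to wrap LLM calls; their prompts are left abstract.
    """
    frontier = [""]  # each state is the reasoning text accumulated so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for step in propose(problem, state):
                candidates.append(state + step + "\n")
        if not candidates:
            break
        # Keep only the `breadth` most promising partial solutions, as judged
        # by the LLM-based evaluator (the paper used breadth 5 for Game of 24).
        candidates.sort(key=lambda s: evaluate(problem, s), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0] if frontier else None
```

A depth-first variant would instead expand only the single most promising candidate and backtrack when the evaluator flags a dead end. Note that every proposal and every evaluation is a separate model call, which is part of why we question how well this scales in practice.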