Production ML Papers to Know

Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Due to the positive reception, we decided to turn these into blog posts.

Background: chain-of-thought prompting

This paper builds on an earlier approach to prompt engineering: chain-of-thought prompting.

In a standard approach to prompting, you might provide some example input / output pairs as part of the prompt, with the hope that the model will be able to generalize to the new input you care about. However, for complicated input / output mappings, standard prompting often leads to poor generalization: with only a few examples, the model can’t figure out the pattern that relates inputs and outputs.
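To make that concrete, here is a minimal sketch of a standard few-shot prompt for math word problems. The example questions and the `call_llm` helper are illustrative placeholders, not the paper's exact prompts or any particular API:

```python
# A sketch of standard few-shot prompting: the prompt contains only
# input/output pairs, with no reasoning shown between question and answer.
standard_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""

# `call_llm` is a hypothetical stand-in for whatever completion API you use.
# answer = call_llm(standard_prompt)
```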

Chain-of-thought prompting refers to providing a sequence of logical steps to get from input to output alongside each example. For example, in the figure above, rather than just providing a word problem alongside the expected answer, the authors also show the LLM how to break the word problem into a sequence of smaller problems and solve those sequentially.
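In code, the only difference from the standard prompt above is that each demonstration now spells out its intermediate steps (again a sketch, with `call_llm` as a hypothetical placeholder):

```python
# A sketch of chain-of-thought prompting: the demonstration includes the
# reasoning that connects the question to the answer, not just the answer.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""

# answer = call_llm(cot_prompt)
```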

The surprising finding from the paper, which has proven to be consistently true in practice as well, is that this style of prompting leads to better performance and generalization. LLMs respond well to seeing examples of the reasoning behind the answer, not just the answer itself.

From chain-of-thought to least-to-most prompting

Chain-of-thought prompting is an improvement over standard prompting, but it struggles with easy-to-hard generalization, where the model is asked to solve problems harder than the examples provided in the prompt.

That’s where least-to-most prompting comes in. The approach works in two stages:

  • Stage 1: Problem reduction. Split the complex problem into a set of easier subproblems
  • Stage 2: Sequential subproblem solving. Solve each subproblem in turn, with the answer to each helping to solve the next

Each stage requires its own examples. In problem reduction, the prompt passed to the model contains demonstrations of breaking a problem into pieces. In sequential subproblem solving, the prompts show how to answer subproblems, and how each subproblem's answer feeds into the next one to be solved.
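A minimal sketch of the two-stage loop might look like the following. The demonstrations are abbreviated, `call_llm` is a hypothetical stand-in for your completion API, and the parsing is deliberately naive; the paper's actual prompts are longer and task-specific:

```python
# Few-shot demonstration for Stage 1 (problem reduction).
DECOMPOSITION_EXAMPLES = """\
Q: Elsa has 5 apples. Anna has 2 more apples than Elsa.
How many apples do they have together?
A: To answer this, we need to first answer: "How many apples does Anna have?"
"""

# Few-shot demonstration for Stage 2 (sequentially solving subproblems).
SOLVING_EXAMPLES = """\
Elsa has 5 apples. Anna has 2 more apples than Elsa.
Q: How many apples does Anna have?
A: Anna has 2 more apples than Elsa, so Anna has 5 + 2 = 7 apples.
Q: How many apples do Elsa and Anna have together?
A: Elsa has 5 apples and Anna has 7 apples, so together they have 5 + 7 = 12 apples.
"""

def least_to_most(problem, question, call_llm):
    # Stage 1: ask the model to reduce the final question to subquestions.
    reduction_prompt = (
        DECOMPOSITION_EXAMPLES
        + f"\nQ: {problem} {question}\nA: To answer this, we need to first answer:"
    )
    subquestions = [
        line.strip().strip('"')
        for line in call_llm(reduction_prompt).splitlines()
        if line.strip()
    ]

    # Stage 2: solve each subquestion in turn, appending every answer to the
    # context so later subquestions (and the final question) can build on it.
    context = SOLVING_EXAMPLES + f"\n{problem}\n"
    answer = ""
    for sub in subquestions + [question]:
        context += f"Q: {sub}\nA:"
        answer = call_llm(context)
        context += f" {answer}\n"
    return answer
```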

The figure below, from the paper, illustrates an example.

Complex reasoning across a range of tasks

The approach was tested on symbolic manipulation, compositional generalization, and math reasoning challenges - and the results “show that least-to-most prompting can indeed generalize to problems harder than those demonstrated.”

For symbolic manipulation, the paper used the last-letter-concatenation task, where the input is a list of words and the output is the concatenation of the last letters of the words in the list. This is a task that is trivial for humans but hard for traditional LLMs.

Here, the subproblems are incrementally longer sublists: each step adds the next word to the list handled so far. The lists used in the demonstrations contain at most three words, while the test lists contain four or more.
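For intuition, the task itself is a one-liner, and the least-to-most decomposition just grows the list one word at a time. The word list below is illustrative rather than taken from the paper:

```python
# The last-letter-concatenation task, computed directly.
def last_letter_concat(words):
    return "".join(word[-1] for word in words)

# For least-to-most prompting, the test list is decomposed into incrementally
# longer sublists; each sublist's answer is fed back into the prompt before
# asking about the next, longer sublist.
words = ["think", "machine", "learning", "reasoning"]
sublists = [words[:i] for i in range(1, len(words) + 1)]
print([last_letter_concat(s) for s in sublists])  # ['k', 'ke', 'keg', 'kegg']
```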

The table below shows that least-to-most prompting significantly outperforms the baseline approaches of standard prompting and chain-of-thought prompting, especially as the list length increases.

For math reasoning, the paper uses the numerical reasoning subset of DROP (which contains 5,850 problems) and the GSM8K dataset (which contains linguistically diverse grade school math word problems).

With an additional baseline method, zero-shot prompting, included in the comparison, least-to-most prompting still outperforms the other approaches, as the table below shows.

And the paper's authors believe their approach would perform even better on math reasoning tasks that require more steps to solve.

Least-to-most prompting: from prompt magic to prompt engineering?

Least-to-most prompting does have some limits: for example, accuracy tails off as the symbolic manipulation task increases in difficulty.

Furthermore, not all problems can be solved by least-to-most prompting: some are not reducible to subproblems, or are hard to decompose.

But being able to demonstrate easy-to-hard generalization has been an inspirational result in the emerging field of prompt engineering.

Check out the paper here.