Google’s Chain of Thought Prompting can boost today’s best algorithms

Google announced groundbreaking research in natural language processing called Chain of Thought Prompting that elevates the state of the art of advanced technologies like PaLM and LaMDA to what researchers call a remarkable level.

The fact that Chain of Thought Prompting can improve PaLM and LaMDA by such significant margins is a big deal.

LaMDA and PaLM

The researchers ran experiments using two language models, the Language Model for Dialogue Applications (LaMDA) and the Pathways Language Model (PaLM).

LaMDA is a conversation-oriented model, like a chatbot, but it can also be used for many other applications that require dialogue.

PaLM is a model that follows what Google calls the Pathways AI architecture where a language model is trained to learn how to solve problems.

It used to be that machine learning models were trained to solve one type of problem and were then turned loose to do that one thing very well. But to do anything else, Google would have to train a new model.

The Pathways AI architecture is a way to create a model that can solve problems it hasn’t necessarily encountered before.

As quoted in the Google PaLM Explainer:

“…we would like to train a model that can not only handle many separate tasks, but also take advantage of its existing skills and combine them to learn new tasks faster and more efficiently.”

What it does

The research paper lists three important breakthroughs for chain of thought reasoning:

  1. It allows language models to break down complex multi-step problems into a sequence of steps
  2. The chain of thought lets engineers look into the model's reasoning, so when an answer is wrong they can identify where the reasoning went wrong and fix it
  3. It can solve math word problems, perform commonsense reasoning, and, according to the research paper, can (in principle) solve any word-based problem a human can.

Multi-step reasoning tasks

The research gives an example of a multi-step reasoning task on which language models are tested:

“Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

A: The cafeteria originally had 23 apples. They used 20 to make lunch. So they had 23 – 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.”
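To make the difference concrete, here is a minimal sketch of how a standard prompt and a chain of thought prompt might be assembled from the cafeteria example above. The helper `build_prompt`, the constant names, and the follow-up question are hypothetical and chosen only for illustration; no real model API is called, and the research paper's own prompts may be structured differently.

```python
# Illustrative sketch only: all names below are hypothetical, and the
# resulting strings would still need to be sent to a language model.

EXEMPLAR_QUESTION = (
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)

# Standard prompting: the worked example shows only the final answer.
STANDARD_ANSWER = "The answer is 9."

# Chain of thought prompting: the worked example spells out each step.
CHAIN_OF_THOUGHT_ANSWER = (
    "The cafeteria originally had 23 apples. They used 20 to make lunch. "
    "So they had 23 - 20 = 3. They bought 6 more apples, so they have "
    "3 + 6 = 9. The answer is 9."
)

# A made-up follow-up question the model would be asked to answer.
NEW_QUESTION = (
    "A shelf holds 15 books. If 7 are borrowed and 4 are returned, "
    "how many books are on the shelf?"
)


def build_prompt(exemplar_answer: str, new_question: str) -> str:
    """Put one worked example in front of the question to be answered."""
    return (
        f"Q: {EXEMPLAR_QUESTION}\n"
        f"A: {exemplar_answer}\n\n"
        f"Q: {new_question}\n"
        "A:"
    )


standard_prompt = build_prompt(STANDARD_ANSWER, NEW_QUESTION)
cot_prompt = build_prompt(CHAIN_OF_THOUGHT_ANSWER, NEW_QUESTION)

print(cot_prompt)
```

The only difference between the two prompts is the worked answer: the chain of thought version shows the intermediate arithmetic, which is what nudges the model to write out its own steps for the new question.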

PaLM is a state-of-the-art language model that is part of the Pathways AI architecture. It is so advanced that it can explain why a joke is funny.

Yet as advanced as PaLM is, the researchers say Chain of Thought Prompting significantly improves these models, and that’s what makes this new research so noteworthy.

Google explains it this way:

“Chain-of-thought reasoning allows models to break down complex problems into intermediate steps that are solved individually.

Moreover, the linguistic nature of the chain of thought makes it applicable to any task that a person might solve via language.”

The research paper goes on to note that standard prompting does not really improve when the scale of the model is increased.

However, with Chain of Thought Prompting, increasing scale has a significant and noticeable positive impact on model performance.


Chain of Thought Prompting was tested on LaMDA and PaLM, using two math word problem datasets.

These datasets are used by researchers as a way to compare results on similar problems for different language models.

Below are images of graphs showing the results of using Chain of Thought Prompting on LaMDA.

Scaling LaMDA on the MultiArith dataset with standard prompting yields only a modest improvement. But LaMDA scores significantly higher when scaled with Chain of Thought Prompting.

The results on the GSM8K dataset show a modest improvement.

It’s a different story with the PaLM language model.

Chain of Thought Prompting and PaLM

As can be seen in the graph above, the gains from scaling PaLM with Chain of Thought Prompting are huge, and they are huge for both datasets (MultiArith and GSM8K).

The researchers call these results remarkable and a new state of the art:

“On the GSM8K dataset of math word problems, PaLM shows remarkable performance when scaled to 540B parameters.

…combining chain of thought prompting with the 540B parameter PaLM model leads to new state-of-the-art performance of 58%, surpassing the prior state of the art of 55% achieved by fine-tuning GPT-3 175B on a large training set and then ranking potential solutions via a specially trained verifier.

Moreover, follow-up work on self-consistency shows that the performance of chain of thought prompting can be improved further by taking the majority vote of a broad set of generated reasoning processes, which results in 74% accuracy on GSM8K.”
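The majority-vote idea mentioned in the quote can be sketched in a few lines. This is an illustrative sketch, not the researchers' implementation: the function names, the "The answer is N." answer format, and the three hard-coded sample completions are assumptions made for the example, where a real system would sample many reasoning chains from the model.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Pull the number out of a trailing 'The answer is N.' sentence."""
    match = re.search(r"The answer is (-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None


def majority_vote(completions: List[str]) -> Optional[str]:
    """Return the most common final answer across sampled reasoning chains."""
    answers = [
        answer
        for answer in (extract_answer(c) for c in completions)
        if answer is not None
    ]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampled completions for the cafeteria question; the third
# one contains a reasoning slip and lands on the wrong answer.
samples = [
    "They had 23 - 20 = 3. Then 3 + 6 = 9. The answer is 9.",
    "23 - 20 = 3 and 3 + 6 = 9. The answer is 9.",
    "23 + 6 = 29, and 29 - 20 = 8. The answer is 8.",
]

print(majority_vote(samples))  # prints 9
```

Because the vote is taken over final answers rather than over the reasoning text, a single chain with a faulty step is outvoted as long as most chains reach the correct result.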


The conclusion of a research paper is one of the most important parts to check to understand whether the research advances the state of the art or is a dead end or needs more research.

The conclusion section of the Google research paper strikes a very positive note.

It notes:

“We have explored chain of thought prompting as a simple and broadly applicable method for enhancing reasoning in language models.

Through experiments on arithmetic, symbolic, and commonsense reasoning, we find that chain of thought processing is an emergent property of model scale that allows sufficiently large language models to perform reasoning tasks that otherwise have flat scaling curves.

Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.”

This means that Chain of Thought Prompting may give Google the ability to significantly improve its various language models, which in turn may lead to significant improvements in the kinds of things Google can do.


Read the Google AI article

Language Models Perform Reasoning via Chain of Thought

Download and read the research paper

Chain of Thought Prompting Elicits Reasoning in Large Language Models (PDF)
