[Research 🔬] Chain of thought reasoning in language models (by Google Brain)

Blog post: Google AI Blog: Language Models Perform Reasoning via Chain of Thought
arXiv: https://arxiv.org/abs/2201.11903 (Chain of Thought Prompting Elicits Reasoning in Large Language Models)
Google I/O keynote: Google I/O 2022: Advancing knowledge and computing

In “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” we explore a prompting method for improving the reasoning abilities of language models. Called chain of thought prompting, this method enables models to decompose multi-step problems into intermediate steps. With chain of thought prompting, language models of sufficient scale (~100B parameters) can solve complex reasoning problems that are not solvable with standard prompting methods.
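To make the method concrete, here is a minimal sketch of a chain of thought prompt in Python, using the worked exemplar from Figure 1 of the paper; printing the prompt is just a stand-in for sending it to a text-completion model:

```python
# A few-shot chain of thought prompt: the exemplar demonstrates the
# intermediate reasoning steps before the final answer, so the model
# imitates that format on the new question. Text from Figure 1 of the paper.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

# A sufficiently large model continues with its own chain of thought, e.g.:
# "The cafeteria had 23 apples originally. They used 20 to make lunch. So
#  they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9.
#  The answer is 9."
print(COT_PROMPT)
```

With standard prompting, the exemplar answer would be just “The answer is 11.” and the model would emit only a final answer; the chain of thought exemplar is the entire difference between the two methods.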

Abstract

Although scaling up language model size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as math word problems, symbolic manipulation, and commonsense reasoning. This paper explores the ability of language models to generate a coherent chain of thought—a series of short sentences that mimic the reasoning process a person might have when responding to a question. Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves. When combined with the 540B parameter PaLM model, chain of thought prompting achieves a new state of the art of 58.1% on the GSM8K benchmark of math word problems.

Language models. We experiment on two collections of dense left-to-right, decoder-only transformer language models. The first collection is LaMDA (Thoppilan et al., 2022; we use the pretrained checkpoint only, with no dialog finetuning), which has models of 422M, 2B, 8B, 68B, and 137B parameters. The second collection is PaLM (Chowdhery et al., 2022), which has sizes of 8B, 62B, and 540B parameters. We sample from the models using greedy decoding, though follow-up work has shown that chain of thought prompting can be improved by a large margin by taking the majority final answer over many sampled generations (Wang et al., 2022). For LaMDA, we report results averaged over five random seeds, where each seed had a different randomly shuffled order of exemplars. As the LaMDA experiments did not show large variance among seeds, for PaLM we report results for a single random seed.
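The majority-vote idea from Wang et al. (2022), often called self-consistency, is simple to sketch. In this hypothetical Python outline, `sample_model` stands in for one temperature-sampled completion from the model, and the answer extraction assumes generations end with “The answer is X.” as in the prompt exemplars:

```python
import re
from collections import Counter

def extract_answer(generation: str) -> str | None:
    """Pull the final answer from a chain of thought generation, assuming
    the exemplars end each answer with 'The answer is X.'"""
    matches = re.findall(r"The answer is\s+([0-9,.\-]+)", generation)
    return matches[-1].rstrip(".").replace(",", "") if matches else None

def self_consistency(sample_model, prompt: str, num_samples: int = 40) -> str | None:
    """Majority vote over sampled chains of thought (Wang et al., 2022).

    `sample_model(prompt)` is a hypothetical stand-in for one
    temperature-sampled completion; the paper's greedy decoding
    corresponds to a single temperature-0 sample.
    """
    answers = []
    for _ in range(num_samples):
        answer = extract_answer(sample_model(prompt))
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None
```

The intuition is that a correct answer can be reached through several distinct reasoning paths, while errors tend to scatter, so the modal final answer is more reliable than any single greedy decode.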

8. Conclusions

We have explored chain of thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. Through experiments on arithmetic, symbolic, and commonsense reasoning, we find that chain of thought reasoning is an emergent property of model scale that allows sufficiently large language models to perform reasoning tasks that otherwise have flat scaling curves. Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.
