This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (finetuning language models on a collection of tasks described via instructions) substantially boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 19 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key to the success of instruction tuning.
Language models (LMs) at scale, such as GPT-3 (Brown et al., 2020), have been shown to perform few-shot learning remarkably well. They are less successful at zero-shot learning, however. For example, GPT-3’s zero-shot performance is much worse than few-shot performance on tasks such as reading comprehension, question answering, and natural language inference. One potential reason is that, without few-shot exemplars, it is harder for models to perform well on prompts that are not similar to the format of the pretraining data.
Our empirical results underscore the ability of language models to perform tasks described using natural language instructions. More broadly, as shown in Figure 2, instruction tuning combines appealing characteristics of the pretrain–finetune and prompting paradigms by using supervision via finetuning to improve the ability of language models to respond to inference-time text interactions.
Model architecture and pretraining. In our experiments, we use a dense left-to-right, decoder-only transformer language model of 137B parameters. This model is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.81T BPE tokens with a vocabulary of 32K tokens using the SentencePiece library (Kudo & Richardson, 2018). Approximately 10% of the pretraining data was non-English. This dataset is not as clean as the GPT-3 training set and also has a mixture of dialog and code, and so we expect the zero- and few-shot performance of this pretrained LM on NLP tasks to be slightly lower. We henceforth refer to this pretrained model as Base LM. This same model was also previously used for program synthesis (Austin et al., 2021).
Instruction tuning procedure. FLAN is the instruction-tuned version of Base LM. Our instruction tuning pipeline mixes all datasets and randomly samples examples from each dataset. Some datasets have more than ten million training examples (e.g., translation), and so we limit the number of training examples per dataset to 30,000. Other datasets have few training examples (e.g., CommitmentBank only has 250), and so to prevent these datasets from being marginalized, we follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3,000. We finetune all models for 30,000 gradient updates at a batch size of 8,192 using the Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5. The input and target sequence lengths used in our finetuning procedure are 1024 and 256, respectively. We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special end-of-sequence token.
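To make the mixing scheme concrete, the following is a minimal sketch of examples-proportional mixing with a rate maximum, in which each dataset is sampled with probability proportional to min(dataset size, rate maximum). It is an illustration under stated assumptions, not the pipeline used in our experiments: the function names, the dataset names, and all dataset sizes other than CommitmentBank's 250 are hypothetical.

```python
import random

EXAMPLES_PER_DATASET_LIMIT = 30_000  # per-dataset training example limit from the text
MIXING_RATE_MAX = 3_000              # mixing rate maximum from the text


def mixing_rates(dataset_sizes, rate_max=MIXING_RATE_MAX,
                 size_limit=EXAMPLES_PER_DATASET_LIMIT):
    """Return a sampling probability for each dataset.

    Each dataset is first truncated to `size_limit` examples, then its
    weight is clipped to `rate_max`, and the weights are normalized.
    """
    weights = {}
    for name, size in dataset_sizes.items():
        limited = min(size, size_limit)        # cap very large datasets
        weights[name] = min(limited, rate_max)  # clip the mixing weight
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_dataset(rates, rng):
    """Draw one dataset name according to the mixing rates."""
    names = list(rates)
    return rng.choices(names, weights=[rates[n] for n in names], k=1)[0]


# Hypothetical sizes, except CommitmentBank (250 examples, as stated above).
sizes = {
    "translation": 10_000_000,
    "commitmentbank": 250,
    "example_qa": 10_000,
}
rates = mixing_rates(sizes)

rng = random.Random(0)  # fixed seed so the draw is reproducible
batch_sources = [sample_dataset(rates, rng) for _ in range(8)]
```

With these illustrative sizes the clipped weights are 3,000, 250, and 3,000, so CommitmentBank receives a mixing rate of 250 / 6,250 = 4%. Under purely size-proportional sampling of the same three datasets it would receive roughly 0.0025%, which is the marginalization the rate maximum is meant to prevent.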