Original Paper: https://arxiv.org/abs/2305.02301
By: Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
Abstract
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve performance comparable to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so while requiring less training data than finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of the available data on a benchmark, whereas standard finetuning of the same T5 model struggles to match it even when using 100% of the dataset.
Summary Notes
Figure: Overview of Distilling step-by-step. We first use CoT prompting to extract rationales from an LLM. We then use the generated rationales to train small task-specific models within a multi-task learning framework. Task prefixes are prepended to the input examples, and the model is trained to produce either the label or the rationale depending on the given prefix.
Introduction
Deploying large language models (LLMs) like GPT-3 or PaLM in real-world applications remains a significant challenge due to their massive memory and computing requirements.
These models, while powerful, are often impractical for many engineering teams due to their resource-intensive nature.
This blog post explores Distilling Step-by-Step, an approach that not only mitigates these deployment challenges but also improves performance using smaller models trained on less data.
The Research Question
The primary question driving this research is: Can we train smaller, task-specific models that outperform LLMs using significantly less training data?
Traditional methods like finetuning and distillation require large datasets, either human-labeled or generated by LLMs, to match the performance of these giant models.
This study introduces a novel mechanism that leverages LLM-generated rationales as additional supervision, aiming to reduce both the size of the deployed models and the amount of training data required.
Key Methodologies
Distilling Step-by-Step
The core idea behind Distilling Step-by-Step is to change how we use LLMs: instead of treating them as mere sources of labels, we treat them as agents capable of reasoning. The approach involves two main steps:
- Extracting Rationales from LLMs:
- Using Chain-of-Thought (CoT) prompting, the LLM generates intermediate rationales that justify its predicted labels. These rationales offer rich, task-relevant information that can be used to train smaller models (see the prompt sketch after this list).
- For example, when asked, "Jesse’s room is 11 feet long and 15 feet wide. If she already has 16 square feet of carpet, how much more carpet does she need to cover the whole floor?", the LLM can produce a rationale like "Area = length × width. Jesse’s room has 11 × 15 square feet," leading to the final answer of 11 × 15 − 16 = 149 square feet.
- Multi-Task Training Framework:
- The extracted rationales are used as additional supervision in a multi-task learning setup: the smaller model is trained both to predict the task label and to generate the rationale, with a task prefix prepended to the input to select between the two behaviors.
- The training loss is a weighted combination of the label prediction loss and the rationale generation loss, L = L_label + λ · L_rationale, so the model learns both tasks simultaneously (see the training sketch after this list).
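To make the first step concrete, here is a minimal sketch of rationale extraction via CoT prompting. The few-shot exemplar format follows the paper's idea, but `call_llm`, the exemplar text, and the "The answer is" parsing convention are illustrative assumptions, not the authors' exact implementation:

```python
# Step 1 (sketch): extract rationales from an LLM via few-shot CoT prompting.
# `call_llm` is a hypothetical stand-in for whatever LLM API you use; the
# paper used a 540B PaLM model, but any few-shot-capable LLM fits the pattern.

FEW_SHOT_COT_PROMPT = """\
Q: A garden is 8 feet long and 6 feet wide. How many square feet is it?
A: Area = length x width. The garden is 8 x 6 = 48 square feet.
The answer is 48.

Q: {question}
A:"""

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's client."""
    raise NotImplementedError

def extract_rationale_and_label(question: str) -> tuple[str, str]:
    """Prompt the LLM with a CoT exemplar, then split its completion into
    the intermediate rationale and the final label."""
    completion = call_llm(FEW_SHOT_COT_PROMPT.format(question=question))
    # By the exemplar's convention, the final line states the answer.
    rationale, _, answer_line = completion.strip().rpartition("The answer is")
    label = answer_line.strip().rstrip(".")
    return rationale.strip(), label
```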
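And here is a minimal sketch of the second step, assuming a Hugging Face `transformers` T5 student. The `[label]`/`[rationale]` prefixes mirror the paper's multi-task setup, but the exact prefix strings and the λ weighting here are illustrative:

```python
# Step 2 (sketch): multi-task training of a small T5 student with
# L = L_label + lambda * L_rationale, each a seq2seq cross-entropy loss
# selected by a task prefix on the input.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
lambda_rationale = 1.0  # weight on the rationale-generation loss (assumed)

def multitask_loss(question: str, label: str, rationale: str) -> torch.Tensor:
    """Combine the label-prediction and rationale-generation losses."""
    losses = []
    for prefix, target in (("[label]", label), ("[rationale]", rationale)):
        inputs = tokenizer(f"{prefix} {question}", return_tensors="pt")
        targets = tokenizer(target, return_tensors="pt").input_ids
        # Passing `labels` makes the model return its cross-entropy loss.
        losses.append(model(**inputs, labels=targets).loss)
    return losses[0] + lambda_rationale * losses[1]
```

At deployment time only the `[label]` prefix is used, so training the model to generate rationales adds no extra inference cost.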
Main Findings and Results
Data Efficiency
Distilling Step-by-Step significantly reduces the amount of training data required:
- On the e-SNLI dataset, the method surpassed standard finetuning trained on 100% of the data while using only 12.5% of the full dataset.
- Similar reductions in data requirements were observed across the other benchmarks: ANLI, CQA (CommonsenseQA), and SVAMP.
Model Size Reduction
The approach drastically reduces the size of the deployed models:
- Distilling Step-by-Step outperformed the 540B PaLM model using models up to 2000× smaller.
- For instance, a 220M T5 model trained with this method outperformed the few-shot CoT performance of the PaLM model on the e-SNLI dataset.
Performance Gains
The method consistently outperformed both standard finetuning and task distillation:
- Compared to these standard methods, Distilling Step-by-Step needed up to 85% fewer labeled training examples to match or surpass the performance of LLMs.
Implications and Applications
Real-World Deployability
This research has profound implications for the deployability of language models in practical applications:
- Cost Efficiency: Smaller models require less computational power and memory, making them more accessible for engineering teams with limited resources.
- Low Latency: Reduced model sizes translate to faster inference times, which is crucial for applications requiring real-time performance.
Potential Applications
- Natural Language Processing (NLP): Enhanced task-specific models for tasks like sentiment analysis, question answering, and textual entailment.
- Edge Computing: Deploying efficient models on edge devices where computational resources are limited.
- Educational Tools: Using smaller models to provide real-time feedback and explanations in educational apps.
Conclusion
Distilling Step-by-Step represents a significant leap forward in the efficient training and deployment of language models.
By leveraging LLM-generated rationales, this approach not only reduces the size and training data requirements of task-specific models but also enhances their performance.
This paradigm shift opens up new possibilities for deploying high-performing language models in a wide range of real-world applications.
As we move forward, further research could explore the impact of rationale quality on model performance and how to best generate and utilize these rationales across different tasks and domains.
The potential to democratize access to powerful language models through such innovations is immense, paving the way for more inclusive and widespread use of advanced AI technologies.