Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Original Paper: https://arxiv.org/abs/2401.16380
By: Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly
Abstract:
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased.
Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained.
This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web.
In this work, we propose Web Rephrase Augmented Pre-training (WRAP) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases.
First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ∼3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%.
Second, we investigate the impact of the re-phrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings.
Our gains are attributed to the fact that re-phrased synthetic data has higher utility than just real data because it
(i) incorporates style diversity that closely reflects downstream evaluation style, and
(ii) has higher 'quality' than web-scraped data.
Summary Notes
Figure: (a) WRAP Recipe: We prompt an off-the-shelf instruction-tuned model to rephrase articles on the web, and pre-train an LLM on a mixture of real and synthetic data.
(b) Zero-shot performance of GPT 1.3B models trained on combinations of C4 and synthetic variations. Each step corresponds to a batch of 1M samples.
(c) Weighted average perplexity over 21 sub-domains of the Pile for varying model sizes and amounts of pre-training data.
Introduction
Training large language models (LLMs) typically involves massive datasets scraped from the web.
This approach demands extensive compute and large volumes of high-quality data; compute is costly, and high-quality web text is becoming increasingly scarce.
A recent study, "Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling," proposes an innovative solution to these challenges.
The researchers introduce Web Rephrase Augmented Pre-training (WRAP), a method that leverages synthetic data generated by rephrasing web documents in various styles to enhance the efficiency of LLM training.
Key Methodologies
The WRAP approach uses an off-the-shelf instruction-tuned model to rephrase web documents into different styles, such as "like Wikipedia" or "in a question-answer format."
This synthetic data is then combined with real web data to pre-train LLMs. The methodology can be summarized in the following steps:
- Data Rephrasing: Using models like Mistral-7B, web documents are rephrased into specific styles. For instance, documents can be transformed to resemble high-quality Wikipedia articles or converted into question-answer formats.
- Combining Data: The synthetic rephrased data is mixed with real web data in various ratios to form the training corpus (a rough sketch of the rephrase-and-mix pipeline follows this list).
- Model Training: LLMs of different sizes (small, medium, and XL) are pre-trained on this combined dataset and evaluated on both perplexity and zero-shot task performance.
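The sketch below illustrates the data-preparation side of this recipe in Python, assuming a Hugging Face transformers text-generation pipeline with a Mistral-7B instruct checkpoint. The prompt wording, the specific checkpoint name, and the default 1:1 mixing ratio are illustrative assumptions, not the paper's exact prompts or settings.

```python
import random
from transformers import pipeline

# Instruction-tuned rephraser; the exact checkpoint is an assumption here.
rephraser = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

# Illustrative style prompts (not the paper's exact wording).
STYLE_PROMPTS = {
    "wikipedia": ("Paraphrase the following passage in high-quality English, "
                  "in the style of a Wikipedia article:\n\n{doc}"),
    "qa": ("Rewrite the following passage as a series of questions and "
           "answers:\n\n{doc}"),
}


def rephrase(doc: str, style: str = "wikipedia", max_new_tokens: int = 512) -> str:
    """Ask the instruction-tuned model to rewrite one web document."""
    messages = [{"role": "user", "content": STYLE_PROMPTS[style].format(doc=doc)}]
    prompt = rephraser.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    out = rephraser(prompt, max_new_tokens=max_new_tokens,
                    do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()


def build_mixture(real_docs, synthetic_docs, synthetic_fraction=0.5, seed=0):
    """Combine real and synthetic documents into one shuffled corpus.

    synthetic_fraction=0.5 corresponds to a 1:1 real-to-synthetic mix;
    the paper studies several such ratios.
    """
    rng = random.Random(seed)
    n_syn = int(len(real_docs) * synthetic_fraction / (1.0 - synthetic_fraction))
    corpus = list(real_docs) + list(synthetic_docs[:n_syn])
    rng.shuffle(corpus)
    return corpus
```

In practice the rephrased outputs would then be tokenized, chunked to the pre-training context length, and fed to a standard causal-LM training loop; the sketch only covers generating the synthetic rephrases and mixing them with real data.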
Main Findings and Results
The study presents several significant findings:
- Improved Perplexity: Models trained with WRAP showed a more than 10% improvement in perplexity on average across different subsets of the Pile dataset. This indicates that WRAP-trained models predict the next token more accurately than those trained on real data alone (a sketch of how perplexity is computed follows this list).
- Enhanced Zero-Shot Performance: WRAP-trained models exhibited over 2% higher accuracy in zero-shot question-answering tasks across 13 benchmarks.
- Data and Compute Efficiency: By incorporating synthetic data, the models achieved equivalent performance with up to 5x less data and 3x less compute. For example, a 350M parameter model trained on a combination of real and synthetic data outperformed a 1.3B parameter model trained on real data alone.
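For readers unfamiliar with the metric, the sketch below shows one common way to compute token-level perplexity of a causal LM on held-out documents (e.g. text from a Pile sub-domain). The model name and evaluation loop are illustrative assumptions, not the paper's evaluation code.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model_name: str, docs: list[str], max_length: int = 1024) -> float:
    """Token-level perplexity of a causal LM on a list of held-out documents."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for doc in docs:
            enc = tok(doc, return_tensors="pt", truncation=True, max_length=max_length)
            if enc["input_ids"].size(1) < 2:
                continue  # nothing to predict for single-token documents
            # With labels == input_ids, the model reports the mean cross-entropy
            # of predicting each next token from the preceding context.
            out = model(**enc, labels=enc["input_ids"])
            n_pred = enc["input_ids"].size(1) - 1  # number of predicted tokens
            total_nll += out.loss.item() * n_pred
            total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```

A lower perplexity on the same held-out text means the model assigns higher probability to the reference continuation; running this kind of comparison for a WRAP-trained checkpoint versus a C4-only checkpoint is what the >10% average improvement above refers to.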
Implications and Potential Applications
The implications of these findings are far-reaching:
- Cost Reduction: The ability to pre-train LLMs with significantly less data and compute resources can reduce the financial and environmental costs associated with large-scale model training.
- Accessibility: This method democratizes access to high-performing LLMs by enabling smaller organizations and academic institutions to train competitive models without requiring extensive resources.
- Data Diversity and Quality: The use of synthetic data introduces stylistic diversity and higher quality into the training dataset, which can enhance model performance on a broader range of downstream tasks.
Conclusion
The WRAP methodology offers a promising solution to the dual challenges of data scarcity and computational expense in LLM training.
By utilizing synthetic data generated through rephrasing, this approach not only improves model performance but also makes the training process more efficient and accessible.
As the field of machine learning continues to advance, innovative techniques like WRAP will play a crucial role in optimizing the development and deployment of powerful language models.
In summary, integrating synthetic rephrased data into LLM pre-training represents a significant step toward more efficient and effective language modeling.
As researchers and practitioners explore this approach further, it could make high-quality language models attainable for a much wider audience.