Original Paper: https://arxiv.org/abs/2403.10131
By: Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, Joseph E. Gonzalez
Abstract:
Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based-prompting, or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in a "open-book" in-domain settings. In RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call, distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question. This coupled with RAFT's chain-of-thought-style response helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs to in-domain RAG.
Summary Notes
Figure 2: Overview of our RAFT method. The top-left figure depicts our approach of adapting LLMs to reading solution from a set of positive and distractor documents in contrast to standard RAG setup where models are trained based on the retriever outputs, which is a mixture of both memorization and reading. At test time, all methods follow the standard RAG setting, provided with a top-k retrieved documents in the context.
Introduction
Imagine preparing for an open-book exam. You have access to all the reference materials, but without proper preparation, you might struggle to find the right answers quickly. This is akin to how Large Language Models (LLMs) operate when tasked with domain-specific questions. While a retriever can place relevant documents in the model's context, the model often falters unless it has been trained to sift through them effectively. Enter Retrieval-Augmented Fine-Tuning (RAFT), a technique designed to bridge this gap and improve LLM performance in domain-specific settings.
The Challenge
LLMs such as GPT-3 perform well on general knowledge tasks. In specialized domains such as legal or medical question answering, however, their performance is inconsistent. The two traditional approaches both fall short here: relying solely on pre-trained knowledge (akin to a closed-book exam), or retrieving documents into the context without first fine-tuning the model to use them.
RAFT aims to address this by combining the strengths of instruction fine-tuning (IFT) and Retrieval-Augmented Generation (RAG). The goal is to enable models to leverage domain-specific knowledge effectively and improve their robustness against irrelevant or "distractor" documents.
Key Methodologies
1. Supervised Fine-Tuning
In a supervised fine-tuning (SFT) setting, the model is trained on a dataset of question-answer pairs without any additional context. This method helps the model learn general patterns but does not prepare it to read and use retrieved documents when answering domain-specific questions.
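To make the contrast with RAFT concrete, here is a minimal sketch of what a context-free SFT example might look like. The function name `build_sft_example` and the prompt template are assumptions made for illustration, not a format taken from the paper.

```python
def build_sft_example(question: str, answer: str) -> dict:
    """Format one context-free question-answer pair for supervised fine-tuning."""
    return {
        # Assumed template: the prompt holds only the question, no documents.
        "prompt": f"Question: {question}\nAnswer:",
        "completion": f" {answer}",
    }


sft_example = build_sft_example(
    "Which enzyme unwinds the DNA double helix during replication?",
    "Helicase",
)
print(sft_example["prompt"] + sft_example["completion"])
```

Note that nothing in the prompt resembles retrieved text, so a model trained only this way never practices reading a context.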
2. Retrieval-Augmented Fine-Tuning (RAFT)
RAFT enhances the traditional SFT approach by incorporating relevant documents into the training process. Here's how:
- Inclusion of Distractor Documents: During training, the model is exposed to both relevant ("golden") documents and irrelevant ("distractor") documents. This mimics real-world scenarios where not all retrieved documents are useful.
- Chain-of-Thought Reasoning: The model is trained to generate answers that include a reasoning process and cite text verbatim from the relevant documents. This improves both accuracy and the model's ability to handle distractors.
- Varied Training Data: RAFT incorporates a mix of data points with and without the golden documents. This technique compels the model to learn when to rely on its pre-trained knowledge versus when to extract information from the context.
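The three ingredients above can be sketched as a small data-building routine. This is an illustrative sketch, not the authors' code: the names `QAItem` and `build_raft_example`, the prompt template, the default `p_golden=0.8`, and the choice to pad with one extra distractor when the golden document is dropped are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class QAItem:
    question: str
    golden_doc: str   # the document that contains the answer
    cot_answer: str   # reasoning that quotes golden_doc, then the final answer


def build_raft_example(item, corpus, num_distractors=3, p_golden=0.8, rng=None):
    """Build one training example: mixed context, question, chain-of-thought target."""
    rng = rng or random.Random()
    pool = [doc for doc in corpus if doc != item.golden_doc]
    include_golden = rng.random() < p_golden
    # Assumption: keep the context size fixed by adding one extra distractor
    # whenever the golden document is left out.
    n = num_distractors if include_golden else num_distractors + 1
    docs = rng.sample(pool, n)  # requires len(pool) >= n
    if include_golden:
        docs.append(item.golden_doc)
    rng.shuffle(docs)  # the golden document should not sit at a fixed position
    context = "\n".join(f"[Document {i + 1}] {doc}" for i, doc in enumerate(docs))
    return {
        "prompt": f"{context}\n\nQuestion: {item.question}\nAnswer:",
        "completion": " " + item.cot_answer,
        "has_golden": include_golden,
    }


corpus = [
    "Helicase unwinds the DNA double helix at the replication fork.",
    "Ligase joins Okazaki fragments on the lagging strand.",
    "Ribosomes translate messenger RNA into protein.",
    "Mitochondria produce most of the cell's ATP.",
    "Chloroplasts carry out photosynthesis in plant cells.",
]
item = QAItem(
    question="Which enzyme unwinds DNA during replication?",
    golden_doc=corpus[0],
    cot_answer='The context states that "Helicase unwinds the DNA double helix '
               'at the replication fork." Final answer: helicase.',
)
with_golden = build_raft_example(item, corpus, p_golden=1.0, rng=random.Random(0))
without_golden = build_raft_example(item, corpus, p_golden=0.0, rng=random.Random(0))
```

Setting `p_golden` to 1.0 or 0.0 forces the two cases, which makes it easy to inspect a prompt with and without the golden document. In a real run the value would sit somewhere in between and be tuned per dataset.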
Main Findings
Enhanced Performance
Across multiple datasets (PubMed, HotpotQA, and Gorilla API Bench), RAFT consistently outperformed traditional SFT and domain-specific fine-tuning methods. For instance, on the HotpotQA dataset, RAFT showed a significant performance boost, highlighting its effectiveness in handling complex, multi-hop questions.
Robustness to Distractors
One of the standout features of RAFT is its robustness to irrelevant documents. By training with a mix of golden and distractor documents, RAFT-equipped models were able to maintain high accuracy even when faced with varying numbers of test-time documents.
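One way to probe this kind of robustness is to score the same model while varying how many retrieved documents it sees at test time. The sketch below assumes exact-match scoring and a hypothetical `answer_fn` callable that wraps the model. `toy_answer_fn` is only a stand-in so the example runs, and its scores say nothing about RAFT itself.

```python
from typing import Callable, Dict, List, Sequence


def accuracy_by_num_docs(
    examples: Sequence[dict],
    answer_fn: Callable[[str, List[str]], str],
    ks: Sequence[int],
) -> Dict[int, float]:
    """Exact-match accuracy when the model is shown the top-k retrieved documents."""
    results = {}
    for k in ks:
        correct = 0
        for ex in examples:
            docs = list(ex["retrieved"][:k])  # retrieved list is ranked best-first
            prediction = answer_fn(ex["question"], docs)
            correct += prediction.strip().lower() == ex["answer"].strip().lower()
        results[k] = correct / len(examples)
    return results


def toy_answer_fn(question: str, docs: List[str]) -> str:
    """Stand-in for a real model: answers only if some document mentions helicase."""
    return "helicase" if any("helicase" in d.lower() for d in docs) else "unknown"


eval_examples = [
    {
        "question": "Which enzyme unwinds DNA during replication?",
        "answer": "Helicase",
        "retrieved": [
            "Ligase joins Okazaki fragments on the lagging strand.",
            "Helicase unwinds the DNA double helix at the replication fork.",
            "Ribosomes translate messenger RNA into protein.",
        ],
    }
]
scores = accuracy_by_num_docs(eval_examples, toy_answer_fn, ks=[1, 2, 3])
print(scores)
```

Plotting the resulting dictionary for models trained with different numbers of distractors shows whether accuracy holds up as the test-time context grows.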
Chain-of-Thought Impact
Incorporating a reasoning chain significantly improved the model's ability to generate accurate answers. This approach prevents overfitting to concise answers and enriches the model's understanding, making it more resilient to noise and improving overall performance.
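Because the reasoning chain is meant to cite the source text verbatim, a simple sanity check on training targets is to confirm that every quoted span really appears in one of the context documents. The sketch below assumes quotes are wrapped in `##begin_quote##` and `##end_quote##` markers. Treat those marker strings, the `##Reason:` and `##Answer:` labels, and the function names as illustrative assumptions.

```python
import re
from typing import List

# Assumption: quoted spans are delimited by these markers in the response text.
QUOTE_RE = re.compile(r"##begin_quote##(.*?)##end_quote##", re.DOTALL)


def extract_quotes(response: str) -> List[str]:
    """Return every quoted span in the response, with surrounding whitespace removed."""
    return [quote.strip() for quote in QUOTE_RE.findall(response)]


def quotes_are_verbatim(response: str, documents: List[str]) -> bool:
    """True if the response has at least one quote and each quote occurs in some document."""
    quotes = extract_quotes(response)
    return bool(quotes) and all(
        any(quote in doc for doc in documents) for quote in quotes
    )


documents = [
    "Helicase unwinds the DNA double helix at the replication fork.",
    "Ligase joins Okazaki fragments on the lagging strand.",
]
good_response = (
    "##Reason: The context says ##begin_quote##Helicase unwinds the DNA double "
    "helix##end_quote##, so the enzyme is helicase. ##Answer: helicase"
)
bad_response = (
    "##Reason: The context says ##begin_quote##Helicase repairs the DNA double "
    "helix##end_quote##. ##Answer: helicase"
)
print(quotes_are_verbatim(good_response, documents))
print(quotes_are_verbatim(bad_response, documents))
```

A filter like this can drop generated chain-of-thought targets whose citations were paraphrased or invented before they reach the fine-tuning set.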
Implications and Applications
Real-World Scenarios
RAFT's implications are vast, particularly in fields requiring precise information retrieval from large document collections. For example:
- Healthcare: Doctors can use RAFT-enhanced models to quickly retrieve and verify information from medical research papers.
- Legal: Lawyers can leverage these models to find relevant case laws and precedents efficiently.
- Enterprise: Companies can use RAFT-trained models to navigate through vast internal documentation, improving operational efficiency.
Future Research
While RAFT has shown promising results, there's room for further exploration:
- Optimal Training Mix: Determining the ideal proportion of golden to distractor documents for various domains can further refine RAFT's effectiveness.
- Generalization: Examining how well RAFT generalizes across different retrieval mechanisms and document types can provide deeper insights.
Conclusion
RAFT represents a significant step forward in adapting LLMs for domain-specific tasks. By fine-tuning models with a blend of relevant and irrelevant documents and incorporating chain-of-thought reasoning, RAFT improves both accuracy and robustness against distractors. This makes it a practical recipe for engineers and researchers building domain-specific RAG systems.