Original Paper: https://arxiv.org/abs/2408.01262
By: Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract:
Retrieval-Augmented Generation (RAG) systems have demonstrated their advantages in alleviating the hallucination of Large Language Models (LLMs). Existing RAG benchmarks mainly focus on evaluating whether LLMs can correctly answer general knowledge questions. However, they are unable to evaluate the effectiveness of the RAG system in dealing with data from different vertical domains. This paper introduces RAGEval, a framework for automatically generating evaluation datasets to evaluate the knowledge usage ability of different LLMs in different scenarios. Specifically, RAGEval summarizes a schema from seed documents, applies the configurations to generate diverse documents, and constructs question-answering pairs according to both articles and configurations. We propose three novel metrics, Completeness, Hallucination, and Irrelevance, to carefully evaluate the responses generated by LLMs. By benchmarking RAG models in vertical domains, RAGEval is able to better evaluate the knowledge usage ability of LLMs, avoiding the confusion in existing QA datasets about where the knowledge used to answer a question comes from: parameterized memory or retrieval. The code and dataset will be released.
Summary Notes
Figure: RAGEval Framework.
Introduction
Retrieval-Augmented Generation (RAG) systems have emerged as powerful tools in natural language processing (NLP), especially for enhancing the factual accuracy of large language models (LLMs). However, current benchmarks often fall short in evaluating these systems across various specialized domains. Enter RAGEval—a novel framework designed to generate scenario-specific datasets that comprehensively assess the performance of RAG systems. This blog post delves into the mechanics, innovations, and implications of RAGEval, a game-changer for vertical domain evaluations.
Key Methodologies
Schema Summary and Document Generation
RAGEval begins by summarizing a schema from a small set of domain-specific documents. This schema encapsulates essential domain-specific knowledge, ensuring that the generated documents maintain internal consistency and factual accuracy. For instance, in the legal domain, the schema might include elements such as "court," "judge," "defendant," "crime details," and "judgment result."
Using this schema, RAGEval generates configurations that serve as templates for creating diverse documents. These configurations ensure that the generated texts are rich in factual details and logically coherent. For example, a configuration in the financial domain might cover various industries like agriculture or aviation, ensuring a broad representation of business sectors.
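To make the schema-to-configuration step concrete, here is a minimal sketch. The field names, values, and prompt wording are illustrative assumptions, not the paper's actual schema or generation prompts:

```python
# Hypothetical legal-domain schema: the fields every generated document must cover.
legal_schema = {
    "court": "string",
    "judge": "string",
    "defendant": "string",
    "crime_details": "string",
    "judgment_result": "string",
}

# One configuration instantiates the schema with concrete values; a document
# generator (e.g., an LLM prompt) then expands it into full text.
example_config = {
    "court": "Example District Court",
    "judge": "J. Doe",
    "defendant": "A. Smith",
    "crime_details": "alleged contract fraud involving falsified invoices",
    "judgment_result": "fined and sentenced to two years' probation",
}

def config_to_prompt(config: dict) -> str:
    """Turn a configuration into a document-generation prompt."""
    facts = "\n".join(f"- {field}: {value}" for field, value in config.items())
    return (
        "Write an internally consistent legal case report "
        "that states all of the following facts exactly once:\n" + facts
    )

print(config_to_prompt(example_config))
```

Because every document is expanded from an explicit configuration, the generator always knows which facts the text is supposed to contain, which is what makes the later answer checking possible.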
QRA Generation
The framework then uses these configurations to generate Question-Reference-Answer (QRA) triples. This involves the following steps (a code sketch follows the list):
- Initializing QA Pairs: Questions and initial answers are generated based on the configurations.
- Extracting References: Relevant information fragments (references) are extracted from the documents to support the answers.
- Optimizing Answers and References: Answers are refined to ensure they align with the provided references, minimizing misinformation.
- Generating Keypoints: Key points are extracted from the standard answers, focusing on indispensable factual information, relevant inferences, and final conclusions.
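A rough outline of that loop might look like the sketch below. The `generate` function, the prompts, and the data shapes are all assumptions made for illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class QRATriple:
    question: str
    references: list[str]
    answer: str
    keypoints: list[str] = field(default_factory=list)

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; swap in any chat/completion API."""
    raise NotImplementedError

def build_qra(config: dict, documents: list[str]) -> QRATriple:
    # 1. Initialize a QA pair from the configuration.
    question = generate(f"Write a factual question answerable from: {config}")
    draft_answer = generate(f"Answer '{question}' using only: {config}")

    # 2. Extract supporting reference fragments from the generated documents.
    references = [
        doc for doc in documents
        if generate(f"Does this passage support '{draft_answer}'? Passage: {doc}") == "yes"
    ]

    # 3. Optimize the answer so every claim aligns with the references.
    answer = generate(f"Rewrite '{draft_answer}' so each claim is backed by: {references}")

    # 4. Extract key points (facts, inferences, conclusions) for later scoring.
    keypoints = generate(f"List the indispensable key points of: {answer}").split("\n")

    return QRATriple(question, references, answer, keypoints)
```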
Evaluation Metrics
RAGEval introduces three novel metrics to assess the quality of model responses:
- Completeness: Measures how well the generated answer captures the key information from the ground truth.
- Hallucination: Identifies instances where the content contradicts key points, highlighting potential inaccuracies.
- Irrelevance: Assesses the proportion of key points from the ground truth that are neither covered nor contradicted by the generated answer.
These metrics provide a comprehensive evaluation of the RAG system's performance, ensuring that the generated answers are informative, accurate, and relevant.
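Assuming each ground-truth key point is first labeled by a judge model as covered, contradicted, or missed by the response (that labeling step is not shown here), the three scores reduce to simple proportions:

```python
def score_response(labels: list[str]) -> dict[str, float]:
    """labels: one of {"covered", "contradicted", "missed"} per ground-truth key point."""
    total = len(labels)
    if total == 0:
        return {"completeness": 0.0, "hallucination": 0.0, "irrelevance": 0.0}
    return {
        # Share of key points the answer correctly captures.
        "completeness": labels.count("covered") / total,
        # Share of key points the answer contradicts.
        "hallucination": labels.count("contradicted") / total,
        # Share of key points the answer neither covers nor contradicts.
        "irrelevance": labels.count("missed") / total,
    }

print(score_response(["covered", "covered", "contradicted", "missed"]))
# {'completeness': 0.5, 'hallucination': 0.25, 'irrelevance': 0.25}
```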
Main Findings and Results
Performance Comparison
The study compared the performance of various RAG systems, including both open-source models like Baichuan-2-7B-chat and proprietary models like GPT-4. The results showed that while GPT-4 performed best overall, the gap with top-performing open-source models was relatively small. For instance, GPT-4 achieved a Completeness score of 0.5187 in Chinese and 0.6845 in English, only marginally outperforming models like Qwen1.5-14B-chat and Llama3-8B-Instruct.
Impact of Retrieval Settings
The experiments also explored the impact of common RAG settings like TopK retrieval and chunk size. Higher TopK values improved Recall, leading to more complete and accurate responses with reduced hallucination. For example, increasing TopK from 2 to 6 significantly improved Completeness scores in both Chinese and English.
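As a rough illustration of what varying TopK means in practice, the sketch below retrieves the k highest-scoring chunks by cosine similarity. The `embed` function is a placeholder, not a specific library call:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function; replace with any sentence encoder."""
    raise NotImplementedError

def retrieve_top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    chunk_vecs = [embed(c) for c in chunks]
    # Cosine similarity between the query and every chunk.
    scores = [
        float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for v in chunk_vecs
    ]
    # A larger k passes more context to the generator: recall (and Completeness)
    # tends to rise, at the cost of longer prompts.
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```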
Hyperparameter Tuning
Optimal chunk size varied between languages. Smaller chunks (e.g., 128 tokens) generally led to better retrieval metrics and lower hallucination in Chinese, while slightly larger chunks (e.g., 256 tokens) were more effective in English. This highlights the importance of careful tuning in retrieval-augmented generation systems.
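A minimal fixed-size chunker, splitting on whitespace tokens as a stand-in for a real tokenizer, shows the knob being tuned (128 and 256 are just the values mentioned above):

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 0) -> list[str]:
    """Split text into fixed-size chunks of whitespace tokens.

    A real system would count tokenizer tokens rather than words, but the
    trade-off is the same: smaller chunks give more precise retrieval,
    larger chunks give more context per retrieved hit.
    """
    tokens = text.split()
    step = max(chunk_size - overlap, 1)
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), step)
    ]

# Compare how many chunks the two settings produce for a 1000-token document.
doc = "word " * 1000
print(len(chunk_text(doc, chunk_size=128)), len(chunk_text(doc, chunk_size=256)))  # 8 4
```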
Implications and Potential Applications
RAGEval's comprehensive evaluation framework has significant implications across various vertical domains:
- Finance: Enhances the accuracy of financial reports, ensuring that data-driven insights are reliable and verifiable.
- Healthcare: Improves the quality of medical records by minimizing factual errors and ensuring comprehensive coverage of patient information.
- Legal: Ensures that legal documents are accurate and consistent, reducing the risk of misinformation in case judgments and legal references.
By providing a rigorous and domain-specific evaluation, RAGEval enables the development of more robust and reliable RAG systems tailored to specialized industries.
Conclusion
RAGEval marks a significant advancement in the evaluation of Retrieval-Augmented Generation systems. By addressing the limitations of existing benchmarks, it provides a more accurate and comprehensive assessment of RAG performance across various vertical domains. As open-source models continue to improve, frameworks like RAGEval will be crucial in ensuring that these advancements translate into practical, real-world applications.
Think about this: As RAG systems become more integral to various industries, how can we further refine evaluation metrics to ensure they keep pace with the evolving landscape of NLP?