Original Paper: https://arxiv.org/abs/2408.01262
By: Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract:
Retrieval-Augmented Generation (RAG) systems have demonstrated their advantages in alleviating the hallucination of Large Language Models (LLMs). Existing RAG benchmarks mainly focus on evaluating whether LLMs can correctly answer general knowledge questions. However, they are unable to evaluate the effectiveness of RAG systems in dealing with data from different vertical domains. This paper introduces RAGEval, a framework for automatically generating evaluation datasets to evaluate the knowledge usage ability of different LLMs in different scenarios. Specifically, RAGEval summarizes a schema from seed documents, applies configurations derived from it to generate diverse documents, and constructs question-answering pairs according to both the articles and the configurations. We propose three novel metrics, Completeness, Hallucination, and Irrelevance, to carefully evaluate the responses generated by LLMs. By benchmarking RAG models in vertical domains, RAGEval can better evaluate the knowledge usage ability of LLMs, avoiding the confusion present in existing QA datasets about whether the knowledge used to answer a question comes from parameterized memory or from retrieval. The code and dataset will be released.
Summary Notes
Introduction
In the dynamic landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have achieved remarkable feats. However, these models often suffer from the hallucination problem, where they generate responses with factual errors. To mitigate this, Retrieval-Augmented Generation (RAG) systems have emerged, leveraging external data to enhance the factual accuracy of responses. Yet, existing benchmarks fall short in evaluating RAG systems across diverse vertical domains. Enter RAGEval, a groundbreaking framework designed to generate scenario-specific evaluation datasets, meticulously assessing the knowledge usage capabilities of different LLMs in varied contexts.
Key Methodologies
Schema Summary
The RAGEval framework begins by summarizing a schema from a small set of domain-specific seed documents. This schema encapsulates essential domain-specific knowledge, ensuring the generated content is both professional and reliable. For instance, in the financial domain, the schema may include key elements such as organizations, events, and dates. The schema is then used to guide the generation of diverse and contextually accurate documents.
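As a rough illustration, the schema can be thought of as a typed list of the elements a domain document must contain, and a configuration as one concrete instantiation of it. The sketch below is a hypothetical Python rendering of a financial-domain schema; the field names and values are illustrative assumptions, not the paper's published format.

```python
# Hypothetical sketch of a domain schema and a configuration instantiated from it.
# Field names and values are illustrative assumptions, not the paper's actual format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FinanceSchema:
    """Element types a schema summary might extract from financial seed documents."""
    organizations: List[str] = field(default_factory=list)   # e.g., company names
    events: List[str] = field(default_factory=list)          # e.g., "annual report release"
    dates: List[str] = field(default_factory=list)           # e.g., reporting periods
    figures: List[str] = field(default_factory=list)         # e.g., revenue, net profit

# A configuration fills the schema with scenario-specific values that the
# document generator must stay consistent with.
config = FinanceSchema(
    organizations=["Example Securities Co."],
    events=["release of the 2023 annual report"],
    dates=["2024-03-15"],
    figures=["revenue: 1.2B CNY", "net profit: 90M CNY"],
)
```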
Document Generation
Generating high-quality documents with rich factual information is critical for effective evaluation. RAGEval employs a hybrid approach, combining rule-based methods and LLMs, to derive concrete configurations from the schema; documents are then generated from these configurations, which keeps the texts internally consistent and faithful to the domain-specific knowledge. For example, for medical records the schema covers patient information, medical history, examination results, and treatment plans, ensuring comprehensive coverage of the relevant information.
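A minimal sketch of that hybrid pipeline is shown below, assuming a hypothetical `call_llm` helper for the LLM step: structured slots such as dates are filled by simple rules, free-text slots are delegated to the model, and the filled configuration is then expanded into a full document. The prompts and slot names are illustrative, not the paper's exact ones.

```python
# Hedged sketch of the hybrid configuration -> document pipeline described above.
# `call_llm` is a hypothetical stand-in for any chat-completion client, and the
# prompts and slot names are illustrative assumptions.
import random

def call_llm(prompt: str) -> str:
    """Placeholder for a single LLM completion call."""
    raise NotImplementedError

def generate_config(schema: dict) -> dict:
    """Fill schema slots: structured slots by rules, free-text slots via the LLM."""
    # In a real pipeline the schema would drive which slots exist; a few are
    # hard-coded here for brevity.
    config = {"date": f"2023-{random.randint(1, 12):02d}-01"}        # rule-based slot
    config["organization"] = call_llm(
        "Invent a plausible company name for a financial filing."     # LLM-based slot
    )
    config["event"] = call_llm(
        f"In one sentence, describe a financial event involving "
        f"{config['organization']} on {config['date']}."
    )
    return config

def generate_document(config: dict) -> str:
    """Expand a filled configuration into a full, internally consistent document."""
    return call_llm(
        "Write a detailed financial report consistent with these facts:\n"
        + "\n".join(f"- {k}: {v}" for k, v in config.items())
    )
```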
QRA Generation
To evaluate the RAG system's effectiveness, RAGEval generates Question-Reference-Answer (QRA) triples using the documents and configurations. This involves creating a diverse set of question types, including factual questions, multi-hop reasoning questions, and summarization questions. The goal is to assess various aspects of language understanding and information processing. The generated questions and initial answers are refined by extracting relevant references from the documents and ensuring alignment with the provided references, minimizing misinformation.
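The sketch below shows one way such a QRA loop could look, again assuming a hypothetical `call_llm` helper and a simple JSON contract for its output; the question types and prompts are illustrative rather than the paper's exact procedure.

```python
# Hedged sketch of QRA (Question-Reference-Answer) generation. `call_llm` is a
# hypothetical LLM helper, and the prompts and JSON contract are assumptions.
import json
from typing import List, TypedDict

class QRA(TypedDict):
    question: str
    references: List[str]   # document fragments that support the answer
    answer: str

def generate_qras(document: str, config: dict, call_llm) -> List[QRA]:
    """Draft questions and answers from a document and its configuration, then
    refine each answer so it is grounded in references extracted from the text."""
    draft = call_llm(
        "From the facts and document below, write factual, multi-hop, and "
        "summarization questions with answers, as a JSON list of "
        '{"question": ..., "answer": ...} objects.\n'
        f"Facts: {json.dumps(config, ensure_ascii=False)}\nDocument:\n{document}"
    )
    qras: List[QRA] = []
    for item in json.loads(draft):
        # Extract the supporting references for this question from the document.
        refs = call_llm(
            f"Quote the sentences from the document that answer: {item['question']}\n"
            f"Document:\n{document}"
        ).splitlines()
        # Refine the draft answer so it aligns strictly with those references.
        answer = call_llm(
            "Rewrite the answer so every claim is supported by the references.\n"
            f"References: {refs}\nDraft answer: {item['answer']}"
        )
        qras.append(QRA(question=item["question"], references=refs, answer=answer))
    return qras
```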
Main Findings and Results
Evaluation Metrics
RAGEval introduces three novel metrics to evaluate the quality of model responses: Completeness, Hallucination, and Irrelevance. Each metric scores a response against the key points of the ground-truth answer, together giving a comprehensive assessment of factual accuracy and relevance (a minimal scoring sketch follows the list).
- Completeness measures the proportion of key points from the ground truth that are covered by the generated answer.
- Hallucination measures the proportion of key points from the ground truth that are contradicted by the generated answer.
- Irrelevance assesses the proportion of key points from the ground truth that are neither covered nor contradicted by the generated answer.
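In other words, each ground-truth key point is judged as covered, contradicted, or missed, and the three metrics are the corresponding proportions (under that labeling they sum to 1 for each answer). Below is a minimal sketch of the scoring loop, with a hypothetical `judge_key_point` standing in for the LLM-based judgment the paper relies on.

```python
# Minimal sketch of key-point-based scoring, assuming each key point receives
# exactly one label. `judge_key_point` is a hypothetical stand-in for the
# LLM-as-judge step that performs the actual comparison.
from typing import List

def judge_key_point(key_point: str, answer: str) -> str:
    """Return 'covered', 'contradicted', or 'missed' for one ground-truth key point."""
    raise NotImplementedError  # e.g., implemented with an LLM judging prompt

def score_answer(key_points: List[str], answer: str) -> dict:
    """Compute Completeness, Hallucination, and Irrelevance for one answer."""
    labels = [judge_key_point(kp, answer) for kp in key_points]
    n = len(key_points)
    return {
        "completeness": labels.count("covered") / n,
        "hallucination": labels.count("contradicted") / n,
        "irrelevance": labels.count("missed") / n,   # neither covered nor contradicted
    }
```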
Experimental Results
The experimental results demonstrate that the new metrics provide a more accurate assessment of model performance in RAG scenarios than conventional metrics like ROUGE-L. Notably, GPT-4o showed the best overall performance, but the gap with the top-performing open-source models was relatively small. For instance, in the Chinese evaluation, the 2B-scale MiniCPM-2B achieved a Completeness score of 0.4114, surpassing even larger models such as Baichuan-2-7B-chat.
In the retrieval phase, the BGE-M3 model achieved the highest Recall of 0.8387 and Completeness of 0.6980 in the Chinese setting, indicating robust retrieval capabilities.
Implications and Potential Applications
Enhanced Evaluation Framework
RAGEval's comprehensive evaluation framework offers a more accurate and nuanced assessment of RAG systems, particularly in domains like finance, healthcare, and legal sectors. This can guide the development of more reliable and effective RAG models, tailored to specific industry needs.
Improvement in Open-Source Models
The findings suggest significant potential for improvement in open-source models. With further advancements, these models could closely match or even surpass the performance of proprietary models like GPT-4o, democratizing access to high-quality NLP tools.
Real-World Applications
By providing a robust framework for evaluating RAG systems, RAGEval can drive improvements in applications such as automated customer support, legal document analysis, and medical diagnostics, where accuracy and contextual understanding are paramount.
Conclusion
RAGEval represents a significant advancement in the evaluation of Retrieval-Augmented Generation systems. By focusing on factual accuracy and domain-specific knowledge, it addresses the limitations of existing benchmarks and provides a comprehensive framework for assessing RAG systems' effectiveness. The experimental results highlight the framework's robustness and the potential for further improvements in open-source models. As the NLP field continues to evolve, RAGEval offers a valuable tool for developing more reliable and contextually aware language models.
By introducing RAGEval, we take a crucial step towards refining the capabilities of LLMs in real-world applications. Whether you're an engineer working on the next big NLP project or a researcher looking to push the boundaries of what's possible, RAGEval offers the insights and tools you need to succeed.