HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction

Athina AI

09 Aug 2024 — 4 min read

Original Paper: https://arxiv.org/abs/2408.04948

By: Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, Dhagash Mehta

Abstract

Extraction and interpretation of intricate information from unstructured text data arising in financial applications, such as earnings call transcripts, present substantial challenges to large language models (LLMs) even using the current best practices to use Retrieval Augmented Generation (RAG) (referred to as VectorRAG techniques which utilize vector databases for information retrieval) due to challenges such as domain-specific terminology and complex formats of the documents.

We introduce a novel approach based on a combination, called HybridRAG, of the Knowledge Graphs (KGs) based RAG techniques (called GraphRAG) and VectorRAG techniques to enhance question-answer (Q&A) systems for information extraction from financial documents that are shown to be capable of generating accurate and contextually relevant answers.

Using experiments on a set of financial earning call transcripts documents which come in the form of Q&A format, and hence provide a natural set of pairs of ground-truth Q&As, we show that HybridRAG which retrieves context from both vector database and KG outperforms both traditional VectorRAG and GraphRAG individually when evaluated at both the retrieval and generation stages in terms of retrieval accuracy and answer generation.

The proposed technique has applications beyond the financial domain.

Summary Notes

Figure 1. A schematic diagram describing the vector database creation of a RAG application.

Figure 2. A schematic diagram describing knowledge graph creation process of GraphRAG.

Introduction

In the realm of financial analysis, extracting and interpreting intricate information from unstructured text data, such as earnings call transcripts and financial reports, poses a significant challenge.

Traditional data analysis methods often struggle with the domain-specific language, multiple data formats, and unique contextual relationships inherent in these documents.

Even sophisticated Large Language Models (LLMs) face limitations in this domain.

Enter HybridRAG, a novel approach that integrates Knowledge Graphs (KGs) and Vector Retrieval techniques to enhance the performance of question-answer (Q&A) systems.

This post walks through the HybridRAG system: its methodology, key findings, and implications for the financial industry.

The Challenge: Extracting Information from Financial Documents

Financial analysts rely on unstructured data sources such as news articles, earnings reports, and other financial documents to make informed investment decisions.

However, these sources are rife with domain-specific language, multiple data formats, and unique contextual relationships, making it difficult for models to extract meaningful insights.

Traditional Retrieval-Augmented Generation (RAG) techniques, which use vector databases for information retrieval, struggle with these complexities.

Limitations of Current RAG Techniques

Current RAG techniques involve retrieving relevant textual information to support generation tasks.

While effective in some scenarios, these techniques often fall short when applied to financial documents. The reasons include:

Domain-Specific Terminology: Financial documents contain specialized language that general-purpose models struggle to understand.
Complex Data Formats: Variations in terminology, format, and context across different documents make it challenging to extract coherent information.
Inconsistent Context Retrieval: The quality of the retrieved context can be inconsistent, leading to inaccuracies and incomplete analyses.

The HybridRAG Approach

HybridRAG combines the strengths of Knowledge Graph-based RAG (GraphRAG) and vector-based RAG (VectorRAG) to create a more robust information extraction system.

Methodologies

VectorRAG:

Query and Retrieval: The process begins with a query about external documents that are not part of the LLM's training dataset. These documents are divided into chunks and stored in a vector database.
Similarity Search: A similarity search within the vector database retrieves the most relevant chunks.
Context Integration: The top-ranked chunks are aggregated to provide context for the generative model.
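
The three VectorRAG steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the bag-of-words `embed` function stands in for a real embedding model, an in-memory list stands in for a vector database, and all function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=12):
    """Split a document into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Similarity search: return the `top_k` chunks closest to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


def build_context(chunks):
    """Context integration: join the top-ranked chunks into one string."""
    return "\n\n".join(chunks)


# Hypothetical transcript snippets, already chunked.
transcript_chunks = [
    "Revenue grew twelve percent year over year driven by retail loans.",
    "The board approved a new dividend policy for shareholders.",
    "Operating margin improved due to lower input costs.",
]
context = build_context(retrieve("How much did revenue grow?", transcript_chunks, top_k=1))
print(context)
```

In a real system the embedding model and the vector store would be swapped in behind `embed` and `retrieve`; the overall flow of chunking, ranking, and joining stays the same.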

GraphRAG:

Knowledge Extraction: Structured information is extracted from documents to create KGs, consisting of entities and their relationships.
Subgraph Retrieval: For a given query, a subgraph of relevant nodes and edges is extracted from the KG to provide context.
Context Encoding: The graph structure is encoded into embeddings that the model can interpret, integrating it with the model's internal knowledge.
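
As a rough illustration of the GraphRAG steps, the sketch below stores a knowledge graph as (head, relation, tail) triples and retrieves the subgraph around entities mentioned in a query. Several things here are assumptions: the triples are hand-written rather than extracted by a model, entity matching is a plain substring check, and the subgraph is serialized to text instead of being encoded into embeddings, which is a simplification of the encoding step described above.

```python
from collections import defaultdict

# Hypothetical triples; a real pipeline would extract these from documents.
TRIPLES = [
    ("Acme Bank", "reported", "Q1 revenue growth"),
    ("Q1 revenue growth", "driven_by", "retail loans"),
    ("Acme Bank", "announced", "dividend"),
    ("Beta Motors", "launched", "electric SUV"),
]


def build_graph(triples):
    """Index every triple under both its head and its tail entity."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((head, relation, tail))
        graph[tail].append((head, relation, tail))
    return graph


def retrieve_subgraph(query, graph, hops=1):
    """Collect the edges within `hops` steps of entities named in the query."""
    query_lower = query.lower()
    frontier = {entity for entity in graph if entity.lower() in query_lower}
    visited = set(frontier)
    edges = []
    for _ in range(hops):
        next_frontier = set()
        for entity in sorted(frontier):  # sorted for a deterministic order
            for triple in graph[entity]:
                if triple not in edges:
                    edges.append(triple)
                for node in (triple[0], triple[2]):
                    if node not in visited:
                        visited.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return edges


def triples_to_text(triples):
    """Serialize edges as short sentences an LLM can read as context."""
    lines = []
    for head, relation, tail in triples:
        lines.append(f"{head} {relation.replace('_', ' ')} {tail}.")
    return "\n".join(lines)


graph = build_graph(TRIPLES)
subgraph = retrieve_subgraph("What did Acme Bank announce?", graph, hops=2)
print(triples_to_text(subgraph))
```

Raising `hops` pulls in facts that are only indirectly connected to the query entities, which is the kind of relational context a pure similarity search over chunks may miss.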

HybridRAG:

Context Combination: Combines contexts retrieved from both VectorRAG and GraphRAG.
Answer Generation: The combined context is fed into the LLM to generate a response, leveraging the strengths of both retrieval methods.
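
The combination step can be sketched as simple prompt assembly. The section labels, the ordering (passages first, then graph facts), the prompt wording, and the `llm` callable are all assumptions made for illustration; the summary above does not specify them.

```python
def combine_contexts(vector_chunks, graph_facts):
    """Merge both retrieval results into one labeled context string."""
    parts = []
    if vector_chunks:
        parts.append("Passages:\n" + "\n".join(f"- {chunk}" for chunk in vector_chunks))
    if graph_facts:
        parts.append("Knowledge graph facts:\n" + "\n".join(f"- {fact}" for fact in graph_facts))
    return "\n\n".join(parts)


def build_prompt(question, context):
    """Wrap the combined context and the question in a simple prompt."""
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, vector_chunks, graph_facts, llm):
    """`llm` is any callable mapping a prompt string to a response string."""
    return llm(build_prompt(question, combine_contexts(vector_chunks, graph_facts)))


# A stand-in "model" that just echoes its prompt, to show what the LLM would see.
prompt_seen = answer(
    "What drove revenue growth?",
    ["Revenue grew twelve percent year over year driven by retail loans."],
    ["Q1 revenue growth driven by retail loans."],
    llm=lambda prompt: prompt,
)
print(prompt_seen)
```

Passing the model in as a callable keeps the sketch independent of any particular LLM client; either retrieval source can also be left empty to fall back to plain VectorRAG or GraphRAG.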

Key Findings and Results

To evaluate the effectiveness of HybridRAG, experiments were conducted using earnings call transcripts from Nifty 50 companies.

The performance of HybridRAG was compared against traditional VectorRAG and GraphRAG methods using metrics such as Faithfulness, Answer Relevance, Context Precision, and Context Recall.

Results Summary

Faithfulness: Both GraphRAG and HybridRAG demonstrated high faithfulness, scoring 0.96, compared to 0.94 for VectorRAG.
Answer Relevance: HybridRAG outperformed the others with a score of 0.96, followed by VectorRAG at 0.91 and GraphRAG at 0.89.
Context Precision: GraphRAG excelled with a score of 0.96, while HybridRAG scored 0.79 and VectorRAG scored 0.84.
Context Recall: Both VectorRAG and HybridRAG achieved perfect scores of 1, while GraphRAG scored 0.85.
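
To build intuition for the two context metrics, the sketch below uses a simplified set-based definition of precision and recall over retrieved items. This is an illustrative stand-in, not the evaluation procedure behind the scores above, and the item identifiers are made up. It shows one way that merging two retrieval sources can raise recall while diluting precision.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant (simplified definition)."""
    if not retrieved:
        return 0.0
    return sum(1 for item in retrieved if item in relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved (simplified definition)."""
    if not relevant:
        return 0.0
    return sum(1 for item in relevant if item in retrieved) / len(relevant)


# Toy data: two relevant items, one irrelevant item returned by vector search.
relevant = {"r1", "r2"}
vector_ctx = ["r1", "x1"]
graph_ctx = ["r2"]
hybrid_ctx = vector_ctx + graph_ctx

for name, ctx in [("vector", vector_ctx), ("graph", graph_ctx), ("hybrid", hybrid_ctx)]:
    precision = context_precision(ctx, relevant)
    recall = context_recall(ctx, relevant)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the hybrid context recovers both relevant items but also carries over the irrelevant one, so its precision falls below that of the graph-only context.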

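Taken together, these results show a trade-off: HybridRAG ties for the best faithfulness (0.96) and context recall (1) and leads on answer relevance (0.96), but it has the lowest context precision (0.79), behind VectorRAG (0.84) and GraphRAG (0.96).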

Implications and Applications

The implications of HybridRAG extend beyond financial analysis.

A system capable of understanding and responding to nuanced queries about complex financial documents paves the way for more sophisticated AI-assisted decision-making tools.

These tools could democratize access to financial insights, allowing a broader range of stakeholders to engage with and understand financial information.

Real-World Applications

Financial Analysis: Enhancing the efficiency and accuracy of financial analysts by quickly gathering relevant data and identifying market trends.
Risk Management: Improving risk management by identifying hidden relationships and supporting advanced predictive analytics.
Automated Reporting: Facilitating automated report generation, sentiment analysis, and market trend predictions.

Conclusion

HybridRAG represents a significant advancement in information extraction from financial documents.

By integrating the strengths of Knowledge Graphs and vector retrieval techniques, it addresses the limitations of traditional RAG methods, providing more accurate and contextually relevant answers.

The potential applications of this approach are vast, from enhancing financial analysis to improving risk management and automated reporting.

Future directions include expanding the system to handle multi-modal inputs, incorporating numerical data analysis capabilities, and developing more sophisticated evaluation metrics.

The integration of real-time financial data streams could further enhance its utility in dynamic financial environments.

The journey of HybridRAG is just beginning, and its potential to transform financial analysis is immense.