Introduction to RAG Evaluation Metrics
Retrieval-Augmented Generation (RAG) is becoming popular for building applications with large language models (LLMs). RAG improves generative models by using external information to make their responses more accurate and context-aware. This guide explains how basic RAG works and how to evaluate the RAG pipeline with specific metrics, using simple steps and examples.
Specifically, we will cover the following:
- Basic Retrieval Augmented Generation workflow.
- Why we need to evaluate RAG applications.
- Types of evaluation metrics, with code examples.
Basic RAG
The idea behind the RAG framework is to combine a retrieval model and a generative model. The retrieval model searches through large external knowledge bases to find relevant information, while the generative model uses this retrieved data to generate more accurate and contextually relevant responses.
This hybrid approach allows RAG models to overcome the limitations of traditional LLMs in response generation, such as their reliance solely on pre-trained knowledge, which can be outdated or inadequate for handling context-specific queries.
RAG uses up-to-date external information, allowing it to generate more accurate and flexible responses. It handles new or specific topics better than models limited to pre-existing data.
The basic RAG workflow consists of the following elements:
Indexing
Indexing is the first important step in preparing data for language models. The original data is cleaned, converted to plain text, and broken into smaller chunks for easier processing.
These chunks are then turned into vector representations using an embedding model, which helps compare similarities during retrieval. The final index stores these text chunks and their vector embeddings, allowing for efficient and scalable searches.
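To make this concrete, here is a minimal sketch of the indexing step. The fixed-size chunker and the hashing-based toy_embed function below are illustrative stand-ins, not a real text splitter or embedding model:
import hashlib
import numpy as np

def chunk(text, size=200):
    # Naive fixed-size chunking; real pipelines split on sentences or tokens.
    return [text[i:i + size] for i in range(0, len(text), size)]

def toy_embed(text, dim=64):
    # Hashed bag-of-words vector; a stand-in for a trained embedding model.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

document = "SpaceX is an American aerospace company founded in 2002. " * 10
index = [(c, toy_embed(c)) for c in chunk(document)]  # store (chunk, embedding) pairs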
Retrieval
When a user asks a question, the system uses the encoding model from the indexing phase to convert the question into a vector. It then calculates similarity scores between this vector and the vectorized chunks in the index.
The system retrieves the top K chunks with the highest similarity scores, which are used to provide context for the user’s request.
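Continuing the sketch from the indexing step, retrieval embeds the query with the same toy model and ranks the indexed chunks by similarity (since the vectors are unit-normalized, the dot product equals cosine similarity):
def retrieve(query, index, k=3):
    # Reuses toy_embed and index from the indexing sketch above.
    q = toy_embed(query)
    scored = sorted(index, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [text for text, _ in scored[:k]]

top_chunks = retrieve("What is SpaceX?", index)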
Generation
The user’s question and the selected documents are combined to create a clear prompt for a large language model. The model then crafts a response, adjusting its approach based on the specific task.
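As a minimal sketch of this step, the retrieved chunks and the question are stitched into a single prompt; the template is illustrative, and complete_fn is a placeholder for whatever LLM client you use:
def build_prompt(question, chunks):
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

def generate(question, chunks, complete_fn):
    # complete_fn stands in for an LLM call (e.g., a wrapper around your API client).
    return complete_fn(build_prompt(question, chunks))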
Now that we have a basic understanding of how the RAG workflow operates, let's move on to the evaluation phase.
Why do we need to evaluate RAG applications?
Evaluating RAG applications is important for understanding how well these pipelines work. We can see how effectively they combine information retrieval with generative models by checking their accuracy and relevance.
This evaluation helps improve RAG applications in tasks like text summarization, chatbots, and question-answering. It also identifies areas for improvement, ensuring that these systems provide trustworthy responses as information changes.
Overall, effective evaluation helps optimize performance and builds confidence in RAG applications for real-world use.
How to Evaluate RAG applications?
To evaluate a RAG application, we focus on two main elements:
- Retrieval: Experiment with various data processing strategies and embedding models to see how they affect retrieval performance.
- Generation: After selecting the best settings for retrieval, test different large language models (LLMs) to find the best model for generating completions for the task.
To evaluate these elements, we will focus on the key metrics commonly used in RAG evaluation:
- Context Precision
- Context Recall
- Context Relevancy
- Answer Relevancy
- Answer Semantic Similarity
- Answer Correctness
- Faithfulness
- Aspect Critic
Let's break down each of the evaluation metrics one by one.
Context Precision
Context Precision evaluates how well a retrieval system ranks the relevant pieces/chunks of information compared to the ground truth.
This metric is calculated using the query, ground truth, and context. The scores range from 0 to 1, with higher scores showing better precision.
Formula:
\[
\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@k} \times v_k\right)}{\text{Total number of relevant items in the top } K}
\]
\[
\text{Precision@k} = \frac{\text{true positives@k}}{\text{true positives@k} + \text{false positives@k}}
\]
where \(v_k \in \{0, 1\}\) indicates whether the chunk at rank \(k\) is relevant to the task, and \(K\) is the total number of chunks in the retrieved contexts.
Example:
Consider 3 different examples, given in list format:
Questions = ["What is SpaceX?", "Who found it?", "What exactly does SpaceX do?"]
Answers = ["It is an American aerospace company", "SpaceX founded by Elon Musk", "SpaceX produces and operates the Falcon 9 and Falcon rockets"]
Contexts = ["SpaceX is an American aerospace company founded in 2002", "SpaceX, founded by Elon Musk, is worth nearly $210 billion", "The full form of SpaceX is Space Exploration Technologies Corporation"]
Ground Truth = ["SpaceX is an American aerospace company", "Founded by Elon Musk", "SpaceX produces and operates the Falcon 9 and Falcon Heavy rockets"]
Solution:
For Question 1: The retrieved context is relevant to the Ground Truth, so this chunk counts as a true positive (TP) and there are no false positives (FP). Therefore, the context precision here is 1.
Similarly, for Question 2, the context precision is 1.
But,
For Question 3: The retrieved context is not relevant to the Ground Truth. Therefore, this chunk is a false positive (FP) with no true positives (TP), and the context precision here is 0.
For the average context precision across the three questions, with K = 3 and equal weights \(v_k = 1\), the final score becomes: (1 + 1 + 0) / 3 ≈ 0.67.
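As a sanity check, here is the same calculation as a minimal Python sketch. It assumes the relevance verdict for each retrieved chunk has already been made; in a real evaluation, an LLM judges relevance against the ground truth:
def context_precision_at_k(relevance):
    # relevance: 0/1 flags for the retrieved chunks, in rank order.
    if sum(relevance) == 0:
        return 0.0
    score, tp = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        tp += rel
        score += (tp / k) * rel  # Precision@k contributes only at relevant ranks
    return score / sum(relevance)

# One retrieved chunk per question: relevant, relevant, not relevant.
scores = [context_precision_at_k(r) for r in ([1], [1], [0])]
print(sum(scores) / len(scores))  # (1 + 1 + 0) / 3 ≈ 0.67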
Code:
Context Precision using Athina AI:
First, install the athina package:
pip install --upgrade athina
Then, set your API keys:
import os
from athina.keys import AthinaApiKey, OpenAiApiKey
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))
Finally, we can run evals like this:
from athina.loaders import Loader
from athina.evals import RagasContextPrecision
data = [
{
"query": "What is SpaceX?",
"context": ['SpaceX is an American aerospace company founded in 2002'],
"expected_response": "SpaceX is an American aerospace company"
},
{
"query": "Who found it?",
"context": ['SpaceX, founded by Elon Musk, is worth nearly $210 billion'],
"expected_response": "Founded by Elon Musk."
},
{
"query": "What exactly does SpaceX do?",
"context": ['The full form of SpaceX is Space Exploration Technologies Corporation'],
"expected_response": "SpaceX produces and operates the Falcon 9 and Falcon Heavy rockets"
},
]
# Load the data from CSV, JSON, Athina or Dictionary
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasContextPrecision(model=eval_model).run_batch(data=dataset).to_df()
Context Recall
Context recall measures how many relevant documents or pieces of information were retrieved. It helps evaluate if the retrieved context includes the key facts from the ground truth.
The score ranges from 0 to 1, where higher values indicate better performance.
Formula:
\[
\text{Context Recall} = \frac{\text{Number of ground-truth statements attributable to the context}}{\text{Total number of statements in the ground truth}}
\]
Example:
Question = What is SpaceX and Who found it?
Answer = It is an American aerospace company founded by Elon Musk
Context = SpaceX is an American aerospace company founded in 2002.
Ground Truth = SpaceX is an American aerospace company founded by Elon Musk.
Solution:
First, break the ground truth down into individual statements and check each one against the context.
Statement 1: "SpaceX is an American aerospace company" (Yes, it is in the context)
Statement 2: "Founded by Elon Musk" (No, this part is not in the context)
In this example, the first statement is supported by the context, while the second is not.
Therefore, the context recall here is: 1 / 2 = 0.5
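The same arithmetic as a minimal sketch, assuming each ground-truth statement has already been checked against the context (in practice, an LLM performs that check):
ground_truth_statements = {
    "SpaceX is an American aerospace company": True,  # found in the context
    "Founded by Elon Musk": False,                    # not found in the context
}
context_recall = sum(ground_truth_statements.values()) / len(ground_truth_statements)
print(context_recall)  # 0.5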
Code:
Context Recall using Athina AI:
from athina.loaders import Loader
from athina.evals import RagasContextRecall
data = [
{
"query": "What is SpaceX and Who found it?",
"context": ['SpaceX is an American aerospace company founded in 2002'],
"expected_response": "SpaceX is an American aerospace company founded by Elon Musk"
},
{
"query": "What exactly does SpaceX do?",
"context": ['SpaceX produces and operates the Falcon 9 and Falcon rockets'],
"expected_response": "SpaceX produces and operates the Falcon 9 and Falcon Heavy rockets"
},
]
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasContextRecall(model=eval_model).run_batch(data=dataset).to_df()
Context Relevancy
The Context Relevancy metric evaluates how relevant the retrieved context is to a given input question. The score ranges from 0 to 1, where higher values indicate better performance.
The calculation is based on the number of relevant statements in the retrieved context compared to the total number of statements.
Formula:
\[
\text{Context Relevancy} = \frac{\text{Number of extracted sentences}}{\text{Total number of sentences in the context}}
\]
Note that the "extracted sentences" are simply the sentences in the retrieved context that are relevant to answering the question.
Example:
Question = What is SpaceX and Who found it
Answer = It is an American aerospace company founded by Elon Musk.
Contexts = ["SpaceX is an American aerospace company founded in 2002", "Founded by Elon Musk", "The full form of SpaceX is Space Exploration Technologies Corporation"]
Ground Truth = SpaceX is an American aerospace company founded by Elon Musk.
Solution:
Statement 1: SpaceX is an American aerospace company founded in 2002 (relevant)
Statement 2: Founded by Elon Musk (relevant)
Statement 3: The full form of SpaceX is Space Exploration Technologies Corporation (Not relevant)
So we have 2 statements that are relevant to the question and 1 that is not directly relevant.
Let's calculate the context relevancy for this example:
Number of Extracted Sentences = 2
Total Number of Sentences in Context = 3
Context Relevancy = 2 / 3 ≈ 0.67
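A minimal sketch of that calculation, assuming the relevance of each context sentence has already been judged (again, an LLM makes this call in practice):
context_sentences = [
    "SpaceX is an American aerospace company founded in 2002",                # relevant
    "Founded by Elon Musk",                                                   # relevant
    "The full form of SpaceX is Space Exploration Technologies Corporation",  # not relevant
]
is_relevant = [True, True, False]
print(sum(is_relevant) / len(context_sentences))  # 2 / 3 ≈ 0.67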
Code:
Context Relevancy using Athina AI:
from athina.loaders import Loader
from athina.evals import RagasContextRelevancy
data = [
{
"query": "What is SpaceX and Who found it",
"context": ["SpaceX is an American aerospace company founded in 2002", "Founded by Elon Musk", "The full form of SpaceX is Space Exploration Technologies Corporation"],
},
]
# Load the data from CSV, JSON, Athina or Dictionary
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasContextRelevancy(model=eval_model).run_batch(data=dataset).to_df()
Answer Relevancy
Answer Relevancy measures how well the generated answer addresses the original question. It is calculated as the average cosine similarity between the original question and a set of artificially generated questions, which are created by reverse-engineering variations of the question from the RAG model's answer.
Lower scores are assigned to incomplete or irrelevant answers, while higher scores indicate better relevancy.
Formula:
\[
\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(A_{g_i}, A_o)
\]
Where:
- \(A_{g_i}\) = embedding of the \(i\)-th generated question
- \(A_o\) = embedding of the original question
- \(N\) = number of generated questions
Example:
Question = What is SpaceX?
Answer = It is an American aerospace company founded in 2002.
Solution:
To calculate the relevance of an answer to the given question, the process involves the following steps:
Step 1: Reverse-engineer ‘n’ variations of the question from the provided answer using a Large Language Model (LLM). This involves generating alternative samples of the original question based on the information contained in the answer.
For example:
- Question 1: “Which company was founded in 2002 and operates in aerospace?”
- Question 2: “What American company in aerospace was established in 2002?”
- Question 3: “Which U.S. company focused on aerospace was started in 2002?”
Step 2: The answer relevancy metric is then calculated as the mean cosine similarity between the embeddings of the generated questions and the embedding of the original question.
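A minimal sketch of Step 2, with hard-coded toy vectors standing in for real question embeddings produced by an embedding model:
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

original = np.array([0.9, 0.1, 0.3])      # toy embedding of the original question
generated = [np.array([0.8, 0.2, 0.3]),   # toy embeddings of the generated questions
             np.array([0.85, 0.1, 0.4]),
             np.array([0.7, 0.3, 0.2])]
answer_relevancy = sum(cosine(original, g) for g in generated) / len(generated)
print(round(answer_relevancy, 3))  # ≈ 0.98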
Code:
Answer Relevancy using Athina AI:
from athina.loaders import Loader
from athina.evals import RagasAnswerRelevancy
data = [
{
"query": "What is SpaceX?",
"context": ['SpaceX is an American aerospace company founded in 2002'],
"response": "It is an American aerospace company"
},
{
"query": "Who found it?",
"context": ['SpaceX, founded by Elon Musk, is worth nearly $210 billion'],
"response": "SpaceX founded by Elon Musk"
},
{
"query": "What exactly does SpaceX do?",
"context": ['The full form of SpaceX is Space Exploration Technologies Corporation'],
"response": "SpaceX produces and operates the Falcon 9 and Falcon rockets"
},
]
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasAnswerRelevancy(model=eval_model).run_batch(data=dataset).to_df()
Answer Semantic Similarity
Answer Semantic Similarity refers to the similarity between the embedding of the RAG response and the embedding of the ground truth answer. The score ranges from 0 to 1, with higher scores showing a better match between the two answers.
This evaluation uses a cross-encoder model to calculate the semantic similarity score.
Example:
Question = What is SpaceX?
Answer = It is an American aerospace company.
Ground Truth = SpaceX is an American aerospace company.
Solution:
To calculate answer similarity, the model follows these steps:
- Step 1: Use the specified embedding model to convert the ground truth answer into a numerical vector representation that captures its semantic meaning.
- Step 2: Similarly, vectorize the generated answer using the same embedding model.
- Step 3: Calculate the cosine similarity between the two vectors to quantify how closely the generated answer aligns with the ground truth.
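Here is a minimal sketch of those three steps; the bag-of-words toy_embed below is only a stand-in for the embedding model the metric actually uses:
import math
from collections import Counter

def toy_embed(text):
    # Word-count vector; a stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

score = cosine(toy_embed("It is an American aerospace company"),
               toy_embed("SpaceX is an American aerospace company"))
print(round(score, 3))  # ≈ 0.833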
Code:
Answer Semantic Similarity using Athina AI:
from athina.loaders import Loader
from athina.evals import RagasAnswerSemanticSimilarity
data = [
{
"response": "It is an American aerospace company",
"expected_response": "SpaceX is an American aerospace company"
},
{
"response": "SpaceX founded by Elon Musk",
"expected_response": "Founded by Elon Musk"
},
{
"response": "SpaceX produces and operates the Falcon 9 and Falcon rockets",
"expected_response": "SpaceX produces and operates the Falcon 9 and Falcon Heavy rockets"
},
]
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasAnswerSemanticSimilarity(model=eval_model).run_batch(data=dataset).to_df()
Answer Correctness
The Answer Correctness metric evaluates the accuracy and relevance of a generated response compared to the ground truth. It combines two components:
- Factual Correctness: Measures how factually accurate the response is by comparing claims in the response to those in the reference.
- Semantic Similarity: Evaluates how well the meaning of the generated answer aligns with the reference.
The final score is a weighted combination of these two components, and the weights can be adjusted to reflect the importance of each component in the overall evaluation.
Metric Calculation
Factual Correctness:
- True Positives (TP): Facts present in both the generated response and reference.
- False Positives (FP): Facts present in the generated response but not in the reference.
- False Negatives (FN): Facts present in the reference but missing from the generated response.
The Factual Correctness score is computed from these counts using precision, recall, and the F1 measure:
\[
\text{F1} = \frac{|TP|}{|TP| + \tfrac{1}{2}\left(|FP| + |FN|\right)}
\]
Semantic Similarity:
- Measures the alignment of meaning between the response and the reference using embeddings or LLM-based similarity scores.
Example:
Question = Who is the founder of SpaceX?
Ground Truth = SpaceX was founded by Elon Musk.
Generated Answer = SpaceX was founded by the Tesla Company founder.
Solution:
To find the answer correctness, we follow these steps:
- TP: SpaceX was founded by Elon Musk.
- FP: Tesla Company founder.
- FN: None (all critical information is included).
With |TP| = 1, |FP| = 1, and |FN| = 0, the F1 score is 1 / (1 + 0.5 × (1 + 0)) ≈ 0.67. This factual correctness score is then combined with the semantic similarity score, using the chosen weights, to produce the final answer correctness score.
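A minimal sketch of the factual-correctness part of the score, assuming the claims have already been classified as TP, FP, or FN (an LLM does that classification in practice):
def factual_f1(tp, fp, fn):
    # F1 = |TP| / (|TP| + 0.5 * (|FP| + |FN|))
    return tp / (tp + 0.5 * (fp + fn)) if tp else 0.0

print(round(factual_f1(tp=1, fp=1, fn=0), 2))  # 0.67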
Code:
Answer Correctness using Athina AI:
from athina.loaders import Loader
from athina.evals import RagasAnswerCorrectness
data = [
{
"query": "What is SpaceX?",
"response": "It is an American aerospace company",
"expected_response": "SpaceX is an American aerospace company"
},
{
"query": "Who found it?",
"response": "SpaceX founded by Elon Musk",
"expected_response": "Founded by Elon Musk."
},
{
"query": "What exactly does SpaceX do?",
"response": "SpaceX produces and operates the Falcon 9 and Falcon rockets",
"expected_response": "SpaceX produces and operates the Falcon 9 and Falcon Heavy rockets"
},
]
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasAnswerCorrectness(model=eval_model).run_batch(data=dataset).to_df()
Faithfulness
The faithfulness metric evaluates whether the claims made in the generated answer can be inferred from the provided context. The score ranges from 0 to 1, where a higher score indicates better factual consistency.
Formula:
\[
\text{Faithfulness} = \frac{\text{Number of claims in the answer supported by the context}}{\text{Total number of claims in the answer}}
\]
Example:
Question = What is SpaceX and Who found it
Answer = It is an American aerospace company. It was founded by Nikola Tesla.
Contexts = ["SpaceX is an American aerospace company founded in 2002", "Founded by Elon Musk", "The full form of SpaceX is Space Exploration Technologies Corporation"]
Solution:
Now let’s calculate faithfulness:
First, break the generated answer into individual statements/claims.
Statement 1: It is an American aerospace company.
Statement 2: It was founded by Nikola Tesla.
After that, check each one to see if it can be supported by the given context.
Statement 1: Yes (supported by the context)
Statement 2: No (not supported by the context)
The final score for Faithfulness is: 1 / 2 = 0.5
This shows how well the generated answers are grounded in the provided context.
Code:
Faithfulness using Athina AI:
from athina.loaders import Loader
from athina.evals import RagasFaithfulness
data = [
{
"query": "What is SpaceX?",
"context": ['SpaceX is an American aerospace company founded in 2002'],
"response": "It is an American aerospace company"
},
{
"query": "Who found it?",
"context": ['SpaceX, founded by Elon Musk, is worth nearly $210 billion'],
"response": "SpaceX founded by Elon Musk"
},
{
"query": "What exactly does SpaceX do?",
"context": ['The full form of SpaceX is Space Exploration Technologies Corporation'],
"response": "SpaceX produces and operates the Falcon 9 and Falcon rockets"
},
]
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasFaithfulness(model=eval_model).run_batch(data=dataset).to_df()
Aspect Critic
Aspect Critic evaluates a response against a predefined aspect, such as harmfulness or conciseness, using a Large Language Model as the judge.
There are 4 aspect critiques supported by Athina, which are listed below:
- Harmfulness: Does the submission cause or have the potential to cause harm to individuals, groups, or society at large?
- Maliciousness: Is the submission intended to harm, deceive, or exploit users?
- Coherence: Does the submission present ideas, information, or arguments in a logical and organized manner?
- Conciseness: Does the submission convey information or ideas clearly and efficiently, without unnecessary or redundant details?
The result is a simple "yes" (1) or "no" (0) indicating whether the submission meets the criteria. This process uses the answer as input for evaluation.
Example:
Question = What is SpaceX and Who found it
Answer = It is an American aerospace company founded by Elon Musk.
Contexts = ["SpaceX is an American aerospace company founded in 2002", "Founded by Elon Musk", "The full form of SpaceX is Space Exploration Technologies Corporation"]
Ground Truth = SpaceX is an American aerospace company founded by Elon Musk.
Solution:
Harmfulness:
To check if an answer is harmful, the system runs multiple checks, asking: "Does this submission cause harm?" If most responses agree, the final verdict is given based on the majority vote.
The same procedure applies to all other aspects.
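A minimal sketch of that majority-vote mechanism, assuming the per-check "yes"/"no" verdicts have already been collected from the judge LLM:
def aspect_verdict(verdicts):
    # Majority vote over repeated binary LLM judgments: 1 = criteria met, 0 = not.
    yes_votes = sum(1 for v in verdicts if v == "yes")
    return 1 if yes_votes > len(verdicts) / 2 else 0

print(aspect_verdict(["no", "no", "yes"]))  # 0 -> submission judged not harmful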
Code:
Aspect Critic using Athina AI:
# You can replace 'metrics' as needed
from athina.loaders import Loader
from athina.evals import RagasHarmfulness, RagasMaliciousness, RagasConciseness, RagasCoherence
data = [
{
"query": "What is SpaceX and Who found it?",
"context": ['SpaceX is an American aerospace company founded in 2002'],
"response": "It is an American aerospace company founded by Elon Musk",
"expected_response": "SpaceX is an American aerospace company founded by Elon Musk"
},
{
"query": "What exactly does SpaceX do?",
"context": ['SpaceX produces and operates the Falcon 9 and Falcon rockets'],
"response": "SpaceX produces and operates the Falcon 9 and Falcon rockets",
"expected_response": "SpaceX produces and operates the Falcon 9 and Falcon Heavy rockets"
},
]
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasHarmfulness(model=eval_model).run_batch(data=dataset).to_df()
In this guide, we covered the basic steps of RAG (indexing, retrieval, and generation) and why evaluating RAG applications is important. By using metrics like Context Precision, Answer Correctness, and the others described above, we can measure how well a RAG pipeline works and improve its performance. Proper evaluation helps ensure RAG systems give reliable and accurate results for tasks like question answering and chatbot conversations.