The Answer Correctness metric evaluates the accuracy of a generated answer compared to the ground truth. It is calculated as a weighted sum of two main components:
- Factual Correctness: This evaluates the factual accuracy of the generated answer compared to the ground truth.
- Semantic Similarity: This checks how closely the meaning of the generated answer aligns with the ground truth.
The weights of these two components can be adjusted to reflect the importance of each in the overall evaluation.
Example:
Question = Who is the founder of SpaceX?
Ground Truth = SpaceX was founded by Elon Musk.
Generated Answers = "SpaceX was founded by Elon Musk"; "Founded by Tesla Company founder"; "SpaceX was founded by Nikola Tesla".
Solution:
To find the answer correctness, we follow these steps:
Factual Correctness:
True Positives (TP): Facts in both the generated answer and ground truth.
Here TP = SpaceX was founded by Elon Musk
False Positives (FP): Facts in the generated answer but not in the ground truth
Here FP = Founded by Tesla Company founder
False Negatives (FN): Facts in the ground truth but not in the generated answer.
Example: for the answer "SpaceX was founded by Nikola Tesla", the ground-truth fact is missing, so FN = SpaceX was founded by Elon Musk.
We can then use the F1 score formula to calculate factual correctness from TP, FP, and FN:
F1 = |TP| / (|TP| + 0.5 × (|FP| + |FN|))
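As a quick sketch, the F1 computation for the SpaceX example above can be reproduced with plain arithmetic (the fact counts below are the ones assumed in the worked example, not outputs of any library):

```python
# Hypothetical fact counts for the SpaceX example above.
tp = 1  # "SpaceX was founded by Elon Musk" appears in both answer and ground truth
fp = 1  # "Founded by Tesla Company founder" appears only in the answer
fn = 0  # every ground-truth fact is covered by the answer

# F1 = |TP| / (|TP| + 0.5 * (|FP| + |FN|)); defined as 0 when TP is empty
f1 = tp / (tp + 0.5 * (fp + fn)) if tp else 0.0
print(round(f1, 3))  # 0.667
```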
Semantic Similarity:
After that, we calculate how closely the generated answer matches the ground truth in meaning using semantic similarity (see the Answer Semantic Similarity metric).
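In practice this step embeds both texts with a sentence-embedding model and compares the vectors with cosine similarity. A minimal sketch of the comparison itself, using toy 3-dimensional vectors in place of real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only; a real pipeline would encode the
# generated answer and the ground truth with an embedding model.
answer_vec = [0.9, 0.1, 0.3]
truth_vec = [0.8, 0.2, 0.4]
print(round(cosine_similarity(answer_vec, truth_vec), 3))  # 0.984
```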
Answer Correctness:
Once we have the semantic similarity, we take the weighted average of the semantic similarity and the factual correctness score to calculate the final score.
The formula looks like this:
Answer Correctness = Wf × Factual Correctness + Ws × Semantic Similarity
where Wf and Ws are the weight parameters.
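The weighted sum can be sketched as below. The weights shown (0.75 / 0.25) are RAGAS' defaults to the best of my knowledge, and the semantic similarity value is an assumed score for illustration; the factual correctness value is the F1 from the worked example:

```python
# Sketch of the final weighted score.
w_f, w_s = 0.75, 0.25         # Wf and Ws; assumed RAGAS default weights
factual_correctness = 0.667   # F1 score from the factual-correctness step
semantic_similarity = 0.84    # assumed embedding-similarity score

answer_correctness = w_f * factual_correctness + w_s * semantic_similarity
print(round(answer_correctness, 3))  # 0.71
```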
Code:
Answer Correctness using RAGAS:
from datasets import Dataset
from ragas.metrics import answer_correctness
from ragas import evaluate
data_samples = {
    'question': ['What is SpaceX?', 'Who founded it?', 'What exactly does SpaceX do?'],
    'answer': ['It is an American aerospace company', 'SpaceX founded by Elon Musk', 'SpaceX produces and operates the Falcon 9 and Falcon rockets'],
    'ground_truth': ['SpaceX is an American aerospace company', 'Founded by Elon Musk', 'SpaceX produces and operates the Falcon 9 and Falcon Heavy rockets']
}

dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[answer_correctness])
score.to_pandas()
Answer Correctness using Athina AI:
from athina.evals import RagasAnswerCorrectness
from athina.loaders import Loader
data = [
{
"query": "What is SpaceX?",
"response": "It is an American aerospace company",
"expected_response": "SpaceX is an American aerospace company"
},
{
"query": "Who found it?",
"response": "SpaceX founded by Elon Musk",
"expected_response": "Founded by Elon Musk."
},
{
"query": "What exactly does SpaceX do?",
"response": "SpaceX produces and operates the Falcon 9 and Falcon rockets",
"expected_response": "SpaceX produces and operates the Falcon 9 and Falcon Heavy rockets"
},
]
dataset = Loader().load_dict(data)
eval_model = "gpt-3.5-turbo"
RagasAnswerCorrectness(model=eval_model).run_batch(data=dataset).to_df()