Evaluating the New Claude 3.5 Sonnet Model in Athina AI

Image by Author

The new Claude 3.5 Sonnet model demonstrates significant improvements, particularly in mathematical reasoning, coding, visual understanding, and agentic tool use.

These enhancements make it better at handling complex problems and more versatile across various tasks. Users can expect improved accuracy with this version.

In this tutorial, we will walk you through evaluating the new Claude 3.5 Sonnet on a sample dataset in Athina AI.

Athina AI is a platform that helps teams build AI features for production. Athina makes it easy to prototype, evaluate, experiment, and monitor AI models all in one place, whether you're a technical or non-technical user.

Athina’s key features include Dynamic Columns, which let you run prompts, execute code, make API calls, and more, all within a spreadsheet-style interface. It also ships with over 50 built-in evaluation metrics, and you can create custom ones if needed.

Plus, Athina continuously tracks how your models are performing, helping you keep everything running smoothly in production.

For this tutorial, we will use the MMLU-Pro-updated sample dataset, which contains multiple-choice questions across various domains, including mathematics, history, and science.

MMLU-Pro is designed to challenge models with tasks requiring both factual knowledge and advanced reasoning, making it an excellent choice for evaluating the performance of new models like Claude 3.5 Sonnet.
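To make the dataset's shape concrete, here is a minimal sketch of what a single MMLU-Pro-style row might look like, assuming the common schema of a question, a list of lettered options, and an expected answer letter (the field names `question`, `options`, and `correct_answer` match the columns used later in this tutorial, but the example row itself is hypothetical):

```python
# A hypothetical MMLU-Pro-style row; the exact schema of the
# MMLU-Pro-updated sample may differ slightly.
sample_row = {
    "question": "What is the derivative of x**2?",
    "options": ["A) 2x", "B) x", "C) x**2", "D) 2"],
    "correct_answer": "A",
}

def format_options(options):
    """Join the option strings into a single prompt-friendly block."""
    return "\n".join(options)

print(format_options(sample_row["options"]))
```

Formatting the options as one newline-separated block is what lets a single `{{options}}` placeholder carry all the choices into the prompt.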

Now, let’s move on to the step-by-step model evaluation part.

Video by Author

Step 1: Generate Responses Using Claude 3.5 Sonnet

First, make sure that your dataset is properly formatted and loaded into the environment. Then, click on 'Add Column' and select the 'Run Prompt' option for the 'Select Column Type'.

image

After that, rename the column to 'sonnetnew_response'. Then, select 'claude-3-5-sonnet-20241022' as the language model.

image

Next, add the following prompt in the 'System' and click on 'Create New Column':

""" You are a helpful assistant. Given the following options, please answer the question. Provide an accurate answer without explanation.
Question: {{question}}
Options: {{options}}
Your Answer: """
image

This will start populating the responses generated by 'claude-3-5-sonnet-20241022' in the 'sonnetnew_response' column, as shown in the following image.
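If you prefer to reproduce this generation step outside the Athina interface, the same system prompt and template can be sent through Anthropic's Messages API. This is a sketch, not part of the Athina workflow: it assumes the `anthropic` Python package is installed, an `ANTHROPIC_API_KEY` environment variable is set, and the helper names (`build_user_prompt`, `get_response`) are our own:

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Given the following options, please answer "
    "the question. Provide an accurate answer without explanation."
)

def build_user_prompt(question, options):
    """Fill the same {{question}}/{{options}} template used in the Athina column."""
    return f"Question: {question}\nOptions: {options}\nYour Answer:"

def get_response(question, options):
    """Send one dataset row to Claude 3.5 Sonnet and return the raw answer text."""
    import anthropic  # pip install anthropic; requires ANTHROPIC_API_KEY

    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=64,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": build_user_prompt(question, options)}],
    )
    return message.content[0].text
```

Athina's 'Run Prompt' column does essentially this for every row of the spreadsheet, so the hosted workflow spares you the loop and the key management.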

image

Now let's move on to the next step.

Step 2: Evaluate the Generated Responses

Once the responses have been generated, go to the 'Evaluate' section and click on 'Add a Saved Evaluation'. Then, search for the 'response = expected' evaluation metric and select it.


image

Next, in the input section, select 'sonnetnew_response' as the response and 'correct_answer' as the expected response, as shown in the image.

image

Then, select the model as 'gpt-4-turbo', click 'Save', and then click on 'Run Evaluation'.

image

This will start evaluating the generated answers against the correct answers in the 'response = expected' column, as you can see in the following image.

image

And finally, we can see the overall evaluation score at the bottom of the evaluation results section.
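Athina's 'response = expected' metric uses an LLM grader (here, 'gpt-4-turbo') to judge each row, but the overall score it reports is just the fraction of rows that pass. A simplified local approximation, assuming plain string comparison rather than LLM grading, might look like this:

```python
def exact_match(response, expected):
    """Case-insensitive exact-match check after trimming whitespace.

    A crude stand-in for Athina's LLM-graded 'response = expected' metric,
    which can also accept semantically equivalent answers.
    """
    return response.strip().lower() == expected.strip().lower()

def evaluation_score(responses, expected_answers):
    """Fraction of responses that match their expected answers."""
    matches = sum(exact_match(r, e) for r, e in zip(responses, expected_answers))
    return matches / len(responses)

# Hypothetical results for illustration: 2 of 3 rows match.
score = evaluation_score(["A", "b", "C"], ["A", "B", "D"])
```

The LLM-graded version is more forgiving (for instance, it can match "A) 2x" against "A"), which is why the hosted metric is preferable for free-form answers.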

image

By following these steps, we have evaluated the performance of the Claude 3.5 Sonnet model on the sample MMLU-Pro dataset in Athina AI.

Conclusion

In this tutorial, we evaluated the Claude 3.5 Sonnet model using the MMLU-Pro sample dataset in the Athina AI IDE. We learned how to generate responses from the model and check its performance against the correct answers.

The evaluation process demonstrated how well the model handles complex reasoning tasks across various domains.

The improvements in mathematical reasoning, coding, visual understanding, and tool use make Claude 3.5 Sonnet a strong choice for a wide range of tasks.

We hope this guide has helped you understand how to evaluate AI models effectively using Athina AI.

Thank you for following along!
