
LLM evaluation too expensive? Here's how we solve this.

Athina AI

15 Feb 2024 — 2 min read

According to the latest and greatest research, the SoTA eval techniques use LLMs as evaluators. It's a little meta, but it works better than anything else.

But sometimes, they are too expensive to run in production. Here's how Athina solves this.

How to run LLM-graded evals in production

Without blowing up your OpenAI budget.

Here's how we help you solve this so you can still get model performance insights in production!

Sampling Percentage: Run evals on a sample of your inferences instead of all of them. For example, you can get similar insights by running evals on 100k inferences instead of 500k (a 20% sample). We have a configuration setting for this on Athina; see the documentation.
Max Evals per month: Set a maximum number of evaluations to run per month as a hard limit. This ensures your costs are always limited.
Filters: When you configure Evals on Athina, you can set filters. These filters will ensure that evals ONLY run on inferences that match the filters. For example, you could set a filter to only run evals on inferences where:
user query contains "refund"
language model is gpt-4
prompt slug is summarization/v3
Use a cheaper model: Many evals will work great with GPT-3.5 (though some will not). That shaves the cost down to 1/10 of GPT-4 Turbo. If you want to run evals on even cheaper models (Llama, Mistral, etc.), then reply to this email. We're working on it :)
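To make the first three controls concrete, here is a minimal Python sketch of how sampling, a monthly cap, and filters could combine into a single "should we run an eval on this inference?" check. This is not Athina's implementation or SDK. `EvalGate`, its fields, and the inference keys (`id`, `user_query`, `language_model`, `prompt_slug`) are hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Inference = Dict[str, str]  # hypothetical shape: a flat dict of string fields


@dataclass
class EvalGate:
    """Illustrative gate that decides whether an LLM-graded eval should run."""
    sampling_rate: float              # e.g. 0.2 means roughly 20% of inferences
    max_evals_per_month: int          # hard cap for the current month
    filters: List[Callable[[Inference], bool]] = field(default_factory=list)
    evals_run_this_month: int = 0     # the caller resets this each month

    def _sampled(self, inference_id: str) -> bool:
        # Hash the ID into [0, 1) so the same inference always gets the same decision.
        digest = hashlib.sha256(inference_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") / 2**32
        return bucket < self.sampling_rate

    def should_run(self, inference: Inference) -> bool:
        if self.evals_run_this_month >= self.max_evals_per_month:
            return False              # monthly budget exhausted
        if not all(f(inference) for f in self.filters):
            return False              # inference does not match every filter
        if not self._sampled(inference["id"]):
            return False              # not part of the sample
        self.evals_run_this_month += 1
        return True


# Filters mirroring the examples above (assumed field names).
example_filters = [
    lambda inf: "refund" in inf["user_query"].lower(),
    lambda inf: inf["language_model"] == "gpt-4",
    lambda inf: inf["prompt_slug"] == "summarization/v3",
]

gate = EvalGate(sampling_rate=0.2, max_evals_per_month=100_000, filters=example_filters)
```

The sketch checks the cap and the filters before sampling, so the cap only counts evals that actually run. Hashing the ID rather than calling a random number generator is one possible design choice: it makes the decision reproducible if the same inference is seen twice.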

Function Evals

We just shipped a whole new library of function evals. These evals do NOT use LLMs, which means they are deterministic and FREE. Here are some examples of the function evals we've just shipped:

Contains [All / Any / None]: Checks if response contains (or does not contain) some keywords.
Regex: Checks if response matches a regex pattern.
Contains Valid Link: Checks that a link exists AND is valid (not a hallucinated link)
Contains JSON: Checks if response contains JSON
Contains Email: Checks if response contains an email address.
Answer similarity: Similarity between the response and expected_response.
API Call: Bring your own eval. We'll hit your endpoint to run an eval.
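To show why these checks are deterministic and free, here is a rough Python sketch of a few of them using only the standard library. These are not the functions from Athina's SDK. The names and the exact matching rules (case-insensitive keywords, a simplified email pattern) are assumptions made for illustration.

```python
import json
import re
from typing import Iterable

# Simplified email pattern, good enough for a sketch (not RFC 5322 compliant).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_any(response: str, keywords: Iterable[str]) -> bool:
    """True if at least one keyword appears in the response (case-insensitive)."""
    text = response.lower()
    return any(k.lower() in text for k in keywords)


def contains_all(response: str, keywords: Iterable[str]) -> bool:
    """True if every keyword appears in the response (case-insensitive)."""
    text = response.lower()
    return all(k.lower() in text for k in keywords)


def contains_none(response: str, keywords: Iterable[str]) -> bool:
    """True if no keyword appears in the response."""
    return not contains_any(response, keywords)


def matches_regex(response: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the response."""
    return re.search(pattern, response) is not None


def contains_json(response: str) -> bool:
    """True if a JSON object or array can be parsed starting at some '{' or '['."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(response):
        if ch in "{[":
            try:
                decoder.raw_decode(response, i)
                return True
            except json.JSONDecodeError:
                continue
    return False


def contains_email(response: str) -> bool:
    """True if the response contains something that looks like an email address."""
    return EMAIL_RE.search(response) is not None
```

Note that this `contains_json` sketch also accepts short bracketed text such as a citation marker like `[1]`, since that is a valid JSON array. A production check would likely need a stricter rule.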

You can set these evals up on our UI or through our open-source SDK.

View the documentation here.

Want Early Access to Our Latest Features?

We're working on some exciting new things.

Hint: It rhymes with shine-tuning.

If you want early access, or want to see what we're doing, email me at shiv@athina.ai.

Cheers,

Shiv Sakhuja
