LLM evaluation too expensive? Here's how we solve this.

Photo by Google DeepMind / Unsplash


According to recent research, the state-of-the-art (SoTA) eval techniques use LLMs as evaluators. It's a little meta, but it works better than anything else.

The catch: LLM-graded evals are sometimes too expensive to run in production. Here's how Athina solves this.

How to run LLM-graded evals in production without blowing up your OpenAI budget

Here's how we help you keep costs under control so you can still get model performance insights in production:

  • Sampling Percentage: You can often get similar insights by running evals on a sample of your traffic, for example 100k inferences instead of all 500k. We have a configuration setting for this on Athina. See the documentation.
  • Max Evals per month: Set a hard limit on the number of evaluations to run per month. This ensures your costs never exceed a known ceiling.
  • Filters: When you configure evals on Athina, you can set filters. These filters ensure that evals ONLY run on inferences that match them. For example, you could set a filter to only run evals on inferences where:
      • user query contains "refund"
      • language model is gpt-4
      • prompt slug is summarization/v3
  • Use a cheaper model: Many evals work great with GPT-3.5 (though some do not). That shaves the cost down to 1/10 of GPT-4 Turbo. If you want to run evals on even cheaper models (Llama, Mistral, etc.), reply to this email. We're working on it :)
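
To make the first three controls concrete, here is a minimal sketch of how sampling, a monthly cap, and filters could combine into one "should this eval run?" decision. This is not Athina's implementation or SDK: `Inference`, `EvalGate`, and every field name are hypothetical, and combining the filters with AND is an assumption.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Inference:
    """Hypothetical record of one logged LLM call."""
    user_query: str
    language_model: str
    prompt_slug: str


@dataclass
class EvalGate:
    """Hypothetical gate deciding whether an LLM-graded eval should run."""
    sampling_rate: float        # e.g. 0.2 evaluates roughly 20% of matching traffic
    max_evals_per_month: int    # hard cap on eval runs
    filters: List[Callable[[Inference], bool]] = field(default_factory=list)
    rng: random.Random = field(default_factory=random.Random)
    evals_run_this_month: int = 0   # a real system would reset this each month

    def should_run(self, inference: Inference) -> bool:
        # 1. Hard limit: stop once the monthly budget is spent.
        if self.evals_run_this_month >= self.max_evals_per_month:
            return False
        # 2. Filters: every filter must match (AND is an assumption here).
        if not all(matches(inference) for matches in self.filters):
            return False
        # 3. Sampling: keep only a random fraction of what is left.
        if self.rng.random() >= self.sampling_rate:
            return False
        self.evals_run_this_month += 1
        return True


# Example configuration mirroring the filters listed above.
gate = EvalGate(
    sampling_rate=0.2,
    max_evals_per_month=100_000,
    filters=[
        lambda i: "refund" in i.user_query.lower(),
        lambda i: i.language_model == "gpt-4",
        lambda i: i.prompt_slug == "summarization/v3",
    ],
)
```

The checks are ordered so that the cheap, deterministic ones (cap, filters) run before the random draw. The savings also multiply: going by the figures above, sampling 100k of 500k inferences (20%) with a model that costs 1/10 as much brings the eval bill to about 0.2 × 0.1 = 2% of evaluating everything with GPT-4 Turbo.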

Function Evals

We just shipped a whole new library of function evals. These evals do NOT use LLMs, which means they are deterministic and FREE. Here are some examples of the function evals we've just shipped:

  • Contains [All / Any / None]: Checks if response contains (or does not contain) some keywords.
  • Regex: Checks if response matches a regex pattern.
  • Contains Valid Link: Checks that a link exists AND is valid (not a hallucinated link).
  • Contains JSON: Checks if response contains JSON.
  • Contains Email: Checks if response contains an email address.
  • Answer Similarity: Measures the similarity between the response and expected_response.
  • API Call: Bring your own eval. We'll hit your endpoint to run an eval.
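
To show why these checks are deterministic and free, here is a rough sketch of what a few of them could look like in plain Python. These are illustrative stand-ins, not the functions from Athina's SDK: all names and the email pattern are assumptions. The link, similarity, and API-call evals are left out because they need network access or extra design choices.

```python
import json
import re
from typing import Iterable

# Deliberately simple pattern (an assumption); real email validation is stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_any(response: str, keywords: Iterable[str]) -> bool:
    """True if at least one keyword appears (case-insensitive)."""
    text = response.lower()
    return any(k.lower() in text for k in keywords)


def contains_all(response: str, keywords: Iterable[str]) -> bool:
    """True if every keyword appears (case-insensitive)."""
    text = response.lower()
    return all(k.lower() in text for k in keywords)


def contains_none(response: str, keywords: Iterable[str]) -> bool:
    """True if no keyword appears (case-insensitive)."""
    return not contains_any(response, keywords)


def matches_regex(response: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the response."""
    return re.search(pattern, response) is not None


def contains_email(response: str) -> bool:
    """True if something shaped like an email address appears."""
    return EMAIL_RE.search(response) is not None


def contains_json(response: str) -> bool:
    """True if a parseable JSON object or array starts anywhere in the text.

    Note: a bracketed fragment such as "[1]" counts as a JSON array.
    """
    decoder = json.JSONDecoder()
    for index, char in enumerate(response):
        if char in "{[":
            try:
                decoder.raw_decode(response, index)
                return True
            except json.JSONDecodeError:
                continue
    return False
```

Each function is a pure check on the response string, so the same input always gives the same result and no model call is involved.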

You can set these evals up on our UI or through our open-source SDK.

View the documentation here.


Want Early Access to Our Latest Features?

We're working on some exciting new things.

Hint: It rhymes with shine-tuning.

If you want early access, or want to see what we're doing, email me at shiv@athina.ai.

Cheers,

Shiv Sakhuja
