Original Paper: https://arxiv.org/abs/2408.02666
By: Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li
Abstract:
Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
Summary Notes
Revolutionizing Model Evaluation: The Self-Taught Evaluator Approach
Introduction
In the fast-paced world of AI and machine learning, evaluating the performance of large language models (LLMs) is crucial for their development and refinement. Traditional methods rely heavily on human-generated data, which is both costly and time-consuming. But what if we could bypass this bottleneck? Enter the world of Self-Taught Evaluators – a novel approach that uses synthetic data to train evaluators without human annotations. This method not only promises to cut down costs but also aims to keep up with the rapid advancements in LLMs.
The Self-Taught Evaluator: A Paradigm Shift
The Challenge
LLMs have made remarkable strides in recent years, but evaluating them remains a significant hurdle. The standard approach collects human preference judgments over model responses, data that is costly to obtain and becomes stale as models improve. This dependency on human annotation also makes it hard to scale evaluation to new tasks or criteria.
The Solution
The Self-Taught Evaluator is an iterative self-training approach that eliminates the need for human-annotated preferences, relying purely on synthetically generated data. The process generates contrasting model outputs, uses an LLM-as-a-Judge to produce reasoning traces and final judgments, and repeats this training at each iteration using the improved predictions.
Methodology: Building the Self-Taught Evaluator
Initialization
The process begins with a large pool of human-written user instructions and a seed LLM. The instructions cover a range of skills such as general knowledge, reasoning, coding, safety, and mathematics.
Instruction Selection
To ensure high-quality synthetic responses and judgments, a subset of instructions is selected from the uncurated pool. An LLM is used to categorize each instruction, and the selection favors challenging categories while keeping the distribution across categories balanced.
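A minimal sketch of this selection step is shown below. The `generate` callable stands in for whatever LLM inference API is used, and the category names, prompt wording, and per-category cap are illustrative assumptions rather than the paper's exact setup.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Illustrative target categories; the real category set comes from the LLM-based
# classification described in the paper.
TARGET_CATEGORIES = {"reasoning", "coding", "math", "safety", "general knowledge"}

def classify_instruction(instruction: str, generate: Callable[[str], str]) -> str:
    """Ask the LLM to assign a single category label to an instruction."""
    prompt = (
        "Classify the following user instruction into one category "
        f"from {sorted(TARGET_CATEGORIES)}.\n\nInstruction: {instruction}\nCategory:"
    )
    return generate(prompt).strip().lower()

def select_balanced_subset(
    instructions: List[str],
    generate: Callable[[str], str],
    per_category: int = 1000,  # illustrative cap to keep categories balanced
) -> List[str]:
    """Keep at most `per_category` instructions per target category."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for instruction in instructions:
        category = classify_instruction(instruction, generate)
        if category in TARGET_CATEGORIES and len(buckets[category]) < per_category:
            buckets[category].append(instruction)
    return [x for bucket in buckets.values() for x in bucket]
```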
Response Pair Construction
For each selected user instruction, a preference pair of two model responses is constructed via prompting: a direct response to the instruction serves as the preferred output, while a response generated for a deliberately modified version of the instruction serves as the inferior one. The pairs are therefore synthetic, yet come with a known preference label.
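The sketch below shows one way to realize this pair construction, following the idea of answering a modified instruction to obtain the inferior response. The prompt wording and the `generate` placeholder are assumptions, not the paper's exact prompts.

```python
from typing import Callable, Tuple

def build_preference_pair(
    instruction: str, generate: Callable[[str], str]
) -> Tuple[str, str]:
    """Return a (chosen, rejected) response pair for one instruction."""
    # The preferred response answers the original instruction directly.
    chosen = generate(f"Respond to the following instruction:\n{instruction}")

    # Ask the LLM for a related but subtly different instruction.
    modified_instruction = generate(
        "Write a modified version of this instruction that is similar "
        f"but asks for something slightly different:\n{instruction}"
    )
    # A good answer to the modified instruction is a plausible but inferior
    # answer to the original instruction.
    rejected = generate(
        f"Respond to the following instruction:\n{modified_instruction}"
    )
    return chosen, rejected
```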
Iterative Training
The core of the Self-Taught Evaluator is its iterative training loop:
- Judgment Annotation: For each training example, multiple evaluations (reasoning traces plus verdicts) are sampled from the current model. Judgments whose verdict matches the known synthetic label are kept; the rest are discarded.
- Model Fine-tuning: The model is fine-tuned on the newly constructed training set, resulting in an improved model for the next iteration.
This iterative process continues until the model reaches a satisfactory level of performance; a minimal sketch of one iteration follows below.
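The sketch puts the two steps together for a single iteration. The `judge` and `finetune` callables are placeholders for whatever inference and training stack is actually used, and the sampling budget is illustrative.

```python
from typing import Callable, List, Optional, Tuple

def sample_correct_judgment(
    instruction: str,
    response_a: str,
    response_b: str,
    gold: str,  # "A" or "B": which side is the known synthetic winner
    judge: Callable[[str, str, str], Tuple[str, str]],  # -> (reasoning, verdict)
    num_samples: int = 15,  # illustrative sampling budget
) -> Optional[str]:
    """Sample judgments and keep the first whose verdict matches the known label."""
    for _ in range(num_samples):
        reasoning, verdict = judge(instruction, response_a, response_b)
        if verdict == gold:
            return f"{reasoning}\nVerdict: {verdict}"
    return None  # discard the example if no sampled judgment is correct

def self_training_iteration(examples: List[dict], judge, finetune):
    """One iteration: annotate with the current judge, then fine-tune on kept data."""
    kept: List[dict] = []
    for ex in examples:
        annotation = sample_correct_judgment(
            ex["instruction"], ex["response_a"], ex["response_b"], ex["gold"], judge
        )
        if annotation is not None:
            kept.append({**ex, "judgment": annotation})
    # The fine-tuned model becomes the judge for the next iteration.
    return finetune(kept)
```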
Key Findings and Results
The Self-Taught Evaluator was tested on various benchmarks, and the results were promising:
RewardBench
Starting from a baseline accuracy of 75.4%, the Self-Taught Evaluator improved to 88.3% after five iterations. With majority voting using 32 samples, it achieved an impressive 88.7%, outperforming many existing reward models.
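Majority voting here simply means sampling several independent judgments and taking the most frequent verdict. A minimal sketch, reusing the `judge` placeholder from above, is given below; the sample count of 32 matches the reported setting, everything else is illustrative.

```python
from collections import Counter
from typing import Callable, List, Tuple

def majority_vote_verdict(
    instruction: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str, str], Tuple[str, str]],  # -> (reasoning, verdict)
    num_samples: int = 32,
) -> str:
    """Sample several judgments (temperature > 0) and return the most common verdict."""
    verdicts: List[str] = [
        judge(instruction, response_a, response_b)[1] for _ in range(num_samples)
    ]
    return Counter(verdicts).most_common(1)[0][0]
```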
MT-Bench
On MT-Bench, the Self-Taught Evaluator achieved an agreement rate of 79.5% with human judgments, on par with GPT-4, a state-of-the-art model.
HelpSteer2
When evaluated on the HelpSteer2 validation set, the Self-Taught Evaluator showed significant improvements in both average accuracy and position-consistent accuracy compared to the seed model.
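Position-consistent accuracy counts an example as correct only when the judge prefers the better response regardless of the order in which the two responses are presented. The sketch below reflects our reading of that metric, using the same `judge` placeholder as above; it is not the paper's evaluation code.

```python
from typing import Callable, Dict, List, Tuple

def position_consistent_accuracy(
    examples: List[Dict[str, str]],
    judge: Callable[[str, str, str], Tuple[str, str]],  # -> (reasoning, verdict)
) -> float:
    """Fraction of examples judged correctly under both response orderings."""
    correct = 0
    for ex in examples:
        # First ordering: the preferred response is shown in position A.
        _, first = judge(ex["instruction"], ex["preferred"], ex["rejected"])
        # Swapped ordering: the preferred response is shown in position B.
        _, second = judge(ex["instruction"], ex["rejected"], ex["preferred"])
        if first == "A" and second == "B":
            correct += 1
    return correct / len(examples)
```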
Implications and Applications
The Self-Taught Evaluator holds several implications for the future of AI development:
- Cost Efficiency: By eliminating the need for human preference annotations, it significantly reduces the cost and time involved in model evaluation.
- Scalability: The approach can easily adapt to new tasks and evaluation criteria, making it highly scalable.
- Continuous Improvement: The iterative nature ensures that the evaluator keeps improving alongside the models it evaluates.
Real-World Applications
- AI Research: Facilitates rapid experimentation and refinement of LLMs.
- Industry: Streamlines the deployment of AI models in various applications, from customer service to content generation.
- Education: Assists in developing educational tools that require constant updates and improvements.
Conclusion
The Self-Taught Evaluator represents a significant leap forward in the field of AI model evaluation. By leveraging synthetic data and iterative self-training, it provides a scalable, cost-effective, and continuously improving evaluation framework. As AI models continue to evolve, such innovative approaches will be essential in ensuring their effectiveness and reliability.
Quote from the Research Paper: "Our Self-Taught Evaluator with iterative training over synthetic preferences greatly boosts the accuracy of a strong seed LLM (Llama3-70B-Instruct) from 75.4 to 88.7 on RewardBench, setting a new state-of-the-art for generative LLM-as-a-Judge methods."
Future Research: While the Self-Taught Evaluator shows great promise, future research could explore its application to smaller models and other types of evaluations, such as single-response scoring.
By embracing such cutting-edge techniques, we can ensure that the next generation of AI models is not only powerful but also reliably evaluated and improved.