Discovering Preference Optimization Algorithms with and for Large Language Models
Original Paper: https://arxiv.org/abs/2406.08414
By: Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, Robert Tjarko Lange
Abstract:
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually-crafted convex loss functions.
While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains under-explored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously-evaluated performance metrics. This process leads to the discovery of previously-unknown and performant preference optimization algorithms.
The best performing of these we call Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.
Summary Notes
In the realm of machine learning, optimizing the performance of large language models (LLMs) to align with human preferences is a critical task.
As LLMs become increasingly integrated into applications ranging from chatbots to content generation tools, ensuring that these models produce high-quality, ethical, and value-aligned outputs is paramount.
Traditionally, preference optimization relies on manually-crafted loss functions designed by human experts, but this approach is inherently limited by human creativity and expertise.
A recent study proposes a groundbreaking method to automate the discovery of new preference optimization algorithms using LLMs themselves.
Introduction
Training LLMs typically involves pre-training on large text corpora followed by fine-tuning to better align the model with human preferences. However, even instruction-fine-tuned LLMs can generate harmful or unethical outputs.
To mitigate this, techniques like reinforcement learning from human feedback (RLHF) have been employed, and more recently, offline preference optimization algorithms have been developed.
These algorithms cast alignment as a supervised learning objective over a fixed dataset of preference-ranked completions, steering model outputs toward human values without extensive real-time human intervention.
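The article does not spell out a concrete objective, but a representative example of such a supervised objective is the logistic (DPO-style) loss over preference pairs. The following minimal PyTorch sketch is an illustration under that assumption; the tensor names and the beta value are placeholders, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def logistic_preference_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Logistic (DPO-style) loss over a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) under either the policy being trained or a frozen reference model.
    """
    # Difference of log-ratios between the chosen and rejected completions
    rho = (policy_chosen_logps - ref_chosen_logps) \
        - (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(beta * rho): small when the policy prefers the chosen answer
    return -F.logsigmoid(beta * rho).mean()
```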
Methodology: LLM-Driven Objective Discovery
The novel approach introduced in the study uses LLMs to autonomously discover new preference optimization algorithms: an LLM iteratively proposes new loss functions, conditioned on the performance metrics of previously evaluated candidates, and each proposal is then implemented and evaluated.
Initial Context Construction
The process begins by "burning-in" the LLM with several established objective functions and their corresponding performance metrics. This provides a foundational context that the LLM can build upon.
Iterative Refinement and Evaluation
- LLM Querying: The LLM proposes a new candidate objective function.
- Output Validation: The proposed function undergoes a set of unit tests to ensure validity.
- Performance Evaluation: The function is used to fine-tune an LLM, and its performance is evaluated on a downstream task.
- Feedback Loop: The performance data is fed back to the LLM, which refines its strategy in subsequent iterations (the full loop is sketched below).
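Stitched together, the loop can be pictured roughly as in the sketch below. The helper functions (`propose_objective`, `passes_unit_tests`, `finetune_and_evaluate`) and the data format are hypothetical placeholders standing in for the paper's actual pipeline.

```python
# Rough sketch of the LLM-driven objective discovery loop described above.
# propose_objective, passes_unit_tests, and finetune_and_evaluate are
# hypothetical placeholders, not the paper's actual implementation.

def discover_objectives(llm, seed_objectives, num_generations=50):
    # Burn-in: start the context with established losses and their scores,
    # e.g. [("dpo", dpo_code_str, dpo_score), ("slic", slic_code_str, slic_score)]
    history = list(seed_objectives)

    for _ in range(num_generations):
        # 1. LLM querying: propose a new candidate loss as runnable code
        candidate_code = propose_objective(llm, history)

        # 2. Output validation: discard candidates that fail unit tests
        if not passes_unit_tests(candidate_code):
            continue

        # 3. Performance evaluation: fine-tune a model with the candidate
        #    loss and score it on a downstream benchmark
        score = finetune_and_evaluate(candidate_code)

        # 4. Feedback loop: record the result so later proposals improve
        history.append(("candidate", candidate_code, score))

    # Return objectives sorted by their evaluation score, best first
    return sorted(history, key=lambda item: item[2], reverse=True)
```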
This iterative process continues until a set of high-performing loss functions is discovered. The study introduces a particularly strong new algorithm named Discovered Preference Optimization (DiscoPOP), which adaptively blends logistic and exponential losses.
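Based on the paper's description of DiscoPOP as an adaptive blend of logistic and exponential losses, modulated by the difference of policy and reference log-ratios, a plausible sketch looks like the following. The gating form and the beta and tau values shown here are illustrative assumptions; the official repository contains the exact discovered objective.

```python
import torch
import torch.nn.functional as F

def discopop_style_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,   # illustrative regularization strength
    tau: float = 0.05,   # illustrative gating temperature
) -> torch.Tensor:
    """Sketch of a DiscoPOP-style loss: a sigmoid gate on the log-ratio
    difference adaptively mixes a logistic (DPO-like) term with an
    exponential term."""
    # Difference of log-ratios between chosen and rejected completions
    rho = (policy_chosen_logps - policy_rejected_logps) \
        - (ref_chosen_logps - ref_rejected_logps)
    logistic_term = -F.logsigmoid(beta * rho)  # logistic loss component
    exp_term = torch.exp(-beta * rho)          # exponential loss component
    gate = torch.sigmoid(rho / tau)            # adaptive mixing coefficient
    return ((1.0 - gate) * logistic_term + gate * exp_term).mean()
```

In this sketch, the gate shifts weight between the two components depending on how strongly the policy already separates the chosen and rejected responses relative to the reference model.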
Key Findings and Results
The experiments demonstrate that DiscoPOP achieves state-of-the-art performance across multiple tasks, spanning multi-turn and single-turn dialogue, summarization, and controlled sentiment generation. Here's a summary of the key findings:
Performance on Multi-Turn Dialogue (MT-Bench)
DiscoPOP and other discovered objective functions were evaluated on the MT-Bench benchmark, a multi-turn dialogue evaluation using GPT-4 as the judge. DiscoPOP showed a notable improvement in performance, scoring higher than many existing algorithms.
Single-Turn Dialogue (Alpaca Eval 2.0)
On the Alpaca Eval 2.0 benchmark, which uses GPT-4 Turbo to assess the win rate of LLM completions, DiscoPOP outperformed baselines, demonstrating its effectiveness in generating high-quality, preference-aligned outputs.
Summarization (TL;DR)
In the summarization task, DiscoPOP and another discovered loss function, PADLL, performed competitively, showing that these new algorithms generalize well across different types of preference optimization tasks.
Positive Sentiment Generation (IMDb)
DiscoPOP also excelled at generating positive-sentiment movie reviews, achieving a better reward-versus-KL-divergence trade-off than traditional methods, especially at moderate levels of the regularization coefficient.
Implications and Applications
The ability to automatically discover high-performing preference optimization algorithms has significant implications:
- Efficiency: Reduces the need for continuous human expert intervention in developing new optimization techniques.
- Scalability: Facilitates the deployment of aligned LLMs in various applications, from virtual assistants to automated content moderation tools.
- Innovation: Encourages the exploration of novel loss functions that might not be conceived through manual design.
Conclusion
The study showcases the potential of LLM-driven objective discovery to revolutionize the field of preference optimization. DiscoPOP, with its adaptive blending of logistic and exponential losses, stands out as a highly effective algorithm that enhances the alignment of LLM outputs with human values.
This approach not only advances the state-of-the-art but also opens up new avenues for research and application in the optimization of LLMs.
As this technology continues to evolve, it is crucial to address its limitations, such as ensuring stability across different regularization parameters and exploring multi-parameter optimization.
Future work could also involve using the discovered models to generate further improvements, pushing the boundaries of what LLMs can achieve in aligning with human preferences.
Quote from the study: "By leveraging the extensive knowledge embedded in LLMs, we can automatically discover novel state-of-the-art preference optimization algorithms without continual expert human intervention."
The full code and additional details of this study are available in the DiscoPOP GitHub repository.