Original Paper: https://arxiv.org/abs/2407.13647
By: Yuqing Yang, Yan Ma, Pengfei Liu
Abstract:
When large language models (LLMs) exceed human-level capabilities, it becomes increasingly challenging to provide full-scale and accurate supervisions for these models. Weak-to-strong learning, which leverages a less capable model to unlock the latent abilities of a stronger model, proves valuable in this context. Yet, the efficacy of this approach for complex reasoning tasks is still untested. Furthermore, tackling reasoning tasks under the weak-to-strong setting currently lacks efficient methods to avoid blindly imitating the weak supervisor including its errors. In this paper, we introduce a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from either a more advanced model or human-annotated data. This framework begins with supervised fine-tuning on a selective small but high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets demonstrate that our method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models. This method is further validated in a forward-looking experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b on the highly challenging OlympicArena dataset. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning powers.
Summary Notes
(a) Llama2-7b supervises Llama2-70b on GSM8K (Cobbe et al., 2021).
(b) Llama3-8b-instruct supervises Llama3-70b on OlympicArena (Huang et al., 2024).
Introduction
Artificial General Intelligence (AGI) aims to create superintelligent systems that surpass human cognitive capabilities.
As we edge closer to this ambitious goal, one fundamental challenge looms large: how do we supervise and train models that are more capable than their human supervisors?
This is where the concept of weak-to-strong learning comes into play. In this blog post, we delve into a groundbreaking research paper that introduces a novel weak-to-strong learning framework designed to enhance AI reasoning capabilities, even in the absence of human-level supervision.
The Challenge of Superintelligent Models
Traditional AI training paradigms often rely on human-annotated data or guidance from more advanced models.
However, as large language models (LLMs) become more sophisticated, they can surpass the capabilities of their human supervisors, making traditional supervision methods inadequate.
This research tackles the core question: Can less capable models effectively guide the development of stronger, more advanced models in complex reasoning tasks?
Methodology: A Progressive Refinement Framework
The proposed solution is a progressive learning framework that enables a strong model to autonomously refine its training data.
This two-stage process involves supervised fine-tuning on a small, high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself.
Stage I: Supervised Fine-Tuning
In the first stage, the strong model is fine-tuned using a selectively curated dataset. This dataset is derived from a combination of data generated by a weak model and data self-generated by the strong model through in-context learning.
The key innovation here is the use of final answer consistency to filter and select high-quality data for fine-tuning.
- Full Weak Fine-Tuning: Initially, the strong model is fine-tuned on the entire dataset generated by the weak model. This yields some improvement, but it falls short on complex reasoning tasks.
- Weak In-Context Learning (Weak-ICL): The strong model uses a few-shot approach with weak-model demonstrations to generate its own solutions. These solutions are then compared with those from the weak model, and only samples whose final answers agree are kept for fine-tuning, which significantly improves the model's performance (see the sketch after this list).
- Iterative Training: This process can be repeated, using the strong model's enhanced capabilities to further refine the training data and improve performance.
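To make the final-answer-consistency filter concrete, here is a minimal Python sketch. It assumes GSM8K-style solutions that end in a "#### <answer>" line, and the helper names (extract_final_answer, build_sft_dataset) are illustrative placeholders rather than the authors' actual implementation.

```python
import re

def extract_final_answer(solution: str):
    """Pull the final numeric answer out of a GSM8K-style '#### <answer>' line."""
    match = re.search(r"####\s*(-?[\d.,]+)", solution)
    return match.group(1).replace(",", "") if match else None

def build_sft_dataset(questions, weak_solutions, strong_icl_solutions):
    """Keep weak-generated and strong-ICL-generated solutions only when their
    final answers agree, using agreement as a cheap proxy for correctness."""
    selected = []
    for q in questions:
        weak_ans = extract_final_answer(weak_solutions[q])
        strong_ans = extract_final_answer(strong_icl_solutions[q])
        if weak_ans is not None and weak_ans == strong_ans:
            # Both sources of supervision are retained as fine-tuning targets.
            selected.append({"question": q, "solution": weak_solutions[q]})
            selected.append({"question": q, "solution": strong_icl_solutions[q]})
    return selected
```

The selected samples then serve as ordinary supervised fine-tuning targets for the strong model, and the filter can be re-run in later iterations with the improved model's outputs.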
Stage II: Preference Optimization
Once the strong model has achieved a certain level of proficiency, the second stage involves preference optimization.
Here, the model learns from its own mistakes and the errors of the weak model by constructing contrastive samples.
Using preference-optimization techniques such as Direct Preference Optimization (DPO) and ORPO, the model learns to distinguish correct from incorrect solutions, further enhancing its reasoning capabilities.
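As a rough illustration of this stage, the sketch below pairs a likely-correct solution with a likely-incorrect one for each question and computes the standard DPO loss for a single pair. The pairing criterion (is_chosen) and the log-probability inputs are assumptions for illustration, not the paper's exact construction.

```python
import math

def build_preference_pairs(questions, candidate_solutions, is_chosen):
    """Form contrastive pairs: one preferred and one dispreferred solution per question."""
    pairs = []
    for q in questions:
        chosen = [s for s in candidate_solutions[q] if is_chosen(q, s)]
        rejected = [s for s in candidate_solutions[q] if not is_chosen(q, s)]
        if chosen and rejected:
            pairs.append({"question": q, "chosen": chosen[0], "rejected": rejected[0]})
    return pairs

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one pair; inputs are summed token log-probabilities
    under the policy being trained and a frozen reference model."""
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))
```

ORPO, by contrast, drops the frozen reference model and folds an odds-ratio penalty on the rejected response into the supervised objective; either objective can consume the same contrastive pairs.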
Key Findings and Results
The proposed weak-to-strong learning framework was tested on two well-known mathematical reasoning datasets, GSM8K and MATH, using Llama2-70b as the strong model and three different weak models: Llama2-7b, Gemma-2b, and Mistral-7b.
The results are impressive:
- Significant Performance Improvement: The strong model, when supervised by the weak Gemma-2b, improved its performance on GSM8K by 26.99 points after the first stage of training. Further preference optimization added an additional 8.49 points.
- Enhanced Reasoning Capabilities: The strong model fine-tuned using the proposed method outperformed naive fine-tuning approaches, achieving higher accuracy and robustness in reasoning tasks.
- Effective Data Utilization: The final answer consistency method proved effective in selecting high-quality data, which is crucial for the model's learning process.
Implications and Future Applications
The implications of this research are profound. By leveraging weak-to-strong learning, we can unlock the latent capabilities of superintelligent models without relying on human-annotated data.
This approach paves the way for more scalable and sophisticated strategies to enhance AI reasoning powers, crucial for solving complex real-world problems.
Potential Applications:
- Educational AI: Enhancing the reasoning capabilities of educational AI systems to provide more accurate and insightful feedback to students.
- Scientific Research: Assisting in solving complex mathematical and scientific problems that are currently beyond human reach.
- Automated Decision Making: Improving the decision-making processes in various industries, from finance to healthcare, by leveraging advanced reasoning capabilities.
Conclusion
The weak-to-strong learning framework represents a significant advancement in the field of AI, addressing the challenge of training models that exceed human capabilities.
By allowing strong models to autonomously refine their training data and learn from both their own and the weak models' mistakes, this approach offers a robust and scalable solution for enhancing AI reasoning capabilities.
As we continue to push the boundaries of what AI can achieve, methodologies like this will be crucial in guiding us toward the realization of true superintelligence.