Original Paper: https://arxiv.org/abs/2401.10491
By: Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi
Abstract
While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures--Llama-2, MPT, and OpenLLaMA--across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are publicly available.
Summary Notes
Figure: Illustration of conventional model fusion techniques (ensemble and weight merging) and our knowledge fusion approach for LLMs (FUSELLM). Different animal icons represent different LLMs, with various species denoting LLMs possessing differing architectures. FUSELLM externalizes the knowledge from multiple LLMs and transfers their capabilities to a target LLM.
In the ever-evolving landscape of natural language processing (NLP), Large Language Models (LLMs) like GPT and LLaMA have set the benchmark for performance across a myriad of tasks. However, the development and deployment of these models come with significant costs, both financially and environmentally. A promising alternative to creating new models from scratch is the fusion of existing LLMs, leveraging their individual strengths into a more potent and efficient model. This blog post delves into a novel approach known as Knowledge Fusion of Large Language Models, explored in a recent study, which aims to achieve precisely this.
Breaking New Ground: The Concept of LLM Fusion
The core idea behind LLM fusion is to consolidate the capabilities of different pre-trained models into a single target model. This approach not only reduces redundancy but also enhances the model's performance across tasks like reasoning, commonsense understanding, and code generation. Unlike traditional ensemble methods, which aggregate outputs from multiple models at inference time, or weight merging, which requires identical architectures, this method offers a more holistic and cost-effective solution.
Methodology: A Glimpse Under the Hood
The research introduces a novel framework called FUSELLM. This approach embraces the diversity in architectures and functionalities of various LLMs by tapping into their generative distributions. Here's a step-by-step breakdown of the methodology:
- Probabilistic Distribution Perspective: For a given text, each source LLM produces a probability distribution over its vocabulary at every token position, reflecting its inherent understanding and knowledge. These distributions are the raw material of the fusion process (see the extraction sketch after this list).
- Token Alignment: Because the source LLMs use different tokenizers, their distributions must first be aligned. The study employs a minimum edit distance (MinED) strategy to map tokens across the different vocabularies (a minimal alignment sketch follows the list).
- Fusion Strategies: The aligned distributions are then combined. Two strategies are explored: MinCE, which keeps the distribution of the source model with the lowest cross-entropy on the gold text, and AvgCE, which averages the distributions weighted by their cross-entropy scores (see the fusion sketch below).
- Continual Training: Finally, the target LLM undergoes lightweight continual training that minimizes its divergence from the fused distributions alongside the standard language-modeling objective, thereby integrating knowledge from all source models (see the training-objective sketch below).
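To make the first step concrete, here is a minimal sketch of how per-token distributions could be extracted from a source model using the Hugging Face transformers library. The function name is illustrative and the checkpoint is a placeholder, not the paper's code; the paper applies this idea to Llama-2, MPT, and OpenLLaMA.

```python
# Illustrative sketch: extract per-token probability distributions from a source LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_token_distributions(model_name: str, text: str) -> torch.Tensor:
    """Return a (seq_len, vocab_size) matrix of next-token distributions for `text`."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # (1, seq_len, vocab_size)
    return torch.softmax(logits, dim=-1).squeeze(0)  # per-position probability distributions
```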
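For token alignment, the following is a minimal sketch of a MinED-style mapping: each source-tokenizer token is matched to the target-tokenizer token with the smallest edit distance. The helper names are assumptions for illustration; the paper's implementation operates over full vocabularies and the associated logit matrices.

```python
# Minimal sketch of MinED-style token alignment (illustrative only).
from typing import List

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(a)][len(b)]

def align_token(source_token: str, target_vocab: List[str]) -> str:
    """Map a source-tokenizer token to the closest target-tokenizer token by minimum edit distance."""
    return min(target_vocab, key=lambda t: edit_distance(source_token, t))
```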
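The fusion step can then be sketched as follows, assuming the source distributions have already been aligned to a shared vocabulary. `dists` holds one (seq_len, vocab_size) distribution per source model and `labels` holds the gold token ids; the exact weighting used by AvgCE in the paper may differ, so the softmax over negative cross-entropy below is one plausible choice, not the authors' formula.

```python
# Hedged sketch of the MinCE / AvgCE fusion strategies over aligned distributions.
import torch
import torch.nn.functional as F

def cross_entropy_per_model(dists: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of each source model's distributions against the gold tokens."""
    log_probs = torch.log(dists.clamp_min(1e-12))
    gold = labels.unsqueeze(0).unsqueeze(-1).expand(dists.size(0), -1, 1)
    nll = -log_probs.gather(-1, gold).squeeze(-1)    # (num_models, seq_len)
    return nll.mean(dim=-1)                           # (num_models,)

def fuse_min_ce(dists: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """MinCE: keep the distribution of the source model with the lowest cross-entropy."""
    best = cross_entropy_per_model(dists, labels).argmin()
    return dists[best]

def fuse_avg_ce(dists: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """AvgCE: average the source distributions, weighting models with lower cross-entropy more."""
    ce = cross_entropy_per_model(dists, labels)
    weights = F.softmax(-ce, dim=0)                   # illustrative weighting choice
    return (weights.view(-1, 1, 1) * dists).sum(dim=0)
```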
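Finally, a sketch of the continual-training objective: the ordinary causal-LM loss on the gold text is combined with a divergence term that pulls the target model's distributions toward the fused distributions. The weight `lam` and the use of KL divergence here are illustrative assumptions, and label shifting for next-token prediction is omitted for brevity.

```python
# Sketch of the combined continual-training loss (simplified; labels are assumed pre-shifted).
import torch
import torch.nn.functional as F

def fusion_training_loss(target_logits: torch.Tensor,
                         fused_dist: torch.Tensor,
                         labels: torch.Tensor,
                         lam: float = 0.9) -> torch.Tensor:
    """Weighted combination of the causal-LM loss and a KL term against the fused distribution."""
    vocab = target_logits.size(-1)
    # Standard next-token cross-entropy on the gold labels.
    clm_loss = F.cross_entropy(target_logits.view(-1, vocab), labels.view(-1))
    # Divergence between the target model's predictions and the fused "teacher" distributions.
    log_probs = F.log_softmax(target_logits, dim=-1)
    fusion_loss = F.kl_div(log_probs, fused_dist, reduction="batchmean")
    return lam * clm_loss + (1.0 - lam) * fusion_loss
```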
Key Findings: Elevating Performance Beyond Individual Models
The empirical evaluation of FUSELLM demonstrated its effectiveness across several benchmarks:
- Reasoning Tasks: On the Big-Bench Hard (BBH) benchmark, FUSELLM achieved a notable performance gain of 5.16% over the original Llama-2 model.
- Commonsense Understanding: It outperformed baseline models with a 1.25% improvement on commonsense benchmarks, showcasing its ability to handle everyday reasoning problems.
- Code Generation: The model exhibited significant improvements in zero-shot code generation tasks, with an average performance gain of 6.36%.
These results underscore the potential of FUSELLM in harnessing collective knowledge from diverse models, surpassing the capabilities of any single source model.
Implications and Applications: The Road Ahead
The implications of this research are profound, particularly in areas where model efficiency and performance are paramount. By reducing the need to maintain multiple models, FUSELLM offers a scalable solution that can be adapted for various applications, from intelligent virtual assistants to automated content generation.
Moreover, the approach paves the way for further exploration in the domain of LLM fusion, especially given the diverse structures and substantial sizes of modern language models. As the field progresses, refinements in token alignment and fusion strategies could further enhance the efficacy of such models.
Conclusion: A New Frontier in NLP
The fusion of LLMs represents a significant stride toward more efficient and powerful NLP models. By externalizing and integrating the collective knowledge of multiple models, FUSELLM not only enhances performance but also offers a sustainable alternative to traditional model development. As we continue to explore the frontiers of artificial intelligence, such innovations will undoubtedly play a crucial role in shaping the future of technology and its applications.
In summary, the concept of LLM fusion, as demonstrated by FUSELLM, opens up new possibilities for creating robust, efficient, and versatile language models. This approach not only addresses the limitations of existing methodologies but also sets the stage for future advancements in the field.