Extreme Compression of Large Language Models via Additive Quantization
Original Paper: https://arxiv.org/abs/2401.06118v3
By: Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
Abstract:
The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices.
In this paper, we revisit the problem of "extreme" LLM compression -- defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter -- from the point of view of classic methods in Multi-Codebook Quantization (MCQ).
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations:
1) learned additive quantization of weight matrices in input-adaptive fashion, and
2) joint optimization of codebook parameters across each transformer block.
Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2-bit) regime.
In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing in a much smaller memory footprint.
Summary Notes
Figure: Comparison of AQLM (2-bit) relative to the state-of-the-art QuIP# (2-bit) and the original 16-bit weights on LLAMA 2 7B, 13B, and 70B models.
Introduction
In the fast-paced world of machine learning, the emergence of highly accurate large language models (LLMs) has sparked a race to develop efficient quantization techniques.
These techniques are essential for enabling the execution of LLMs on end-user devices, where computational and memory resources are limited.
In their recent paper, Egiazarian et al. revisit the challenge of "extreme" LLM compression, defined as reducing the footprint to as low as 2 to 3 bits per parameter.
Their proposed algorithm, Additive Quantization of Language Models (AQLM), sets a new state of the art in this regime.
Key Methodologies
The primary objective of the research is to compress LLMs to extremely low bit-widths while preserving model accuracy. The AQLM algorithm leverages two main innovations:
- Learned Additive Quantization: Weight matrices are quantized as sums of vectors drawn from learned codebooks, with codes and codebooks chosen to preserve each layer's outputs on calibration inputs rather than the raw weight values (see the sketch after this list).
- Joint Optimization of Codebook Parameters: After per-layer quantization, AQLM jointly fine-tunes the codebook parameters of all layers within a transformer block, recovering accuracy that purely layer-wise quantization would lose.
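To make the first idea concrete, here is a minimal NumPy sketch (not the authors' implementation; all sizes and names are illustrative): each group of weights is reconstructed as the sum of one vector per codebook, and quality is judged by how well the quantized weights reproduce the layer's outputs on calibration activations rather than by plain weight error.

```python
import numpy as np

# Illustrative sizes only; real configurations use, e.g., groups of 8 weights
# and codebooks with 2**8 or 2**16 entries shared across a whole weight matrix.
group_size = 8        # weights quantized together as one vector
num_codebooks = 2     # number of additive codebooks (M)
codebook_size = 256   # entries per codebook (K)
d_in = 32             # toy layer width (a multiple of group_size)

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(num_codebooks, codebook_size, group_size))

def decode_group(group_codes, codebooks):
    """Reconstruct one weight group as the sum of one entry per codebook."""
    return sum(codebooks[m, group_codes[m]] for m in range(len(group_codes)))

# One weight row, split into groups, with (here, randomly chosen) codes per group.
weights = rng.normal(size=d_in)
codes = rng.integers(0, codebook_size, size=(d_in // group_size, num_codebooks))
quantized = np.concatenate(
    [decode_group(codes[g], codebooks) for g in range(codes.shape[0])]
)

# Input-adaptive objective: compare layer *outputs* on calibration inputs X,
# i.e. ||W X - W_hat X||^2, rather than comparing the weights directly.
X = rng.normal(size=(d_in, 128))   # calibration activations
output_error = np.linalg.norm(weights @ X - quantized @ X) ** 2
weight_error = np.linalg.norm(weights - quantized) ** 2
print(f"output-space error: {output_error:.1f}  weight-space error: {weight_error:.1f}")
```

In the paper, the discrete codes and the codebooks are optimized (rather than drawn at random) so as to minimize exactly this kind of output-space error, layer by layer.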
Additive Quantization (AQ) is a classic approach used in information retrieval to compress databases of vectors for efficient search operations. AQLM extends AQ to LLM weight compression, preserving the outputs of each layer and transformer block.
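The block-level, joint part of the method can be pictured as a short fine-tuning loop run after per-layer quantization: the discrete codes are held fixed while the continuous parameters (codebooks and scales) of all quantized layers in a block are trained so that the block as a whole matches the original block's outputs. The PyTorch-style sketch below is a simplified reading of that step, assuming this setup; `orig_block`, `quant_block`, and `calib_inputs` are placeholders, not the paper's actual interfaces.

```python
import torch
import torch.nn.functional as F

def finetune_block(orig_block, quant_block, calib_inputs, lr=1e-4, steps=100):
    """Jointly tune the continuous parameters (codebooks, scales) of every
    quantized layer in one transformer block so that the block reproduces the
    original block's outputs on calibration data; discrete codes stay fixed."""
    trainable = [p for p in quant_block.parameters() if p.requires_grad]
    opt = torch.optim.Adam(trainable, lr=lr)
    for _ in range(steps):
        for x in calib_inputs:                    # batches of hidden states
            with torch.no_grad():
                target = orig_block(x)            # reference (uncompressed) output
            loss = F.mse_loss(quant_block(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return quant_block
```

Optimizing at the block level lets later layers compensate for quantization error introduced by earlier ones, which purely layer-wise methods cannot do.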
Main Findings and Results
The research paper presents several significant findings:
- Pareto Optimality: AQLM is the first scheme to achieve Pareto optimality in terms of accuracy vs. model size when compressing to less than 3 bits per parameter (a worked bit-budget example follows this list).
- Performance: It significantly improves upon all previously known schemes in the extreme compression (2-bit) regime.
- Practical Implementations: The authors provide efficient GPU and CPU implementations of AQLM, capable of matching or outperforming optimized FP16 implementations in terms of speed while operating within a much smaller memory footprint.
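As a rough illustration of how bit budgets like "2 bits per parameter" arise in a multi-codebook scheme (the paper's exact configurations differ per model and are not reproduced here): with M codebooks of K entries applied to groups of g weights, the indices cost M·log2(K)/g bits per weight, plus a small amortized cost for storing the FP16 codebooks themselves. A quick estimate, ignoring per-output scales:

```python
import math

def bits_per_weight(num_codebooks, codebook_size, group_size,
                    rows=4096, cols=4096):
    """Approximate average bits per weight for an additive-quantization layout:
    each group of `group_size` weights stores one index per codebook, and the
    FP16 codebooks are shared across the whole rows x cols weight matrix."""
    index_bits = num_codebooks * math.log2(codebook_size) / group_size
    codebook_bits = num_codebooks * codebook_size * group_size * 16 / (rows * cols)
    return index_bits + codebook_bits

# Two illustrative layouts (not necessarily the paper's exact configurations):
for m, k in [(1, 2**16), (2, 2**8)]:
    print(f"M={m}, K={k}, g=8 -> {bits_per_weight(m, k, group_size=8):.2f} bits/weight")
```

The codebook overhead is amortized over the entire weight matrix, so it shrinks further for larger layers; only the per-group indices scale with the number of weights.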
Implications and Potential Applications
The implications of this research are profound. By enabling the execution of LLMs with extremely low bit-widths, AQLM can facilitate the deployment of powerful language models on a wide range of devices, including mobile phones and edge devices.
This could democratize access to advanced AI capabilities, bringing sophisticated language understanding and generation to everyday applications.
Potential applications include:
- Edge AI: Deploying LLMs on edge devices for tasks like speech recognition, translation, and conversational AI.
- Embedded Systems: Integrating LLMs into IoT devices for smart home automation and other intelligent systems.
- Mobile Applications: Enhancing the capabilities of mobile apps with advanced language processing features without the need for cloud-based computation.
Conclusion
The AQLM algorithm marks a significant leap forward in the field of LLM compression. By achieving extreme compression ratios with minimal loss in accuracy,
AQLM paves the way for the widespread deployment of powerful language models on resource-constrained devices. This research not only advances the state-of-the-art in model compression but also opens up new possibilities for the practical application of LLMs in everyday technology.
References
Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv preprint arXiv:2401.06118.