Original Paper: https://arxiv.org/abs/2401.06118v3
By: Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
Abstract:
The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression -- defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter -- from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in input-adaptive fashion, and 2) joint optimization of codebook parameters across each transformer block. Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2-bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing in a much smaller memory footprint.
Summary Notes
Figure: Comparison of AQLM (2-bit) relative to the state-of-the-art QuIP# (2-bit) and the original 16-bit weights on LLAMA 2 7B, 13B, and 70B models.
Introduction
In the fast-paced world of machine learning, the emergence of highly accurate large language models (LLMs) has sparked a race to develop efficient quantization techniques. These techniques are essential for enabling the execution of LLMs on end-user devices, where computational and memory resources are limited. In their recent paper, Egiazarian et al. revisit the challenge of "extreme" LLM compression—defined as reducing weights to as few as 2-3 bits per parameter. Their proposed algorithm, Additive Quantization of Language Models (AQLM), offers substantial improvements in this domain.
Key Methodologies
The primary objective of the research is to compress LLMs to extremely low bit-widths while preserving model accuracy. The AQLM algorithm leverages two main innovations:
- Learned Additive Quantization: Weight matrices are quantized additively in an input-adaptive fashion, meaning the codes and codebooks are fitted to the layer's behavior on sample inputs rather than to the raw weight values (see the sketch after this list).
- Joint Optimization of Codebook Parameters: AQLM jointly fine-tunes codebook parameters across each transformer block, rather than treating each layer in isolation, which improves accuracy at the same bit-width.
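To make the first innovation concrete, here is a minimal NumPy sketch of additive quantization with an input-adaptive objective. The group size, number of codebooks, and calibration data are illustrative assumptions, not the configurations used in the paper.

```python
import numpy as np

# Toy additive quantization of a single weight group (illustrative sizes,
# not the configurations used in the paper).
rng = np.random.default_rng(0)
d, M, K = 8, 2, 256        # group size, number of codebooks, codewords each

w = rng.normal(size=d)                    # original weight group
codebooks = rng.normal(size=(M, K, d))    # learned codebooks C_1 ... C_M
codes = rng.integers(0, K, size=M)        # one codeword index per codebook

# Additive quantization: the group is approximated by the SUM of the
# selected codewords, one taken from each codebook.
w_hat = codebooks[np.arange(M), codes].sum(axis=0)

# "Input-adaptive" objective: the fit is judged on the layer's outputs over
# sample (calibration) inputs X, not on the raw weight values.
X = rng.normal(size=(64, d))
output_error = np.linalg.norm(X @ w - X @ w_hat) ** 2
print(output_error)
```

The key point is that each group of weights is reconstructed as a sum of codewords, and the approximation error is measured on what the layer actually computes.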
Additive Quantization (AQ) is a classic approach used in information retrieval to compress databases of vectors for efficient search operations. AQLM extends AQ to LLM weight compression, preserving the outputs of each layer and transformer block.
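The block-level part of this can be pictured as gradient-based fine-tuning of the continuous codebooks against the original block's outputs, with the discrete codes held fixed. The sketch below uses a single toy linear map in place of a full transformer block and made-up sizes; it illustrates the idea rather than reproducing the authors' implementation.

```python
import torch

# Toy block-level fine-tuning: with the discrete codes held fixed, the
# continuous codebooks are optimized so the quantized "block" (here just
# one linear map) reproduces the original block's outputs on sample inputs.
# All sizes are illustrative assumptions.
torch.manual_seed(0)
d_out, groups, d = 16, 2, 8           # output rows, groups per row, group size
d_in = groups * d
M, K = 2, 16                          # codebooks per group, codewords each

W_ref = torch.randn(d_out, d_in)                        # original weights
codes = torch.randint(0, K, (d_out, groups, M))         # fixed codes
codebooks = torch.nn.Parameter(0.1 * torch.randn(M, K, d))

def dequantize():
    # Sum the selected codewords for every (row, group), then lay the
    # groups back out as a full weight matrix.
    cw = codebooks[torch.arange(M), codes]              # (d_out, groups, M, d)
    return cw.sum(dim=2).reshape(d_out, d_in)

X = torch.randn(256, d_in)                              # calibration inputs
target = X @ W_ref.T                                    # original block outputs
opt = torch.optim.Adam([codebooks], lr=1e-2)
for _ in range(200):
    loss = torch.mean((X @ dequantize().T - target) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
```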
Main Findings and Results
The research paper presents several significant findings:
- Pareto Optimality: AQLM is the first scheme to be Pareto optimal in terms of accuracy vs. model size when compressing to less than 3 bits per parameter.
- Performance: It significantly improves upon all known schemes in the extreme-compression (2-bit) regime.
- Practical Implementations: The authors provide efficient GPU and CPU implementations of AQLM for token generation, capable of matching or outperforming optimized FP16 implementations in speed while operating within a much smaller memory footprint (a rough footprint estimate follows this list).
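For intuition on the footprint claim, here is a back-of-the-envelope comparison for a hypothetical 7B-parameter model (illustrative arithmetic, not numbers reported in the paper):

```python
# Back-of-the-envelope weight memory for a hypothetical 7B-parameter model.
params = 7e9
fp16_gib = params * 16 / 8 / 2**30     # ~13.0 GiB at 16 bits per weight
two_bit_gib = params * 2 / 8 / 2**30   # ~1.6 GiB at ~2 bits per weight
print(f"FP16: {fp16_gib:.1f} GiB, 2-bit: {two_bit_gib:.1f} GiB")
```

Since single-token generation is typically limited by how fast weights can be read from memory, moving roughly 8x fewer bytes per step is what allows the quantized kernels to keep pace with, or beat, optimized FP16 inference.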
Implications and Potential Applications
The implications of this research are profound. By enabling the execution of LLMs with extremely low bit-widths, AQLM can facilitate the deployment of powerful language models on a wide range of devices, including mobile phones and edge devices. This could democratize access to advanced AI capabilities, bringing sophisticated language understanding and generation to everyday applications.
Potential applications include:
- Edge AI: Deploying LLMs on edge devices for tasks like speech recognition, translation, and conversational AI.
- Embedded Systems: Integrating LLMs into IoT devices for smart home automation and other intelligent systems.
- Mobile Applications: Enhancing the capabilities of mobile apps with advanced language processing features without the need for cloud-based computation.
Conclusion
The AQLM algorithm marks a significant leap forward in the field of LLM compression. By achieving extreme compression ratios with minimal loss in accuracy, AQLM paves the way for the widespread deployment of powerful language models on resource-constrained devices. This research not only advances the state-of-the-art in model compression but also opens up new possibilities for the practical application of LLMs in everyday technology.
References
Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv:2401.06118.