BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Original Paper: https://arxiv.org/abs/2402.04291

By: Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

Abstract:

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM, achieving for the first time high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforms SOTA quantization methods for LLMs by significant margins. Moreover, BiLLM enables the binarization of an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency.

Summary Notes


Figure: Perplexity of LLaMA-13B on WikiText2 under different bit-widths (lower is better). Round-to-nearest (RTN), GPTQ, and PB-LLM (10% of weights kept in INT8) suffer severe accuracy loss at ultra-low bit-widths, with sharply increasing perplexity, while BiLLM maintains strong performance under binarization.

Introduction

In the world of natural language processing (NLP), large language models (LLMs) like GPT-3 and LLaMA have set new benchmarks in performance and versatility. However, these models come with a hefty price—immense memory and computational demands. This blog post delves into a groundbreaking approach to alleviate these challenges through a method called BiLLM, which pushes the boundaries of post-training quantization to an unprecedented 1-bit weight representation.

Understanding the Challenge

Current LLMs, such as the 70-billion parameter LLaMA2-70B, require around 150 GB of storage in half-precision (FP16) format. Deploying these models typically necessitates multiple high-end GPUs, making them impractical for memory-constrained environments. Quantization techniques have been employed to reduce the memory footprint of these models, but achieving high accuracy with ultra-low bit-widths remains a significant hurdle.
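For a sense of scale, here is a rough back-of-the-envelope weight-storage estimate. It is illustrative only, not a figure from the paper's tables: it counts weight bits alone and ignores activations, bitmaps, scaling factors, and other runtime overheads.

```python
# Rough weight-storage estimate for LLaMA2-70B (illustrative only).
params = 70e9                          # parameter count

fp16_gb = params * 16 / 8 / 1e9        # 16 bits per weight -> ~140 GB
billm_gb = params * 1.08 / 8 / 1e9     # ~1.08 bits per weight -> ~9.5 GB

print(f"FP16 weights:     ~{fp16_gb:.0f} GB")
print(f"1.08-bit weights: ~{billm_gb:.1f} GB")
```

Even before accounting for the metadata a real deployment needs, the drop from roughly 140 GB to under 10 GB of weight storage explains why 1-bit quantization is attractive for single-GPU and edge scenarios.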

Introducing BiLLM

BiLLM, or Binarized Large Language Model, presents a novel 1-bit post-training quantization (PTQ) scheme tailored for LLMs. Unlike traditional quantization methods that struggle with maintaining performance at ultra-low bit-widths, BiLLM excels by leveraging the unique weight distribution characteristics of LLMs. It employs a dual strategy: a residual approximation for salient weights and an optimal splitting search for non-salient weights.

Key Methodologies

  1. Residual Approximation for Salient Weights:
    • Hessian-Based Selection: The Hessian of the layer-wise reconstruction loss is used to score how sensitive the output is to each weight. The small fraction of weights with the highest sensitivity scores is identified as salient.
    • Structural Selection: Salient weights are selected structurally, focusing on specific columns or rows, to reduce the overhead of additional bitmap storage.
    • Residual Approximation: This strategy involves a recursive computation where the residual error after initial binarization is further binarized, minimizing quantization errors.
  2. Optimal Splitting for Non-Salient Weights:
    • Bell-Shaped Distribution: The non-salient weights typically follow a bell-shaped distribution. A search identifies an optimal break-point that splits these weights into a concentrated region and a sparse region.
    • Binarization: Each segment is binarized separately with its own scaling factor to minimize error, and block-wise error compensation further reduces quantization inaccuracies. A minimal sketch of both binarization paths follows this list.
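The two ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it works on a single toy weight row, takes the salient mask as given (in BiLLM it comes from a Hessian-based, column-structured selection), and finds the break-point by brute force over a coarse grid rather than the paper's search procedure.

```python
import numpy as np

def binarize(w):
    """1-bit binarization: sign(w) scaled by the mean absolute value of w."""
    alpha = np.abs(w).mean() if w.size else 0.0
    return alpha * np.sign(w)

def residual_binarize(w):
    """Order-2 residual approximation for salient weights: binarize w,
    then binarize the leftover residual and add the two approximations."""
    b1 = binarize(w)
    b2 = binarize(w - b1)
    return b1 + b2

def split_binarize(w, n_grid=50):
    """Non-salient weights: brute-force search for a break-point p that splits
    the bell-shaped distribution into a concentrated part (|w| <= p) and a
    sparse part (|w| > p), binarizing each part with its own scale."""
    best_err, best_q = np.inf, None
    for p in np.linspace(1e-6, np.abs(w).max(), n_grid):
        concentrated = np.abs(w) <= p
        q = np.zeros_like(w)
        q[concentrated] = binarize(w[concentrated])
        q[~concentrated] = binarize(w[~concentrated])
        err = float(np.sum((w - q) ** 2))
        if err < best_err:
            best_err, best_q = err, q
    return best_q

# Toy usage: one weight row with a hypothetical 10% salient mask
# (in BiLLM the mask comes from a Hessian-based, column-structured selection).
rng = np.random.default_rng(0)
w = rng.normal(size=1024)
salient = np.abs(w) >= np.quantile(np.abs(w), 0.9)

w_q = np.zeros_like(w)
w_q[salient] = residual_binarize(w[salient])
w_q[~salient] = split_binarize(w[~salient])
print("relative reconstruction error:", np.linalg.norm(w - w_q) / np.linalg.norm(w))
```

In BiLLM itself, the salient selection is column-structured, the break-point search and binarization run block-wise with error compensation, and the extra bits spent on the salient residual, bitmaps, and scaling factors are what raise the average cost from 1 bit to roughly 1.08 bits per weight.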

Main Findings

BiLLM demonstrates exceptional performance across various LLM families and evaluation metrics:

  • Achieves high-accuracy inference with weights averaging just 1.08 bits (a short note after this list sketches where such a fractional average comes from).
  • Outperforms state-of-the-art (SOTA) quantization methods by significant margins.
  • Enables efficient binarization of a 7-billion parameter LLM within 0.5 hours on a single GPU.
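As a rough intuition for the fractional average bit-width, consider a hypothetical breakdown (this is an illustration, not the paper's exact accounting, which also counts column bitmaps, grouping flags, and scaling factors): if a fraction r of weights is salient and stored with a 2-bit residual approximation while the remaining weights use 1 bit, the average is 1 + r bits.

```python
# Hypothetical accounting for intuition only; the paper's exact bookkeeping
# also includes column bitmaps, group flags, and scaling factors.
r_salient = 0.08                                  # assumed fraction of salient weights
avg_bits = 2 * r_salient + 1 * (1 - r_salient)    # salient: 2 bits, non-salient: 1 bit
print(avg_bits)                                   # 1.08 bits per weight on average
```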

For instance, on the WikiText2 dataset, BiLLM achieves a perplexity of 8.41 with the LLaMA2-70B model, outperforming prior ultra-low-bit quantization methods by a wide margin and even some smaller full-precision models.

Implications and Applications

The implications of BiLLM are profound. By reducing the memory and computational requirements of LLMs, it opens the door for deploying these powerful models in edge devices and resource-constrained environments. This can revolutionize applications in real-time language translation, voice assistants, and other NLP tasks where low latency and high efficiency are critical.

Conclusion

BiLLM represents a significant leap forward in the field of model compression and quantization. By pushing the limits of post-training quantization to 1-bit weights, it achieves a remarkable balance between performance and efficiency. As LLMs continue to evolve, innovations like BiLLM will be crucial in making these models more accessible and practical for a wider range of applications.

For those interested in exploring this technology further, the code for BiLLM is available on GitHub: BiLLM GitHub Repository.

References

  • Wei Huang, et al., "BiLLM: Pushing the Limit of Post-Training Quantization for LLMs", Proceedings of the 41st International Conference on Machine Learning, 2024.