GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Original Paper: https://arxiv.org/abs/2210.17323

By: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

Abstract:

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000).

Summary Notes

Figure: Quantizing OPT models to 4-bit and BLOOM models to 3-bit precision, comparing GPTQ with the FP16 baseline and round-to-nearest (RTN) quantization (Yao et al., 2022; Dettmers et al., 2022).

Introduction

Generative Pre-trained Transformer (GPT) models have set new benchmarks in various natural language processing tasks. However, their massive size and computational demands have made them impractical for many applications. Even inference with large GPT models can require multiple high-performance GPUs, limiting their accessibility. Enter GPTQ: a novel post-training quantization method that significantly reduces the computational and storage requirements of GPT models without compromising accuracy.

The Challenge

GPT models, such as GPT-3 with 175 billion parameters, require enormous computational resources. For instance, storing the parameters of GPT-3 even in a compact float16 format still demands about 326 GB of memory (see the quick calculation below). This exceeds the capacity of even the highest-end single GPUs, necessitating complex and expensive multi-GPU setups for inference. Existing compression techniques, particularly those involving low-bitwidth quantization, often fail to preserve model accuracy at higher compression rates or require extensive retraining, which is impractical at this scale.
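
Where does the 326 GB figure come from? A quick sanity check (the paper counts gigabytes in multiples of 1024):

```python
# Back-of-the-envelope memory footprint of GPT-3's weights in float16.
params = 175e9                 # 175 billion parameters
bytes_per_param = 2            # float16 = 2 bytes per weight
total_gib = params * bytes_per_param / 1024**3
print(f"{total_gib:.0f} GiB")  # -> 326 GiB, the figure cited above
```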

Introducing GPTQ: A Breakthrough in Quantization

GPTQ (Generative Pre-trained Transformer Quantization) addresses these challenges with a one-shot weight quantization method based on approximate second-order information. This method is both efficient and precise, enabling the quantization of models with up to 175 billion parameters in approximately four GPU hours. GPTQ reduces the bitwidth of weights to 3 or 4 bits with negligible accuracy degradation.

Key Methodologies

GPTQ employs a layer-wise quantization approach, where the weights of each layer are quantized independently. The process involves three key ingredients (a simplified code sketch follows the list):

  1. Layer-Wise Objective: Minimizing the squared error between the full-precision and quantized layer outputs.
  2. Optimal Brain Quantization (OBQ): An iterative method that quantizes weights one-by-one, adjusting the remaining weights to compensate for quantization errors.
  3. Cholesky Reformulation: Enhancing numerical stability by precomputing the necessary Hessian inverse information using Cholesky decomposition.
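
Concretely, for a linear layer with weight matrix W and calibration inputs X, GPTQ minimizes the layer-wise error ||W·X − Ŵ·X||² over the quantized weights Ŵ, whose Hessian is H = 2XX^T. Below is a minimal NumPy sketch of the resulting column-by-column procedure; it follows the spirit of the paper's algorithm but omits the blocked "lazy batch" updates and grouping, and quantize_rtn is a hypothetical round-to-nearest helper, not the paper's exact quantizer.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    # Hypothetical helper: symmetric round-to-nearest onto a 2^bits-level grid.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1) + 1e-12
    levels = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return levels * scale

def gptq_quantize_layer(W, X, bits=4, percdamp=0.01):
    """Quantize W (d_out x d_in) given calibration inputs X (d_in x n_samples)."""
    W = W.astype(np.float64).copy()
    H = 2.0 * (X @ X.T)                                        # Hessian of the layer-wise squared error
    H += percdamp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # dampening for numerical stability
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T              # upper Cholesky factor of H^-1
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):                                # quantize one column at a time
        w = W[:, i]
        q = quantize_rtn(w, bits)
        Q[:, i] = q
        err = (w - q) / Hinv[i, i]
        W[:, i:] -= np.outer(err, Hinv[i, i:])                 # compensate on remaining columns
    return Q
```

On a real model this runs once per linear layer over a small calibration set (the paper uses 128 random 2048-token segments of C4), and the paper's blocked updates keep the compensation step bandwidth-efficient on GPUs.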

Main Findings and Results

GPTQ achieves remarkable compression efficiency and accuracy:

  • Compression Gains: GPTQ compresses large models to 3-4 bits per weight with minimal accuracy loss. For instance, it quantizes the OPT-175B model to 3 bits while maintaining a WikiText2 perplexity of 8.68, compared to 8.34 for the full-precision FP16 model.
  • Inference on a Single GPU: For the first time, GPTQ enables generative inference with a 175 billion-parameter model on a single NVIDIA A100 GPU, dramatically reducing the hardware requirements (see the memory arithmetic after this list).
  • Speedups: The quantized models achieve substantial end-to-end speedups for generative inference over FP16: around 3.25x on high-end NVIDIA A100 GPUs and up to 4.5x on more cost-effective NVIDIA A6000 GPUs.
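
A rough weights-only calculation makes the single-GPU claim concrete (a sketch; real deployments also need memory for activations and the KV cache):

```python
# Weights-only memory for a 175B-parameter model at different precisions.
PARAMS = 175e9
for name, bits in [("FP16", 16), ("4-bit", 4), ("3-bit", 3)]:
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name}: {gib:.1f} GiB")
# FP16:  326.0 GiB -> far beyond any single GPU
# 4-bit:  81.5 GiB
# 3-bit:  61.1 GiB -> fits within a single 80 GB A100
```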

Implications and Applications

The ability to compress GPT models to such low bitwidths without significant accuracy loss opens up numerous possibilities:

  • Accessibility: Researchers and practitioners can now run state-of-the-art models on more affordable and less powerful hardware.
  • Efficiency: Reduced memory and computational requirements translate to lower operational costs and faster inference times, making real-time applications more feasible.
  • Scalability: GPTQ paves the way for deploying large-scale language models in edge devices and resource-constrained environments.

Conclusion

GPTQ represents a significant advancement in the field of model compression. By enabling accurate and efficient quantization of large GPT models, GPTQ makes these powerful tools more accessible and practical for a wide range of applications. Future work can explore further optimizations, such as activation quantization and enhanced mixed-precision support, to extend the benefits of GPTQ even further.

Quote from the Research Paper: "Our method more than doubles the compression gains relative to prior techniques, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU for generative inference."

Potential Applications:

  • Real-time language translation
  • Advanced virtual assistants
  • Scalable AI-driven analytics

The implementation of GPTQ is available for public use, encouraging further research and development in this promising area.

Call to Action:

Explore the full potential of GPTQ and revolutionize your AI applications by leveraging this cutting-edge quantization technique. Visit the GPTQ GitHub repository (https://github.com/IST-DASLab/gptq) for more details and access to the implementation.
