Original Paper: https://arxiv.org/abs/2306.00978
By: Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han
Abstract:
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resources pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% of salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not the weights. To avoid hardware-inefficient mixed-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.
Summary Notes
Figure: We introduce AWQ, a versatile weight quantization method for LLMs. To implement AWQ, we developed TinyChat to deploy 4-bit quantized LLMs on various edge platforms, achieving a 3-4× performance boost compared to FP16. Notably, we've also built a TinyChat-powered computer around an NVIDIA Jetson Orin Nano with only 8GB of memory and 15W power consumption.
Large Language Models (LLMs) have fundamentally transformed numerous AI applications, but their deployment on edge devices presents significant challenges due to their immense size and the limited hardware resources available. A new approach called Activation-aware Weight Quantization (AWQ) promises to address these challenges, enabling efficient on-device LLM deployment without sacrificing performance.
Introduction
Deploying LLMs on edge devices is crucial for reducing cloud computing costs and enhancing user privacy by keeping data local. However, the large size of these models makes on-device deployment difficult. AWQ offers a solution by effectively compressing LLMs through low-bit weight quantization. This method identifies and protects the most important weights in the model, significantly reducing quantization errors.
Key Methodologies
Understanding Weight Importance
AWQ is based on the observation that not all weights in an LLM are equally important. By focusing on a small fraction (0.1%-1%) of salient weights, which are identified based on the activation distribution rather than the weight distribution, AWQ can greatly reduce quantization errors. This insight is crucial as it shifts the focus from the traditional approach of looking at weight magnitudes to considering activation magnitudes.
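As a concrete illustration (not the authors' code), the sketch below ranks the input channels of a linear layer by their average absolute activation over a small calibration set and treats the top ~1% as salient. The tensor layout (`[num_tokens, in_features]`), the function name, and the 1% threshold are assumptions made for the example.

```python
import torch

def find_salient_channels(calib_acts: torch.Tensor, top_frac: float = 0.01) -> torch.Tensor:
    """Rank the input channels of a linear layer by average activation magnitude.

    calib_acts: [num_tokens, in_features] activations recorded on a small
    calibration set (illustrative data layout). Returns the indices of the
    top `top_frac` channels, which AWQ treats as salient.
    """
    # Per-channel importance = mean |activation| over all calibration tokens.
    channel_importance = calib_acts.abs().mean(dim=0)          # [in_features]
    num_salient = max(1, int(top_frac * channel_importance.numel()))
    return channel_importance.topk(num_salient).indices
```

Note that the ranking uses activation statistics only; sorting channels by weight magnitude instead would select a largely different (and, per the paper, much less helpful) set of channels.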
Scaling Salient Weights
To avoid the hardware inefficiency of mixed-precision quantization, AWQ instead protects salient channels through an equivalent transformation: the salient weight channels are scaled up (and the corresponding activations scaled down) before quantization, which the authors show mathematically reduces their relative quantization error while leaving the layer's full-precision output unchanged. The per-channel scales are derived from activation statistics collected offline on a small calibration set; because AWQ uses no backpropagation or reconstruction, the model preserves its generalization ability across domains and modalities without overfitting to that calibration set.
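A minimal sketch of this search, assuming a simplified symmetric, group-wise round-to-nearest quantizer (the actual AWQ implementation uses its own quantizer and fuses the inverse scale into the preceding operator). Here `calib_x` is the calibration activation matrix for the layer, and the `s = act_mag ** alpha` grid search over `alpha` in [0, 1] follows the scaling rule described in the paper; the function names, group size, and normalization are illustrative.

```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Simulated group-wise, symmetric round-to-nearest weight quantization.

    w: [out_features, in_features]; in_features must be divisible by group_size.
    Returns dequantized weights so the error can be measured in floating point.
    """
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    q_max = 2 ** (n_bits - 1) - 1                       # 7 for signed 4-bit
    scale = w_g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / q_max
    w_q = (w_g / scale).round().clamp(-q_max - 1, q_max) * scale
    return w_q.reshape(out_f, in_f)

def search_awq_scale(w: torch.Tensor, calib_x: torch.Tensor, n_grid: int = 20) -> torch.Tensor:
    """Grid-search per-input-channel scales s = act_mag ** alpha, alpha in [0, 1],
    keeping the alpha that minimizes the quantized layer's output error."""
    act_mag = calib_x.abs().mean(dim=0)                 # [in_features]
    fp_out = calib_x @ w.t()                            # full-precision reference output
    best_err, best_scale = float("inf"), None
    for alpha in torch.linspace(0, 1, n_grid):
        s = act_mag.clamp(min=1e-5) ** alpha
        s = s / (s.max() * s.min()).sqrt()              # normalize the scale range for stability
        w_q = pseudo_quantize(w * s)                    # scale salient channels up, then quantize
        err = ((calib_x / s) @ w_q.t() - fp_out).pow(2).mean()
        if err < best_err:
            best_err, best_scale = err.item(), s
    return best_scale
```

Because `(x / s) @ (w * s).T` equals `x @ w.T` in full precision, the transformation changes nothing mathematically; it only shifts quantization error away from the salient channels.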
TinyChat: The Inference Framework
Alongside AWQ, TinyChat is introduced as an efficient inference framework tailored for 4-bit on-device LLMs. TinyChat utilizes kernel fusion and platform-aware weight packing to achieve significant speedups, making it possible to deploy large models like the 70B Llama-2 on mobile GPUs.
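TinyChat's real weight layout is platform-specific and fused into its GPU kernels; the sketch below only illustrates the generic idea behind 4-bit weight packing, storing two unsigned 4-bit weights per byte so the quantized matrix occupies roughly a quarter of the FP16 footprint (scales and zero points aside). The interleaving chosen here is arbitrary, not TinyChat's.

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack unsigned 4-bit integers (values 0..15) two per byte.

    q: integer tensor whose last dimension is even; returns a uint8 tensor of
    half that width. Real kernels pick an interleaving that matches how the GPU
    dequantizes, which is the 'platform-aware' part this sketch omits.
    """
    assert q.shape[-1] % 2 == 0
    lo = q[..., 0::2] & 0xF          # even-indexed weights -> low nibble
    hi = q[..., 1::2] & 0xF          # odd-indexed weights  -> high nibble
    return (lo | (hi << 4)).to(torch.uint8)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the original 4-bit values."""
    lo = packed & 0xF
    hi = (packed >> 4) & 0xF
    return torch.stack([lo, hi], dim=-1).flatten(-2).to(torch.int32)
```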
Main Findings and Results
Superior Quantization Performance
Experiments show that AWQ outperforms existing quantization methods on various benchmarks, including language modeling and domain-specific tasks such as coding and math. For instance, AWQ achieves lower (better) perplexity than round-to-nearest (RTN) quantization and GPTQ on LLaMA and Llama-2 models across different sizes (7B, 13B, and 70B).
Efficient on Edge Devices
TinyChat, when combined with AWQ, translates the theoretical memory savings into measured speedup. On desktop, laptop, and mobile GPUs, TinyChat consistently achieved a 3.2-3.3× average speedup over Hugging Face's FP16 implementation across diverse LLMs. This demonstrates the practical viability of deploying large models on edge devices like NVIDIA Jetson Orin and Apple M1.
Robustness and Generalization
One of the significant advantages of AWQ is its robustness to the calibration set distribution. Unlike methods that may overfit to the calibration set, AWQ maintains performance across different domains and modalities. This robustness is demonstrated in experiments where AWQ needed a calibration set roughly 10× smaller than GPTQ's to achieve comparable performance.
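For context, the "calibration set" is just a handful of text sequences run through the model once to record each layer's input activations. A minimal hook-based sketch of that collection step (the model, layer name, and batch format are assumptions for illustration):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def collect_calib_acts(model: nn.Module, layer_name: str, calib_batches) -> torch.Tensor:
    """Record the inputs fed to one linear layer while running a small
    calibration set through the model."""
    layer = dict(model.named_modules())[layer_name]
    recorded = []

    def hook(module, inputs, output):
        # inputs[0]: [..., in_features]; flatten all leading token dimensions.
        recorded.append(inputs[0].detach().reshape(-1, inputs[0].shape[-1]).cpu())

    handle = layer.register_forward_hook(hook)
    for batch in calib_batches:
        model(batch)
    handle.remove()
    return torch.cat(recorded, dim=0)    # [num_tokens, in_features]
```

The resulting activation matrix is exactly what the earlier sketches consume to pick salient channels and search the per-channel scales.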
Implications and Applications
Real-World Applications
The ability to deploy LLMs efficiently on edge devices opens up numerous applications. For instance, virtual assistants and chatbots can operate offline, providing real-time responses without the need for cloud connectivity. Autonomous vehicles can process data locally, reducing latency and improving safety.
Democratizing AI
AWQ and TinyChat democratize the deployment of large models on a wide range of devices, from powerful GPUs to resource-constrained IoT devices like Raspberry Pi. This flexibility ensures that advanced AI capabilities are accessible to a broader audience, fostering innovation across various fields.
Conclusion
Activation-aware Weight Quantization (AWQ) represents a significant advancement in the quest to deploy LLMs on edge devices. By intelligently identifying and protecting the most important weights, AWQ achieves impressive quantization performance without compromising the model's generalization ability. Coupled with TinyChat, this approach offers a practical solution for bringing the power of large language models to a wide array of applications, paving the way for more efficient and accessible AI.