LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Original Paper: https://arxiv.org/abs/2310.05736

By: Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu

Abstract:

Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.

With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens.

To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between language models.

We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23, showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at this https URL.

Summary Notes


LLMLingua: Streamlining Large Language Model Inference with Prompt Compression

In the dynamic realm of artificial intelligence, Large Language Models (LLMs) like GPT-3 are revolutionizing the way machines comprehend and generate human language. However, the complexity and size of these models bring significant computational costs.

Enter LLMLingua, an approach that compresses prompts to speed up model inference and reduce cost while maintaining performance. This post explains how LLMLingua works and why it matters, offering insights for AI Engineers in enterprise settings who want to use LLMs efficiently.

Understanding LLMLingua

LLMLingua introduces a coarse-to-fine prompt compression technique that shortens prompts to save computational resources during model inference. This matters because techniques such as chain-of-thought prompting and in-context learning push prompts to ever greater lengths, sometimes exceeding tens of thousands of tokens.
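
For a concrete feel of how this is used in practice, here is a minimal usage sketch. It assumes the authors' open-source llmlingua Python package (installed with pip install llmlingua) and its PromptCompressor interface; parameter names and defaults may differ across package versions, so treat this as a sketch rather than a definitive reference.

```python
# Minimal usage sketch (assumes `pip install llmlingua`; API details may vary by version).
from llmlingua import PromptCompressor

# Toy prompt parts used only for illustration.
instruction = "Answer the math word problem step by step."
demonstrations = ["Q: Tom has 3 apples and buys 2 more. How many does he have? A: 3 + 2 = 5. The answer is 5."]
question = "Q: A train travels 60 km in 1.5 hours. What is its average speed?"

# Loads a small causal LM that scores tokens for removal (downloads weights on first use).
compressor = PromptCompressor()

result = compressor.compress_prompt(
    demonstrations,
    instruction=instruction,
    question=question,
    target_token=200,  # rough token budget for the compressed prompt
)

print(result["compressed_prompt"])  # send this to the target LLM instead of the full prompt
```

Keeping the instruction and question as explicit arguments lets the compressor treat them differently from the demonstrations, which is exactly the role of the budget controller described below.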

Core Features of LLMLingua:

  • Budget Controller: Dynamically adjusts compression ratios for different prompt parts (instructions, demonstrations, question), ensuring semantic meaning is retained through coarse compression.
  • Iterative Token-level Compression: Compresses the prompt segment by segment, using the small model's perplexity to decide which tokens to keep, so that dependencies between retained tokens are modeled and crucial details are not lost (a simplified sketch of this perplexity-based filtering follows this list).
  • Distribution Alignment: Aligns the distribution of a smaller model used in initial compression steps with the target LLM through fine-tuning, making sure compressed prompts remain effective.
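
To make the token-level step concrete, below is a simplified, single-pass sketch of perplexity-based filtering using a small Hugging Face causal LM (GPT-2 as a stand-in for the compressor model). This is not the paper's exact algorithm: LLMLingua works iteratively over segments, conditions each segment on the already-compressed prefix, and applies the budget controller's per-section ratios. The sketch only illustrates the core idea of dropping tokens the small model finds easy to predict.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def compress(text: str, keep_ratio: float = 0.5, model_name: str = "gpt2") -> str:
    """Drop the tokens a small causal LM finds most predictable; keep the rest in order."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]  # (1, seq_len)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)

    # Surprisal of each token given its left context (the first token has no
    # context, so it is always kept).
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]  # (seq_len - 1,)

    # Keep the most surprising tokens up to the budget, preserving original order.
    budget = max(1, int(keep_ratio * nll.numel()))
    keep = torch.zeros_like(nll, dtype=torch.bool)
    keep[nll.topk(budget).indices] = True

    kept_ids = [input_ids[0, 0].item()] + [
        tid.item() for tid, k in zip(input_ids[0, 1:], keep) if k
    ]
    return tokenizer.decode(kept_ids)


print(compress("Let's think step by step about how many apples Tom has left.", keep_ratio=0.5))
```

In the spirit of the budget controller, one would call compress() with a high keep_ratio for the instruction and question and a much lower one for the demonstrations, so that aggressive compression is concentrated where redundancy is highest.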

Performance Validation

LLMLingua was evaluated on reasoning, in-context learning, conversation, and summarization tasks across four datasets (GSM8K, BBH, ShareGPT, and Arxiv-March23), showing:

  • Up to 20x compression rates with minimal performance loss.
  • Superiority over existing methods, effectively preserving prompt information and reasoning even at high compression levels.
  • The critical role of distribution alignment for maintaining compressed prompt quality.

Addressing Challenges and Future Directions

While LLMLingua excels, it faces hurdles at very high compression ratios on complex tasks, suggesting a limit to how far a prompt can be compressed without hurting downstream results. In addition, the small compressor model and the target LLM use different tokenizers, so token-length estimates made with one can diverge from the other, complicating budget control.
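
As a small illustration of the tokenizer issue, the same text can produce different token counts under the compressor model's tokenizer and the target LLM's tokenizer. The models below are stand-ins chosen only for this example (GPT-2 via transformers and OpenAI's cl100k_base encoding via tiktoken), not the paper's exact setup:

```python
# Token counts for the same text under two different tokenizers.
# Requires: pip install transformers tiktoken (stand-in models for illustration only).
from transformers import AutoTokenizer
import tiktoken

text = "Chain-of-thought prompts can easily grow to tens of thousands of tokens."

small_tok = AutoTokenizer.from_pretrained("gpt2")   # small compressor model's tokenizer
target_tok = tiktoken.get_encoding("cl100k_base")   # target LLM's tokenizer (stand-in)

print("small model tokens :", len(small_tok(text)["input_ids"]))
print("target model tokens:", len(target_tok.encode(text)))
```

A budget computed with one tokenizer can therefore over- or under-shoot when measured with the other, which is why length estimation across models needs care.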

Moving Forward

LLMLingua offers a path towards reducing LLM operational costs without compromising performance, adaptable across various tasks.

It's a step forward in making LLMs more accessible and efficient for broad use. For AI Engineers, this could mean significant cost savings and more applications for LLMs.

Continued research in prompt compression will remain vital for the scalable use of LLMs.

In summary, LLMLingua addresses the growing computational demands of LLMs, opening new avenues for their application. By combining LLM capabilities with compressed prompts, it sets the stage for broader, more efficient AI advancements.
