Original Paper: https://arxiv.org/abs/2407.14057
By: Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
Abstract
The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate the generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the LLama 2 7B model by 2.34x while maintaining accuracy.
Summary Notes
Figure 3: Comparison between standard LLM and LazyLLM. Instead of computing the KV cache of all input tokens at the prefilling stage, LazyLLM only selectively computes the tokens that are important to the next token prediction, deferring the computation of remaining tokens to later steps. LazyLLM significantly optimizes TTFT by reducing the amount of computation during prefilling. Moreover, as some tokens in the prompt are never selected by LazyLLM during the whole generation process (even though theoretically the model could use all tokens in the prompt), LazyLLM also reduces the total amount of computation and accelerates the overall generation.
Introduction
In the fast-evolving world of large language models (LLMs), the efficiency of model inference is becoming increasingly critical.
One of the main challenges in transformer-based LLMs, especially under long-context scenarios, is the significant delay in generating the first token, known as the "time-to-first-token" (TTFT).
This delay can be a bottleneck in many applications, affecting user experience and overall system performance.
To address this issue, a novel technique called LazyLLM has been introduced, which dynamically prunes tokens to optimize the inference process, thereby reducing TTFT and improving overall efficiency without compromising accuracy.
Key Methodologies
Understanding LLM Inference
LLM inference typically consists of two sequential stages: prefilling and decoding. During the prefilling stage, the model processes the entire prompt to compute the key-value (KV) cache for all tokens, which is then used to generate the first token.
The subsequent decoding stage iteratively reuses this KV cache to predict the next tokens until the end of the sequence.
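As a rough illustration of these two stages, here is a minimal sketch of the standard prefill-then-decode loop using the Hugging Face transformers API (`gpt2` is used only as a small stand-in checkpoint; any causal LM with KV caching follows the same pattern):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "LazyLLM defers the KV computation of unimportant prompt tokens."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefilling: one forward pass over the whole prompt builds the KV cache
    # and produces the logits used to pick the first token (this is TTFT).
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(20):
        # Decoding: each step feeds only the newest token and reuses the cache.
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```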
The Bottleneck: Time-to-First-Token (TTFT)
For long prompts, the prefilling stage can be particularly time-consuming due to the depth and width of modern transformer architectures.
For instance, for the Llama 2 7B model, with its 32 transformer layers and 7 billion parameters, the prefilling stage takes about 21 times the walltime of each subsequent decoding step, accounting for roughly 23% of the total generation time.
This makes optimizing TTFT a critical path towards efficient LLM inference.
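To see where the time goes on a particular setup, the two stages can be timed separately. A minimal sketch, again using `gpt2` as a stand-in with an artificially long prompt (the exact ratio depends heavily on model size, hardware, and prompt length):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("long prompt " * 300, return_tensors="pt").input_ids

with torch.no_grad():
    t0 = time.perf_counter()
    out = model(input_ids, use_cache=True)  # prefilling over the whole prompt
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    ttft = time.perf_counter() - t0

    past, step_times = out.past_key_values, []
    for _ in range(16):  # a handful of decoding steps for comparison
        t0 = time.perf_counter()
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        step_times.append(time.perf_counter() - t0)

avg_step = sum(step_times) / len(step_times)
print(f"TTFT = {ttft:.3f}s, about {ttft / avg_step:.1f}x one decoding step")
```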
Introducing LazyLLM
LazyLLM addresses the TTFT bottleneck by dynamically pruning tokens during both the prefilling and decoding stages.
Unlike static pruning methods that drop tokens all at once, LazyLLM selectively computes the KV cache only for tokens that are crucial for the next token prediction, deferring the computation of less important tokens to later steps.
This dynamic approach allows the model to adjust the token set at each generation step, which is essential for maintaining accuracy.
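A toy sketch of that difference (the importance scores here are random placeholders for the attention-based signal described in the next section): dynamic selection recomputes the kept subset at every generation step, so a token dropped earlier can be "revived" later, whereas static pruning commits to one subset up front.

```python
import torch

torch.manual_seed(0)
num_prompt_tokens, keep_ratio = 12, 0.5
k = int(num_prompt_tokens * keep_ratio)

static_keep = None
for step in range(3):
    # Placeholder per-step importance scores; in LazyLLM these come from the
    # attention paid by the current query position to each prompt token.
    importance = torch.rand(num_prompt_tokens)
    dynamic_keep = set(torch.topk(importance, k).indices.tolist())

    if static_keep is None:
        static_keep = dynamic_keep  # static pruning fixes the subset once
    print(f"step {step}: dynamic keeps {sorted(dynamic_keep)}, "
          f"static keeps {sorted(static_keep)}")
```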
Main Findings and Results
Token Importance and Pruning Strategy
LazyLLM leverages the attention scores from previous transformer layers to determine the importance of each token. By progressively pruning tokens that are less important, LazyLLM can significantly reduce the computational load during the prefilling stage.
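A minimal sketch of how an attention-based importance score and layer-wise progressive pruning could look (the keep ratios, layer indices, and random attention maps below are illustrative assumptions, not values from the paper):

```python
import torch

def token_importance(attn_probs: torch.Tensor) -> torch.Tensor:
    """attn_probs: (heads, seq, seq) softmax attention from one layer.
    A token's importance is the attention it receives from the last
    (next-token-predicting) position, averaged over heads."""
    return attn_probs[:, -1, :].mean(dim=0)  # shape (seq,)

def keep_mask(importance: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top keep_ratio fraction of tokens; the final position
    (the one producing the next token) is always kept."""
    seq = importance.shape[0]
    k = max(1, int(seq * keep_ratio))
    mask = torch.zeros(seq, dtype=torch.bool)
    mask[torch.topk(importance, k).indices] = True
    mask[-1] = True
    return mask

# Toy usage: prune progressively harder at deeper layers (assumed ratios).
torch.manual_seed(0)
heads = 8
hidden = torch.randn(16, 64)  # 16 prompt tokens, toy hidden size 64
for layer, ratio in [(0, 1.0), (8, 0.7), (16, 0.4)]:
    n = hidden.shape[0]
    attn = torch.softmax(torch.randn(heads, n, n), dim=-1)  # stand-in attention
    hidden = hidden[keep_mask(token_importance(attn), ratio)]
    print(f"after pruning at layer {layer}: {hidden.shape[0]} tokens remain")
```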
An auxiliary cache (Aux Cache) is used to store the hidden states of pruned tokens, enabling their efficient retrieval if they become relevant in later steps.
This ensures that each token is computed at most once, maintaining computational efficiency.
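The bookkeeping can be pictured as a small per-layer side store. A minimal sketch (the `AuxCache` class and its methods are hypothetical names for illustration, not the paper's implementation):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple
import torch

@dataclass
class AuxCache:
    """Hidden states of pruned tokens, keyed by (layer, token position)."""
    states: Dict[Tuple[int, int], torch.Tensor] = field(default_factory=dict)

    def stash(self, layer: int, pos: int, h: torch.Tensor) -> None:
        # Called when a token is pruned before this layer: park its hidden state.
        self.states[(layer, pos)] = h

    def fetch(self, layer: int, pos: int) -> Optional[torch.Tensor]:
        # Remove on retrieval: once the token re-enters the compute path its KV
        # lands in the regular cache, so the stashed state is no longer needed.
        return self.states.pop((layer, pos), None)

# Toy usage: token at position 5 is pruned before layer 3, then needed later.
cache = AuxCache()
cache.stash(layer=3, pos=5, h=torch.randn(64))

h = cache.fetch(layer=3, pos=5)
if h is None:
    print("recompute token 5 from the embedding up to layer 3")
else:
    print("resume token 5 at layer 3 from its stashed hidden state:", h.shape)
```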
Performance Evaluation
Empirical evaluations on 16 standard datasets across various tasks demonstrated that LazyLLM could be integrated with existing LLMs like Llama 2 and XGen without any fine-tuning, significantly accelerating the inference process.
For example, in multi-document question-answering tasks, LazyLLM achieves a 2.34× speedup in the prefilling stage while maintaining accuracy, showcasing its effectiveness and efficiency.
Implications and Potential Applications
Universal Integration
One of the standout features of LazyLLM is its universality. It can be seamlessly integrated with any existing transformer-based LLM, enhancing inference speed without requiring any parameter modifications or fine-tuning.
This makes it a highly versatile solution for a wide range of applications, from chatbots and virtual assistants to more complex AI-driven systems.
Training-Free Efficiency
LazyLLM's training-free nature means that it can be directly applied to existing models, providing immediate benefits in terms of reduced TTFT and overall generation time.
This can lead to more responsive AI systems, improving user experience and enabling more efficient deployment of LLMs in production environments.
Real-World Applications
In real-world scenarios, the ability to dynamically prune tokens can lead to substantial improvements in various applications. For instance:
- Customer Support: Faster response times in chatbots can enhance user satisfaction.
- Content Generation: Reduced latency in generating long-form content can boost productivity.
- Interactive AI Systems: More responsive interactions in virtual assistants and interactive storytelling applications.
Conclusion
LazyLLM represents a significant advancement in the quest for efficient LLM inference. By dynamically pruning tokens based on their importance, LazyLLM reduces the computational burden during the prefilling stage, directly addressing the TTFT bottleneck.
This results in faster, more efficient language model inference without sacrificing accuracy, making it a valuable tool for a wide range of applications.
As LLMs continue to grow in complexity and capability, innovations like LazyLLM will be essential in ensuring that these models remain practical and efficient in real-world deployments.