LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Original Paper: https://arxiv.org/abs/2407.14057

By: Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi

Abstract

The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token