Prompt Cache: Modular Attention Reuse for Low-Latency Inference
Original Paper: https://arxiv.org/abs/2311.04934
By: In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
Abstract: We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context.
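
To make the core idea concrete, here is a minimal sketch (not the paper's implementation) of the simpler special case: reusing precomputed attention (KV) states for a text segment shared as a common prefix across prompts, using Hugging Face transformers' `past_key_values`. The model name and prompt strings are placeholders chosen for illustration.

```python
# Minimal sketch, assuming Hugging Face transformers with KV-cache reuse via
# past_key_values. This only covers shared-*prefix* reuse; Prompt Cache itself
# generalizes to schema-defined segments ("prompt modules") that can appear at
# different positions in a prompt. Model name and prompts are placeholders.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# A segment that many prompts share verbatim at the start,
# e.g. a system message, prompt template, or context document.
shared_prefix = "System: You are a helpful assistant.\n"

# Precompute and store the attention (KV) states of the shared segment once.
prefix_inputs = tokenizer(shared_prefix, return_tensors="pt")
with torch.no_grad():
    prefix_cache = model(**prefix_inputs, use_cache=True).past_key_values

def generate_with_cached_prefix(user_text: str, max_new_tokens: int = 32) -> str:
    """Answer a prompt that begins with shared_prefix, reusing its cached KV states."""
    full_inputs = tokenizer(shared_prefix + user_text, return_tensors="pt")
    # Copy the cache so repeated calls do not mutate the precomputed states.
    past = copy.deepcopy(prefix_cache)
    with torch.no_grad():
        out = model.generate(
            **full_inputs,
            past_key_values=past,   # skips recomputing the shared prefix
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )
    # Strip the echoed prompt tokens and return only the newly generated text.
    new_tokens = out[0, full_inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(generate_with_cached_prefix("User: Why does reusing attention states reduce latency?\nAssistant:"))
```

The paper's contribution goes beyond this prefix-only case: its schema-defined prompt modules let cached segments be reused even when they appear at different positions across prompts, while keeping positional information consistent.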