KV Cache Tricks for Faster Language Models

2025-01-28

Text generation with large language models (LLMs) is slow largely because of the computational cost of self-attention: without caching, every decoding step recomputes attention over the entire prefix, so generating n tokens costs O(n³) in total. This article explores KV caching and techniques for optimizing it. KV caching stores each token's key and value vectors so they are computed only once, cutting total generation cost from O(n³) to O(n²); the trade-off is that the cache's memory footprint grows with sequence length and can become substantial. The article delves into 11 papers proposing optimizations along three lines: token selection and pruning based on attention scores, post-hoc compression techniques, and architectural redesigns such as Multi-head Latent Attention (MLA). These aim to balance memory usage and computational efficiency, ultimately making models like ChatGPT generate text faster and more efficiently.
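To make the core idea concrete, here is a minimal single-head sketch of a KV cache in NumPy. The class and names (`KVCache`, `step`, `d_model`) are illustrative assumptions, not from any of the surveyed papers: at each decoding step the new token's key and value are appended to the cache, and attention is computed only for the new query against the stored keys, so each step costs O(n) rather than O(n²).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Toy single-head KV cache: append the new token's key/value each step."""
    def __init__(self, d_model):
        self.d_model = d_model
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def step(self, q_new, k_new, v_new):
        # Append this token's key/value; earlier entries are reused, never recomputed.
        self.keys = np.vstack([self.keys, k_new])
        self.values = np.vstack([self.values, v_new])
        # Attention for the new token only: one query against all cached keys.
        scores = self.keys @ q_new / np.sqrt(self.d_model)   # shape (n,)
        weights = softmax(scores)
        return weights @ self.values                          # shape (d_model,)

# Usage: decode 5 dummy tokens one at a time with random projections.
rng = np.random.default_rng(0)
cache = KVCache(d_model=8)
for t in range(5):
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
print(cache.keys.shape)  # (5, 8) -- the cache grows linearly with sequence length
```

The final print highlights the memory issue the surveyed papers attack: the cache grows by one key and one value per token per layer per head, which is exactly what pruning, compression, and latent-attention redesigns try to shrink.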