SepLLM: Inference Acceleration for LLMs by Compressing Seemingly Meaningless Tokens
2025-03-06

Large Language Models (LLMs) face significant inference challenges from the quadratic cost of attention and the ever-growing KV cache. The SepLLM authors observe that certain seemingly meaningless separator tokens (e.g., commas, periods, and newlines) receive disproportionately high attention scores. Based on this observation, they propose SepLLM, a framework that accelerates inference by compressing the information of each segment into the separator token that follows it and dropping the now-redundant segment tokens from the KV cache. Experiments show SepLLM achieves over a 50% reduction in KV cache on the GSM8K-CoT benchmark with negligible performance loss, using Llama-3-8B as the backbone. In streaming settings, SepLLM effectively processes sequences of 4 million tokens or more.
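
To make the selection rule concrete, below is a minimal sketch of which cached positions a SepLLM-style cache would retain, assuming the paper's three retained categories: initial (attention-sink) tokens, separator tokens, and a recent local window. The separator set, function name, and window sizes are hypothetical illustration values, not the authors' actual configuration or code.

```python
# Hypothetical separator vocabulary; the real method works on tokenizer ids.
SEPARATORS = {".", ",", ";", "!", "?", "\n"}

def keep_mask(tokens: list[str], n_initial: int = 4, n_recent: int = 64) -> list[bool]:
    """Return True at positions whose KV entries are kept in the cache."""
    n = len(tokens)
    keep = [False] * n
    for i, tok in enumerate(tokens):
        if i < n_initial:
            keep[i] = True          # initial tokens act as attention sinks
        elif tok in SEPARATORS:
            keep[i] = True          # separators summarize their preceding segment
        elif i >= n - n_recent:
            keep[i] = True          # recent window preserves local context
    return keep

tokens = "The answer is 42 . Next , solve the second question".split()
print(keep_mask(tokens, n_initial=1, n_recent=3))
```

Because every position between two separators is evicted once its segment is summarized, cache size grows roughly with the number of segments rather than the number of tokens, which is what enables the long streaming contexts reported above.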