Why LLMs Catastrophically Fail on Long Conversations: Attention Sinks and StreamingLLM

2025-08-09

Researchers discovered why large language models (LLMs) fail catastrophically on long conversations: evicting old tokens from the KV cache to save memory causes models to produce gibberish. They found that models dump massive attention onto the first few tokens, treating them as "attention sinks" – places to park unused attention, since softmax forces the weights to sum to 1. Their fix, StreamingLLM, simply keeps the first 4 tokens permanently while sliding the window over everything else, enabling stable processing of 4 million+ tokens instead of just a few thousand. The mechanism has since been adopted in HuggingFace, NVIDIA TensorRT-LLM, and OpenAI's open-source models, highlighting the practical impact of this research.
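
To make the eviction policy concrete, here is a minimal sketch of the idea (not the actual StreamingLLM or HuggingFace API): keep a few permanent "sink" positions at the start of the KV cache plus a sliding window of recent positions, and drop everything in between. The function and parameter names (`evict_kv_cache`, `num_sink_tokens`, `window_size`) are hypothetical.

```python
def evict_kv_cache(cache_positions, num_sink_tokens=4, window_size=1024):
    """Return the token positions to keep in the KV cache.

    cache_positions: list of token indices currently cached, in order.
    Keeps the first `num_sink_tokens` positions (attention sinks) plus
    the most recent `window_size` positions; evicts the middle.
    """
    if len(cache_positions) <= num_sink_tokens + window_size:
        return cache_positions  # cache still fits, nothing to evict

    sinks = cache_positions[:num_sink_tokens]   # permanently kept attention sinks
    recent = cache_positions[-window_size:]     # sliding window of recent tokens
    return sinks + recent


# Example: with 4 sinks and a window of 6, a 12-token cache keeps
# positions [0, 1, 2, 3] plus [6, 7, 8, 9, 10, 11].
print(evict_kv_cache(list(range(12)), num_sink_tokens=4, window_size=6))
```

The key point is that the sink tokens are kept for their role as attention dumping grounds, not for their content; without them, the softmax redistributes that parked attention onto other tokens and generation quality collapses.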

AI