Why LLMs Catastrophically Fail on Long Conversations: Attention Sinks and StreamingLLM

2025-08-09

Researchers discovered why large language models (LLMs) fail catastrophically on long conversations: evicting old tokens from the KV cache to save memory causes models to produce gibberish. They found that models dump massive attention onto the first few tokens, treating them as "attention sinks" – places to park unused attention, since softmax forces the weights to sum to 1. Their fix, StreamingLLM, simply keeps the first 4 tokens permanently while sliding the window over everything else, enabling stable processing of 4 million+ tokens instead of just a few thousand. The mechanism has since been adopted in HuggingFace, NVIDIA TensorRT-LLM, and OpenAI's open-source models, highlighting the practical impact of this research.
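
To make the eviction policy concrete, here is a minimal sketch of the idea (not the actual StreamingLLM or HuggingFace API): keep a few permanent "sink" positions at the start of the KV cache plus a sliding window of recent positions, and drop everything in between. The function and parameter names (`evict_kv_cache`, `num_sink_tokens`, `window_size`) are hypothetical.

```python
def evict_kv_cache(cache_positions, num_sink_tokens=4, window_size=1024):
    """Return the token positions to keep in the KV cache.

    cache_positions: list of token indices currently cached, in order.
    Keeps the first `num_sink_tokens` positions (attention sinks) plus
    the most recent `window_size` positions; evicts the middle.
    """
    if len(cache_positions) <= num_sink_tokens + window_size:
        return cache_positions  # cache still fits, nothing to evict

    sinks = cache_positions[:num_sink_tokens]   # permanently kept attention sinks
    recent = cache_positions[-window_size:]     # sliding window of recent tokens
    return sinks + recent


# Example: with 4 sinks and a window of 6, a 12-token cache keeps
# positions [0, 1, 2, 3] plus [6, 7, 8, 9, 10, 11].
print(evict_kv_cache(list(range(12)), num_sink_tokens=4, window_size=6))
```

The key point is that the sink tokens are kept for their role as attention dumping grounds, not for their content; without them, the softmax redistributes that parked attention onto other tokens and generation quality collapses.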

AI