Why LLMs Catastrophically Fail on Long Conversations: Attention Sinks and StreamingLLM

2025-08-09

Researchers discovered why large language models (LLMs) catastrophically fail on long conversations: evicting old tokens from the cache to save memory causes models to produce complete gibberish. They found that models dump massive attention onto the first few tokens, treating them as "attention sinks" – places to park unused attention, since softmax forces the weights to sum to 1. Their solution, StreamingLLM, simply keeps the first 4 tokens permanently while sliding the window over everything else, enabling stable processing of 4 million+ tokens instead of just thousands. The mechanism has since shipped in HuggingFace and NVIDIA TensorRT-LLM, and OpenAI's open-source models use a similar attention sink, highlighting the practical impact of this research.
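
To make the idea concrete, here is a minimal sketch (not the authors' code) of the eviction policy the summary describes: the KV cache keeps the first few sink tokens forever plus a sliding window of recent tokens, and drops everything in between. The function name and the 1,020-token window size are illustrative assumptions.

```python
def streaming_kv_indices(seq_len: int, num_sinks: int = 4, window: int = 1020) -> list[int]:
    """Return the token positions StreamingLLM-style eviction would keep in the KV cache."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                      # everything still fits
    sinks = list(range(num_sinks))                       # attention sinks: tokens 0..num_sinks-1, kept permanently
    recent = list(range(seq_len - window, seq_len))      # sliding window of most recent tokens
    return sinks + recent

# Example: at position 10,000 the cache holds tokens 0-3 plus 8,980-9,999.
keep = streaming_kv_indices(10_000)
assert keep[:4] == [0, 1, 2, 3] and len(keep) == 1024
```

The key design point is that the sink tokens are kept by position, not by content: they matter only because softmax needs somewhere to put attention mass, so evicting them destabilizes every subsequent step.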


SVDQuant: 3x Speedup on Blackwell GPUs with NVFP4

2025-02-22

MIT researchers have developed SVDQuant, a novel 4-bit quantization paradigm that uses a low-rank branch to absorb weight outliers, yielding significant performance gains on NVIDIA's Blackwell GPU architecture. Using the NVFP4 format, SVDQuant achieves better image quality than INT4 and runs 3x faster than BF16 while cutting memory usage 3.5x. The research is open-sourced and includes an interactive demo.
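
A rough sketch of the core decomposition, assuming a PyTorch-style workflow: the weight matrix is split into a high-precision low-rank branch (via SVD) that absorbs outliers, plus a residual that is quantized to 4 bits. The `svdquant_decompose` name, the rank of 32, and the symmetric INT4 fake-quantization standing in for NVFP4 are all illustrative assumptions, not the paper's implementation.

```python
import torch

def svdquant_decompose(W: torch.Tensor, rank: int = 32):
    """Split W into a 16-bit low-rank branch plus a 4-bit residual (illustrative)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]  # low-rank branch, kept in high precision
    R = W - L                                              # residual, with the worst outliers absorbed by L
    scale = R.abs().max() / 7                              # symmetric 4-bit range: [-8, 7]
    R_q = torch.clamp((R / scale).round(), -8, 7)          # fake-quantized 4-bit residual
    return L, R_q, scale

W = torch.randn(512, 512)
L, R_q, scale = svdquant_decompose(W)
W_hat = L + R_q * scale
print(f"relative reconstruction error: {(W - W_hat).norm() / W.norm():.4f}")
```

The design intuition is that quantization error is dominated by a handful of large singular directions; pulling those into a cheap low-rank branch leaves a residual whose range a 4-bit grid can cover much more accurately.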
