Native Sparse Attention: Hardware-Aligned and Natively Trainable
2025-08-02

Long-context modeling remains a core challenge in NLP. This ACL 2025 paper introduces NSA, a natively trainable sparse attention mechanism that combines algorithmic innovation with hardware-aligned optimization. Using a dynamic hierarchical sparse strategy (coarse-grained token compression paired with fine-grained token selection), NSA achieves significant efficiency gains while preserving both global context awareness and local precision. Because the sparsity pattern is trained end to end, NSA reduces pre-training cost, and it matches or exceeds Full Attention models across benchmarks while delivering substantial speedups on 64k-length sequences in decoding, forward propagation, and backward propagation.
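To make the hierarchical idea concrete, here is a minimal NumPy sketch of coarse compression plus fine-grained block selection for a single query at decode time. It is an illustration under simplifying assumptions, not the paper's implementation: `block`, `topk`, and `window` are made-up hyperparameters, compression is plain mean pooling, and the three branches are averaged uniformly instead of combined through NSA's learned gates; none of the hardware-aligned kernel work is represented.

```python
# Sketch of hierarchical sparse attention for one query (illustrative only).
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(q, K, V):
    """Standard scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def hierarchical_sparse_attention(q, K, V, block=32, topk=4, window=64):
    T, d = K.shape

    # 1) Coarse branch: compress each block of keys/values (mean pooling here).
    n_blocks = T // block
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    coarse_out = attend(q, Kc, Vc)

    # 2) Fine branch: rank blocks by the query's affinity to the compressed keys,
    #    then attend over the original tokens of the top-k blocks only.
    block_scores = Kc @ q
    selected = np.argsort(block_scores)[-topk:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in selected])
    fine_out = attend(q, K[idx], V[idx])

    # 3) Local branch: sliding window over the most recent tokens.
    local_out = attend(q, K[-window:], V[-window:])

    # NSA combines branches with learned gates; a uniform average stands in here.
    return (coarse_out + fine_out + local_out) / 3.0

rng = np.random.default_rng(0)
T, d = 1024, 64
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
print(hierarchical_sparse_attention(q, K, V).shape)  # (64,)
```

The efficiency gain comes from the fine branch touching only `topk * block` tokens plus a short local window, rather than all `T` keys, while the coarse branch keeps a summary view of the full context.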