Tensor Product Attention: All You Need

2025-01-22

Scaling language models to handle longer input sequences typically requires large key-value (KV) caches, resulting in substantial memory overhead during inference. This paper proposes Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly reducing KV cache size during inference. By factorizing these representations into contextual low-rank components (contextual factorization) and integrating seamlessly with rotary position embeddings (RoPE), TPA improves model quality while maintaining memory efficiency. Based on TPA, the authors introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Extensive empirical evaluation on language modeling tasks demonstrates that T6 surpasses standard Transformer baselines including multi-head attention (MHA), multi-query attention (MQA), grouped-query attention (GQA), and multi-head latent attention (MLA) across various metrics, including perplexity and a range of well-known evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. Code is available.
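To make the core idea concrete, here is a minimal NumPy sketch of the contextual factorization TPA describes: each token's per-head key tensor is represented as a sum of rank-1 outer products between a head-dimension factor and a feature-dimension factor, and only those factors would need to sit in the KV cache. The projection names (`W_a`, `W_b`), shapes, and the 1/rank scaling here are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_heads, d_head, rank = 64, 4, 16, 2
seq_len = 8

# Hypothetical linear maps producing the contextual low-rank factors
# (in the paper these are learned projections of the token activations).
W_a = rng.standard_normal((d_model, rank * n_heads)) / np.sqrt(d_model)
W_b = rng.standard_normal((d_model, rank * d_head)) / np.sqrt(d_model)

def factorize(x):
    """Map activations x of shape (seq, d_model) to per-token factors."""
    a = (x @ W_a).reshape(-1, rank, n_heads)  # head-axis factors
    b = (x @ W_b).reshape(-1, rank, d_head)   # feature-axis factors
    return a, b

def reconstruct(a, b):
    """Sum of rank-1 outer products -> (seq, n_heads, d_head) tensor."""
    return np.einsum("trh,trd->thd", a, b) / rank

x = rng.standard_normal((seq_len, d_model))
a_k, b_k = factorize(x)        # only the factors would be cached
K = reconstruct(a_k, b_k)      # full keys materialized at attention time

# Floats cached per token: factored vs. full representation.
cached_floats = rank * (n_heads + d_head)  # 2 * (4 + 16) = 40
full_floats = n_heads * d_head             # 4 * 16 = 64
print(K.shape, cached_floats, full_floats)
```

With these toy sizes the factored cache stores 40 floats per token instead of 64; the savings grow with head count and head dimension, which is the mechanism behind the longer-context claim in the abstract.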