FlashMLA: A Blazing-Fast MLA Decoding Kernel for Hopper GPUs
2025-02-24
FlashMLA is a highly efficient MLA (Multi-head Latent Attention) decoding kernel optimized for Hopper GPUs and designed for serving variable-length sequences. On an H800 SXM5 with CUDA 12.6, it achieves up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations. FlashMLA uses BF16 precision and a paged KV cache with a block size of 64. Inspired by the FlashAttention 2&3 and CUTLASS projects, it delivers significant performance gains for large-scale sequence serving.
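To illustrate what a paged KV cache with block size 64 means in practice, here is a minimal sketch of the address translation it implies: a sequence's tokens are split into fixed-size logical blocks, and a per-sequence block table maps each logical block to a physical block in the cache pool. The helper below is hypothetical, for illustration only; it is not part of FlashMLA's API.

```python
BLOCK_SIZE = 64  # FlashMLA's paged KV cache block size

def locate_token(block_table, pos):
    """Map a token position within a sequence to its (physical block, offset)
    in a paged KV cache. Hypothetical helper, not FlashMLA's actual API."""
    return block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

# A sequence of 130 tokens spans ceil(130 / 64) = 3 logical blocks,
# which the block table may map to arbitrary physical blocks:
block_table = [7, 2, 9]
print(locate_token(block_table, 0))    # first token -> block 7, offset 0
print(locate_token(block_table, 64))   # token 64    -> block 2, offset 0
print(locate_token(block_table, 129))  # last token  -> block 9, offset 1
```

Because physical blocks need not be contiguous, sequences of very different lengths can share one cache pool without fragmentation, which is what makes this layout a good fit for variable-length serving.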
Development
MLA decoding