FlashMLA: A Blazing-Fast MLA Decoding Kernel for Hopper GPUs
2025-02-24
FlashMLA is a highly efficient MLA (Multi-head Latent Attention) decoding kernel optimized for Hopper GPUs and designed for serving variable-length sequences. On an H800 SXM5 with CUDA 12.6, it achieves up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations. FlashMLA uses BF16 precision and a paged KV cache with a block size of 64. Inspired by the FlashAttention 2&3 and CUTLASS projects, it delivers significant performance gains for large-scale sequence serving.
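To illustrate what a paged KV cache with block size 64 means in practice, here is a minimal sketch of the address translation it implies: a sequence's tokens are split into fixed-size logical blocks, and a per-sequence block table maps each logical block to a physical block in the cache pool. The helper below is hypothetical, for illustration only; it is not part of FlashMLA's API.

```python
BLOCK_SIZE = 64  # FlashMLA's paged KV cache block size

def locate_token(block_table, pos):
    """Map a token position within a sequence to its (physical block, offset)
    in a paged KV cache. Hypothetical helper, not FlashMLA's actual API."""
    return block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

# A sequence of 130 tokens spans ceil(130 / 64) = 3 logical blocks,
# which the block table may map to arbitrary physical blocks:
block_table = [7, 2, 9]
print(locate_token(block_table, 0))    # first token -> block 7, offset 0
print(locate_token(block_table, 64))   # token 64    -> block 2, offset 0
print(locate_token(block_table, 129))  # last token  -> block 9, offset 1
```

Because physical blocks need not be contiguous, sequences of very different lengths can share one cache pool without fragmentation, which is what makes this layout a good fit for variable-length serving.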
Development
MLA decoding