Adventures in Implementing Flash Attention in CUDA C++

2025-08-23

This post details the author's journey in implementing and optimizing Flash Attention in CUDA C++. Starting from a basic implementation, the author progressively refines the kernel with techniques such as shared memory swizzling, two-stage pipelining, and more efficient ldmatrix usage. After several rounds of profiling and optimization, the final implementation reaches performance close to the hardware's theoretical limit. The post also delves into the intricacies of implementing online softmax and of resolving shared memory bank conflicts, offering valuable insights for CUDA C++ developers.
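For readers unfamiliar with online softmax, here is a minimal CPU-side sketch of the recurrence it relies on: a running maximum m and a running normalizer d are updated in a single pass, with d rescaled by exp(m_old - m_new) whenever the maximum grows. Flash Attention applies this same recurrence tile by tile inside the kernel; the function name `online_softmax` below is my own and is only an illustration, not the post's actual kernel code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Single-pass (online) softmax: track a running max m and a running sum d
// of exp(x_i - m). When m increases, rescale d so previously accumulated
// terms remain expressed relative to the new maximum.
void online_softmax(const std::vector<float>& x, std::vector<float>& out) {
    float m = -INFINITY;  // running maximum
    float d = 0.0f;       // running sum of exp(x_i - m)
    for (float v : x) {
        float m_new = std::max(m, v);
        // exp(m - m_new) rescales the old sum; it is 1 when m is unchanged.
        d = d * std::exp(m - m_new) + std::exp(v - m_new);
        m = m_new;
    }
    out.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        out[i] = std::exp(x[i] - m) / d;  // final normalization
}

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f}, y;
    online_softmax(x, y);
    for (float v : y) std::printf("%f\n", v);  // matches two-pass softmax
    return 0;
}
```

The point of the rescaling step is that the input never has to be scanned twice: the same update works whether elements arrive one at a time, as here, or one tile of attention scores at a time, as in the kernel the post develops.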

Development