Optimizing a Matrix Multiply Kernel in CUDA with Tensor Cores
2025-04-19
This post details the author's journey to write an optimized matrix multiplication kernel in CUDA using Tensor Cores on an NVIDIA Tesla T4 GPU. The goal was to compute D = α * A * B + β * C as fast as possible. Through iterative optimization of six kernels, the author achieved performance comparable to NVIDIA's cuBLAS hgemm, using techniques such as hierarchical tiling, exploitation of the GPU memory hierarchy, data reuse, overlapping computation with data movement, and efficient Tensor Core usage. The post shares insights gained from profiling and optimization, emphasizing the importance of arithmetic intensity and memory bandwidth.
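To make the target operation concrete, here is a minimal sketch of a WMMA-based kernel in which each warp computes one 16x16 tile of D = α * A * B + β * C. The kernel name, launch assumptions, and row-major half-precision layout are illustrative choices of mine, not one of the six kernels described in the post.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Each warp computes one 16x16 tile of D = alpha*A*B + beta*C.
// A (MxK), B (KxN), C and D (MxN) are row-major half precision,
// with M, N, K assumed to be multiples of 16. Requires sm_70+
// (the Tesla T4 is sm_75). Illustrative sketch only.
__global__ void wmma_hgemm_naive(const half *A, const half *B,
                                 const half *C, half *D,
                                 int M, int N, int K,
                                 float alpha, float beta) {
    // Which 16x16 output tile this warp owns (assumes blockDim.x is a
    // multiple of 32, so warpM and warpN are uniform within a warp).
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    int row = warpM * 16;
    int col = warpN * 16;
    if (row >= M || col >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc_frag;
    wmma::fill_fragment(acc_frag, __float2half(0.0f));

    // March along K, issuing one 16x16x16 Tensor Core MMA per step.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + row * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + col, N);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }

    // Blend in C and write the tile: D = alpha*(A*B) + beta*C.
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    wmma::load_matrix_sync(c_frag, C + row * N + col, N, wmma::mem_row_major);
    for (int i = 0; i < acc_frag.num_elements; ++i) {
        float ab = __half2float(acc_frag.x[i]);
        float c  = __half2float(c_frag.x[i]);
        acc_frag.x[i] = __float2half(alpha * ab + beta * c);
    }
    wmma::store_matrix_sync(D + row * N + col, acc_frag, N, wmma::mem_row_major);
}
```

With a block size such as dim3(128, 4), each block holds 16 warps and covers a 64x64 region of D, so a grid of dim3((M + 63) / 64, (N + 63) / 64) covers the output; this configuration is an assumption for the sketch, and the optimized kernels in the post go well beyond it.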
Development
Tensor Cores