Beyond cuBLAS and CUTLASS: A Novel Matrix Multiplication Kernel Engine

2025-07-19

Matrix multiplication is central to modern computing, especially in AI, where its speed directly limits what models can do. Hardware accelerators such as NVIDIA's Tensor Cores are efficient but inflexible. This paper introduces CubeCL, a new engine that generates optimized matrix multiplication kernels across platforms. CubeCL organizes a kernel as a hierarchy of abstractions (Tile, Stage, Global, and Batch Matmul) and offers several algorithms built on it (Simple, Double Buffering, Ordered, and others). It relies on GPU architectural features such as plane-synchronous execution (a plane being CubeCL's term for a warp or subgroup) and coalesced memory access, and it uses double buffering to hide memory latency by loading the next block of data while the current one is being computed. Benchmarks show significant performance improvements on NVIDIA, AMD, and Apple Silicon GPUs, in some cases surpassing cuBLAS and CUTLASS.
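To make the hierarchy and the double-buffering idea concrete, here is a minimal sketch in plain Python. It is not CubeCL's API or its generated code: the function names, `TILE`, and `STAGE_K` are hypothetical, list copies stand in for global-to-shared-memory loads, and the "prefetch" runs sequentially rather than overlapping with compute as it would on a GPU.

```python
# Illustrative sketch only: plain Python, not CubeCL's API.
TILE = 2      # hypothetical tile edge length
STAGE_K = 4   # hypothetical depth of K loaded per stage


def tile_matmul(a_tile, b_tile, acc):
    """Tile level: multiply-accumulate one small block into acc (in place)."""
    for i in range(len(a_tile)):
        for j in range(len(b_tile[0])):
            for k in range(len(b_tile)):
                acc[i][j] += a_tile[i][k] * b_tile[k][j]


def load_stage(a, b, k0, k1):
    """Copy a K-slice of A and B, standing in for a load into shared memory."""
    return [row[k0:k1] for row in a], [row[:] for row in b[k0:k1]]


def stage_matmul(a_stage, b_stage, c):
    """Stage level: split the staged slices into tiles and run tile matmuls."""
    m, n, kk = len(a_stage), len(b_stage[0]), len(b_stage)
    for i0 in range(0, m, TILE):
        for j0 in range(0, n, TILE):
            acc = [row[j0:j0 + TILE] for row in c[i0:i0 + TILE]]
            for k0 in range(0, kk, TILE):
                a_tile = [row[k0:k0 + TILE] for row in a_stage[i0:i0 + TILE]]
                b_tile = [row[j0:j0 + TILE] for row in b_stage[k0:k0 + TILE]]
                tile_matmul(a_tile, b_tile, acc)
            for di, row in enumerate(acc):
                c[i0 + di][j0:j0 + TILE] = row  # write the tile back


def global_matmul(a, b):
    """Global level: walk K in stages, alternating between two buffers."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    starts = list(range(0, k, STAGE_K))
    if not starts:
        return c
    buffers = [None, None]
    buffers[0] = load_stage(a, b, starts[0], min(starts[0] + STAGE_K, k))
    for step in range(len(starts)):
        cur = step % 2
        if step + 1 < len(starts):
            # Fill the other buffer with the next stage before computing.
            # On a GPU this load would overlap with the compute below.
            nxt = starts[step + 1]
            buffers[1 - cur] = load_stage(a, b, nxt, min(nxt + STAGE_K, k))
        stage_matmul(buffers[cur][0], buffers[cur][1], c)
    return c


def batch_matmul(batch_a, batch_b):
    """Batch level: one independent global matmul per pair of matrices."""
    return [global_matmul(a, b) for a, b in zip(batch_a, batch_b)]


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(batch_matmul([a], [b]))  # [[[58, 64], [139, 154]]]
```

The two-buffer loop is the part that corresponds to double buffering: while one buffer is consumed by `stage_matmul`, the other already holds the next K-slice, so on real hardware the compute units need not wait for memory.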
