Beyond cuBLAS and CUTLASS: A Novel Matrix Multiplication Kernel Engine

2025-07-19

Matrix multiplication is central to modern computing, especially in AI, where its speed directly limits how large and how fast models can be. Hardware accelerators such as NVIDIA's Tensor Cores are efficient but inflexible. The article introduces CubeCL, a new engine that generates optimized matrix multiplication kernels across platforms. CubeCL structures each kernel as a hierarchy of abstractions (Tile, Stage, Global, and Batch Matmul) and offers several algorithms (Simple, Double Buffering, Ordered, etc.) for combining them. It exploits GPU architectural features such as plane-synchronous execution and coalesced memory access, and uses techniques such as double buffering to hide memory latency. Benchmarks on NVIDIA, AMD, and Apple Silicon GPUs show significant performance improvements, in some cases surpassing cuBLAS and CUTLASS.
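To make the hierarchy and the double-buffering idea concrete, here is a minimal CPU sketch in plain Python. It is not CubeCL code and does not use CubeCL's API: the function names (tile_matmul, load_stage, global_matmul, batch_matmul), the tile size, and the zero-padding at matrix edges are all assumptions chosen for illustration. On a GPU the load of the next stage would overlap with the computation on the current one; here the two simply run one after the other, so only the buffer-swapping pattern is shown.

```python
# Illustrative CPU sketch only. All names are hypothetical, not CubeCL's API.

def tile_matmul(a_tile, b_tile, acc):
    """Tile level: accumulate a_tile @ b_tile into acc, in place."""
    for i in range(len(a_tile)):
        for j in range(len(b_tile[0])):
            s = acc[i][j]
            for k in range(len(b_tile)):
                s += a_tile[i][k] * b_tile[k][j]
            acc[i][j] = s


def load_stage(m, row0, col0, rows, cols):
    """Stage level: copy a rows x cols block out of 'global memory'.

    Elements outside the matrix are filled with zeros (an assumption made
    here so that edge tiles need no special casing).
    """
    return [
        [m[r][c] if r < len(m) and c < len(m[0]) else 0
         for c in range(col0, col0 + cols)]
        for r in range(row0, row0 + rows)
    ]


def global_matmul(a, b, tile=2):
    """Global level: walk output tiles, then the shared dimension k."""
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0] * n for _ in range(m)]
    k_steps = (k + tile - 1) // tile
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            acc = [[0] * tile for _ in range(tile)]
            # Two stage buffers: one is consumed while the other is filled.
            buffers = [None, None]
            buffers[0] = (load_stage(a, i0, 0, tile, tile),
                          load_stage(b, 0, j0, tile, tile))
            for step in range(k_steps):
                cur, nxt = step % 2, (step + 1) % 2
                if step + 1 < k_steps:
                    # Prefetch the next stage into the idle buffer.
                    k0 = (step + 1) * tile
                    buffers[nxt] = (load_stage(a, i0, k0, tile, tile),
                                    load_stage(b, k0, j0, tile, tile))
                a_tile, b_tile = buffers[cur]
                tile_matmul(a_tile, b_tile, acc)
            # Write the accumulator back, skipping the padded region.
            for i in range(tile):
                for j in range(tile):
                    if i0 + i < m and j0 + j < n:
                        c[i0 + i][j0 + j] = acc[i][j]
    return c


def batch_matmul(a_batch, b_batch, tile=2):
    """Batch level: one independent global matmul per pair of matrices."""
    return [global_matmul(a, b, tile) for a, b in zip(a_batch, b_batch)]
```

Calling global_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]) walks the shared dimension in two steps with a tile size of 2, alternating between the two buffers, and returns the ordinary matrix product.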

(burn.dev)

Development
