Outperforming cuBLAS: A CUDA Implementation of Single-Precision General Matrix Multiplication

2025-01-18

This article presents a CUDA implementation of single-precision general matrix multiplication (SGEMM) that outperforms cuBLAS in certain scenarios. The author combines PTX instructions, asynchronous memory copies, double buffering, and other optimization techniques, and tunes the kernel specifically for an NVIDIA RTX 3090. The article details the algorithm design, the optimizations applied, and the benchmarking methodology, making it a useful reference for CUDA learners.
