DeepGEMM: Clean and Efficient FP8 GEMM Kernels with Fine-Grained Scaling
2025-02-26
DeepGEMM is a library for clean and efficient FP8 General Matrix Multiplications (GEMMs) on NVIDIA Hopper Tensor Cores, featuring fine-grained scaling as proposed in DeepSeek-V3. It supports both normal and Mixture-of-Experts (MoE) grouped GEMMs, and it uses a lightweight Just-In-Time (JIT) compiler, so no compilation is needed during installation. It addresses the imprecision of FP8 Tensor Core accumulation with two-level accumulation (promotion) on CUDA cores. Despite its concise design (~300 lines of core code), DeepGEMM's performance matches or surpasses expert-tuned libraries across various matrix shapes.
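The idea behind two-level accumulation can be illustrated with a small pure-Python sketch. This is a toy model and not DeepGEMM's kernel code: the helper names (`round_to_bits`, `dot_single_level`, `dot_two_level`), the 10-bit accumulator width, and the promotion interval of 128 are all assumptions chosen for illustration, and a Python float stands in for the high-precision accumulator.

```python
import math


def round_to_bits(x: float, mantissa_bits: int) -> float:
    """Round x to `mantissa_bits` bits of mantissa (toy limited-precision model)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)           # x == m * 2**e with 0.5 <= |m| < 1
    scale = 1 << mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)


def dot_single_level(a, b, mantissa_bits=10):
    """Accumulate every product in one limited-precision accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to_bits(acc + x * y, mantissa_bits)
    return acc


def dot_two_level(a, b, mantissa_bits=10, interval=128):
    """Accumulate `interval` products at limited precision, then promote
    each partial sum into a full-precision accumulator."""
    total = 0.0                    # Python float stands in for the high-precision accumulator
    for start in range(0, len(a), interval):
        partial = dot_single_level(
            a[start:start + interval], b[start:start + interval], mantissa_bits
        )
        total += partial           # the "promotion" step
    return total


k = 4096
ones = [1.0] * k
low = dot_single_level(ones, ones)    # 1024.0
promoted = dot_two_level(ones, ones)  # 4096.0
```

In this toy setting the single-level accumulator stalls at 1024, because each further increment of 1 is rounded away, while the promoted version recovers the exact sum of 4096. Real hardware behaves differently in detail, but the sketch shows why periodically moving partial sums into a wider accumulator limits error growth along the reduction dimension.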
Development