DeepGEMM: Clean and Efficient FP8 GEMM Kernels with Fine-Grained Scaling

2025-02-26

DeepGEMM is a library for clean and efficient FP8 General Matrix Multiplications (GEMMs) on NVIDIA Hopper Tensor Cores, using the fine-grained scaling scheme proposed in DeepSeek-V3. It supports both normal and Mixture-of-Experts (MoE) grouped GEMMs. Kernels are compiled at runtime by a lightweight Just-In-Time (JIT) compiler, so nothing needs to be compiled during installation. To compensate for the imprecise accumulation of FP8 Tensor Cores, it uses two-level accumulation (promotion) on CUDA cores. Despite its concise design (about 300 lines of core code), DeepGEMM matches or surpasses expert-tuned libraries across various matrix shapes.

(github.com)
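
To make "fine-grained scaling" and "two-level accumulation" more concrete, here is a minimal pure-Python sketch of the idea on a single dot product. It is not DeepGEMM's code or API: the function names (`round_to_e4m3`, `quantize_block`, `scaled_dot`) are hypothetical, the FP8 rounding is a crude emulation that ignores subnormals and NaN, and the block size of 4 is chosen only to keep the example small. The real library does this inside a CUDA kernel on Hopper hardware.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Crudely emulate FP8 E4M3 rounding (assumption: no subnormals, no NaN)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16      # keep 4 significant bits (1 implicit + 3 stored)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, math.ldexp(m, e)))


def quantize_block(values):
    """Quantize one block with its own scale (the 'fine-grained' part)."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def scaled_dot(a, b, block=4):
    """Dot product with per-block scales and two-level accumulation."""
    acc = 0.0  # high-precision accumulator, standing in for FP32 on CUDA cores
    for k0 in range(0, len(a), block):
        qa, sa = quantize_block(a[k0:k0 + block])
        qb, sb = quantize_block(b[k0:k0 + block])
        # First level: low-precision partial sum over one block
        # (standing in for the Tensor Core accumulation).
        partial = sum(x * y for x, y in zip(qa, qb))
        # Second level (promotion): apply both scales and add the result
        # to the high-precision accumulator.
        acc += partial * sa * sb
    return acc


if __name__ == "__main__":
    a = [0.5, -1.25, 3.0, 0.01, 200.0, -7.5, 0.0, 1.0]
    b = [2.0, 0.75, -0.5, 4.0, 0.001, 1.5, 9.0, -3.0]
    exact = sum(x * y for x, y in zip(a, b))
    print("exact:", exact, "emulated FP8:", scaled_dot(a, b))
```

Because each block carries its own scale, the large value 200.0 in the second block does not affect how the first block is quantized. With a single scale for the whole vector, one outlier would set the quantization range for every element.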
