DeepEP: A High-Performance Communication Library for Mixture-of-Experts

2025-02-25

DeepEP is a communication library tailored for Mixture-of-Experts (MoE) models and expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels, also known as MoE dispatch and combine, and supports low-precision operations, including FP8.

To align with the group-limited gating algorithm used in DeepSeek-V3, DeepEP offers kernels optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them suitable for training and inference prefilling, and allow control over the number of SMs (Streaming Multiprocessors) used.

For latency-sensitive inference decoding, DeepEP includes low-latency kernels that use pure RDMA to minimize delays. It also introduces a hook-based communication-computation overlap method that consumes no SM resources.

The library has been tested with InfiniBand and is theoretically compatible with RoCE (RDMA over Converged Ethernet).
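To make the dispatch/combine terminology concrete, here is a minimal, single-process sketch of what those two phases compute logically. This is not DeepEP's API: the function names, top-k routing representation, and the toy `expert_fn` are all hypothetical, and real dispatch/combine kernels move tokens across GPUs over NVLink/RDMA rather than through Python lists.

```python
# Conceptual sketch (hypothetical names, NOT DeepEP's API):
# "dispatch" routes each token to its top-k experts;
# "combine" gathers expert outputs back per token, weighted by gating scores.

def dispatch(topk_ids, num_experts):
    """Group token indices by destination expert (the all-to-all send pattern)."""
    buckets = [[] for _ in range(num_experts)]
    for tok_idx, expert_ids in enumerate(topk_ids):
        for e in expert_ids:
            buckets[e].append(tok_idx)
    return buckets

def combine(tokens, topk_ids, topk_weights, expert_fn, num_experts):
    """Apply each expert to its bucket, then sum weighted outputs per token."""
    buckets = dispatch(topk_ids, num_experts)
    out = [0.0 for _ in tokens]
    for e, tok_idxs in enumerate(buckets):
        for i in tok_idxs:
            k = topk_ids[i].index(e)  # position of expert e in token i's top-k
            out[i] += topk_weights[i][k] * expert_fn(e, tokens[i])
    return out

# Toy example: 3 scalar tokens, 4 experts, top-2 routing.
tokens = [1.0, 2.0, 3.0]
topk_ids = [[0, 2], [1, 2], [0, 3]]
topk_weights = [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]]
expert_fn = lambda e, x: (e + 1) * x  # hypothetical expert: scale by e + 1
print([round(v, 6) for v in combine(tokens, topk_ids, topk_weights,
                                    expert_fn, num_experts=4)])
```

In an EP deployment, the experts in each bucket live on different GPUs, so the dispatch and combine steps become all-to-all communication, which is exactly the traffic DeepEP's kernels accelerate.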
