SGLang: An Open-Source Implementation Matching DeepSeek LLM's Inference Performance

2025-08-29

DeepSeek, a popular open-source large language model (LLM), boasts impressive performance. However, its massive size and unique architecture, which uses Multi-head Latent Attention (MLA) and Mixture of Experts (MoE), demand a sophisticated system for efficient large-scale serving. This blog details how we achieved near-parity with the performance of DeepSeek's official inference system using SGLang. Our implementation, running on 12 nodes (each with 8 H100 GPUs) in the Atlas Cloud, leverages prefill-decode disaggregation and large-scale expert parallelism (EP), reaching 52.3k input tokens/second and 22.3k output tokens/second per node for 2000-token input sequences. This is, to our knowledge, the first open-source implementation to nearly match DeepSeek's reported throughput at scale, at roughly one-fifth the cost of the official DeepSeek Chat API.
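To put the per-node figures in perspective, a quick back-of-the-envelope calculation gives the aggregate throughput of the 12-node cluster. This is a sketch that assumes the per-node numbers above scale linearly across all nodes (i.e., they are already per-node averages of the cluster run):

```python
# Aggregate cluster throughput from the per-node figures quoted above.
# Assumption: per-node throughput is an average over the 12-node run,
# so multiplying by the node count recovers the cluster total.
NODES = 12
INPUT_TOK_PER_S_PER_NODE = 52_300   # 52.3k input tokens/s per node
OUTPUT_TOK_PER_S_PER_NODE = 22_300  # 22.3k output tokens/s per node

total_input = NODES * INPUT_TOK_PER_S_PER_NODE    # cluster-wide prefill throughput
total_output = NODES * OUTPUT_TOK_PER_S_PER_NODE  # cluster-wide decode throughput

print(f"Cluster input:  {total_input:,} tokens/s")
print(f"Cluster output: {total_output:,} tokens/s")
```

Under that assumption, the cluster sustains roughly 627.6k input tokens/second and 267.6k output tokens/second in aggregate.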
