Why are some LLMs fast in the cloud, but slow locally?

2025-06-01

This article explores why large language models (LLMs), and especially Mixture-of-Experts (MoE) models like DeepSeek-V3, are fast and cheap to serve at scale in the cloud yet slow and expensive to run locally. The key is batch inference: GPUs excel at large matrix multiplications, so batching many user requests into a single pass raises throughput substantially, at the cost of higher per-request latency. MoE models and models with many layers depend on batching in particular; without it, experts sit idle and pipelined layers develop bubbles. Cloud providers tune the throughput-latency trade-off by adjusting the collection window, the time the server waits to gather incoming requests into a batch, whereas a local run typically serves a single request and leaves the GPU mostly idle. The efficiency of OpenAI's services might stem from superior model architecture, clever inference tricks, or simply far more powerful GPUs.
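To make the throughput argument concrete, here is a minimal sketch (not from the article) that times a single dense matrix multiply, a stand-in for one transformer layer, at several batch sizes. It assumes PyTorch is installed and falls back to CPU if no GPU is present; the model dimension, batch sizes, and iteration counts are arbitrary placeholders, not the article's numbers.

```python
# Illustrative sketch: per-token throughput of one dense matmul at various
# batch sizes. On a GPU, larger batches usually yield far more tokens/s
# because the hardware is underutilized at batch size 1.
import time
import torch

def tokens_per_second(batch_size: int, d_model: int = 4096, iters: int = 50) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # One layer's weight matrix; every token must be multiplied by it.
    weight = torch.randn(d_model, d_model, device=device)
    x = torch.randn(batch_size, d_model, device=device)

    # Warm-up so one-time initialization doesn't skew the timing.
    for _ in range(5):
        _ = x @ weight
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        _ = x @ weight
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return batch_size * iters / elapsed

if __name__ == "__main__":
    for b in (1, 8, 32, 128):
        print(f"batch={b:4d}  ~{tokens_per_second(b):,.0f} tokens/s")
```

If tokens/s grows much faster than linearly in batch size on your hardware, that gap is exactly what a cloud provider recovers by widening the collection window, and what a single local request leaves on the table.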