Megakernels: Smashing LLM Inference Latency

2025-05-28

To cut the latency of large language models (LLMs) in interactive applications like chatbots, researchers developed a "megakernel" technique: the entire forward pass of a Llama-1B model is fused into a single GPU kernel, eliminating the kernel-boundary overhead and memory-pipeline stalls inherent in traditional multi-kernel execution. On H100 and B200 GPUs, the megakernel outperforms existing inference systems by over 1.5x, yielding drastically lower per-token latency.
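To make the core idea concrete, here is a minimal, hypothetical CUDA sketch of the pattern a megakernel relies on: instead of launching one kernel per operation, a single persistent kernel interprets a small instruction stream, using a grid-wide barrier where a kernel boundary would otherwise be. The names (`Op`, `Instr`, `megakernel`) and the toy elementwise operations are illustrative only; the real system fuses far more complex instructions (matmuls, attention, RMS norms) and overlaps their memory pipelines.

```cuda
// Hypothetical sketch: one persistent kernel interprets an instruction
// stream, replacing per-op kernel launches with in-kernel grid syncs.
// Compile with: nvcc -rdc=true -arch=<your arch> megakernel_sketch.cu
#include <cstdio>
#include <cstring>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

enum class Op : int { Scale, AddBias, Relu };   // toy "instruction set"
struct Instr { Op op; float arg; };

__global__ void megakernel(float* data, int n, const Instr* prog, int n_instr) {
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < n_instr; ++i) {
        Instr ins = prog[i];
        for (int j = idx; j < n; j += gridDim.x * blockDim.x) {
            float v = data[j];
            switch (ins.op) {
                case Op::Scale:   v *= ins.arg;             break;
                case Op::AddBias: v += ins.arg;             break;
                case Op::Relu:    v = v > 0.f ? v : 0.f;    break;
            }
            data[j] = v;
        }
        // A grid-wide barrier stands in for the implicit synchronization at
        // a kernel boundary, without paying a fresh launch for every op.
        grid.sync();
    }
}

int main() {
    const int n = 1 << 20;
    float* data; cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = (i % 7) - 3.f;

    Instr host_prog[] = {{Op::Scale, 2.f}, {Op::AddBias, 1.f}, {Op::Relu, 0.f}};
    Instr* prog; cudaMallocManaged(&prog, sizeof(host_prog));
    std::memcpy(prog, host_prog, sizeof(host_prog));

    // Size the grid so persistent blocks exactly fill the GPU.
    int n_instr = 3, threads = 256, blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, megakernel,
                                                  threads, 0);
    cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0);
    int blocks = blocks_per_sm * prop.multiProcessorCount;

    // Cooperative launch is required for grid.sync() to be valid.
    void* args[] = {&data, &n, &prog, &n_instr};
    cudaLaunchCooperativeKernel((void*)megakernel, dim3(blocks), dim3(threads),
                                args, 0, 0);
    cudaDeviceSynchronize();
    printf("data[0..2] = %f %f %f\n", data[0], data[1], data[2]);
    cudaFree(data); cudaFree(prog);
    return 0;
}
```

Even in this toy form, the payoff is visible: the three operations run with a single launch and no return trips to the host, so the fixed cost of each kernel launch and the idle gaps between kernels disappear from the critical path.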