Conquering Nondeterminism in LLM Inference

The irreproducibility of large language model (LLM) inference results is a persistent problem. This post traces it to its root cause: not simply the combination of floating-point non-associativity and concurrent execution, but the lack of "batch invariance" in kernel implementations. Even when each individual kernel is run-to-run deterministic, the batch size a request is grouped into varies with server load, and a batch-variant kernel then produces different results for the same input. The authors analyze the challenges of achieving batch invariance in RMSNorm, matrix multiplication, and attention, and propose eliminating the nondeterminism by making these kernel implementations batch-invariant. The result is fully reproducible LLM inference, with benefits for reinforcement learning training.
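
To see why batch invariance matters, here is a minimal sketch (assuming PyTorch on a CUDA GPU; the shapes and the size of the discrepancy are illustrative and vary by hardware and library version). It first shows that floating-point addition is non-associative, then that a stock matmul kernel need not be batch-invariant: the same input row can yield a different result depending on how many other rows share its batch.

```python
import torch

# Floating-point addition is not associative: reordering a
# reduction can change the result.
print((0.1 + 1e20) - 1e20)  # 0.0
print(0.1 + (1e20 - 1e20))  # 0.1

torch.manual_seed(0)
# Same weight matrix, same first input row -- only the batch size differs.
x = torch.randn(2048, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

out_batched = torch.mm(x, w)[:1]  # row 0, computed inside a 2048-row batch
out_single = torch.mm(x[:1], w)   # row 0, computed as a batch of one

# Bitwise-identical inputs, yet the outputs can differ, because the
# kernel may tile and reduce differently at different batch sizes.
print((out_batched - out_single).abs().max())
```

In a serving system, the batch size is determined by how many concurrent requests happen to arrive, so a batch-variant kernel makes one user's output depend on other users' traffic; this is the nondeterminism the post sets out to eliminate.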