Tokasaurus: A New LLM Inference Engine for High Throughput
2025-06-05

Stanford researchers released Tokasaurus, an LLM inference engine optimized for throughput-intensive workloads. For smaller models, Tokasaurus relies on very low CPU overhead and dynamic Hydragen grouping to exploit prefixes shared across requests. For larger models, it supports async tensor parallelism on NVLink-equipped GPUs and a fast pipeline-parallelism implementation for GPUs without NVLink. On throughput benchmarks, Tokasaurus outperforms vLLM and SGLang by up to 3x.
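The shared-prefix idea behind Hydragen rests on a general property: softmax attention over a sequence can be computed in chunks (e.g. a shared prefix and a per-request suffix) and the partial results merged exactly, so the prefix portion can be computed once and batched across requests. The following NumPy sketch (not Tokasaurus code; all names here are illustrative) verifies the merge math for a single query:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
prefix_len, suffix_len = 8, 4
q = rng.standard_normal((1, d))
k = rng.standard_normal((prefix_len + suffix_len, d))
v = rng.standard_normal((prefix_len + suffix_len, d))

def softmax_attn(q, k, v):
    # Reference: full softmax attention over all keys/values.
    s = (q @ k.T) / np.sqrt(d)
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

def partial_attn(q, k, v):
    # Un-normalized chunk output plus softmax stats (running max,
    # exp-sum) so chunks can be merged exactly afterwards.
    s = (q @ k.T) / np.sqrt(d)
    m = s.max(-1, keepdims=True)
    e = np.exp(s - m)
    return e @ v, m, e.sum(-1, keepdims=True)

# Attend to the shared prefix and the per-request suffix separately.
o1, m1, s1 = partial_attn(q, k[:prefix_len], v[:prefix_len])
o2, m2, s2 = partial_attn(q, k[prefix_len:], v[prefix_len:])

# Merge the two chunks with a numerically stable softmax rescaling.
m = np.maximum(m1, m2)
num = o1 * np.exp(m1 - m) + o2 * np.exp(m2 - m)
den = s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)
combined = num / den

assert np.allclose(combined, softmax_attn(q, k, v))
```

Because the prefix chunk's keys and values are identical for every request sharing that prefix, its attention can be computed once as a dense batched matmul, which is where the throughput win comes from.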