Tokasaurus: A New LLM Inference Engine for High Throughput
2025-06-05

Stanford researchers released Tokasaurus, an LLM inference engine optimized for throughput-intensive workloads. For smaller models, Tokasaurus relies on very low CPU overhead and dynamic Hydragen grouping to exploit prefixes shared across requests. For larger models, it supports async tensor parallelism on NVLink-equipped GPUs and a fast pipeline-parallelism implementation for GPUs without NVLink. On throughput benchmarks, Tokasaurus outperforms vLLM and SGLang by up to 3x.
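The shared-prefix idea behind Hydragen rests on a general property: softmax attention over a sequence can be computed in chunks (e.g. a shared prefix and a per-request suffix) and the partial results merged exactly, so the prefix portion can be computed once and batched across requests. The following NumPy sketch (not Tokasaurus code; all names here are illustrative) verifies the merge math for a single query:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
prefix_len, suffix_len = 8, 4
q = rng.standard_normal((1, d))
k = rng.standard_normal((prefix_len + suffix_len, d))
v = rng.standard_normal((prefix_len + suffix_len, d))

def softmax_attn(q, k, v):
    # Reference: full softmax attention over all keys/values.
    s = (q @ k.T) / np.sqrt(d)
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

def partial_attn(q, k, v):
    # Un-normalized chunk output plus softmax stats (running max,
    # exp-sum) so chunks can be merged exactly afterwards.
    s = (q @ k.T) / np.sqrt(d)
    m = s.max(-1, keepdims=True)
    e = np.exp(s - m)
    return e @ v, m, e.sum(-1, keepdims=True)

# Attend to the shared prefix and the per-request suffix separately.
o1, m1, s1 = partial_attn(q, k[:prefix_len], v[:prefix_len])
o2, m2, s2 = partial_attn(q, k[prefix_len:], v[prefix_len:])

# Merge the two chunks with a numerically stable softmax rescaling.
m = np.maximum(m1, m2)
num = o1 * np.exp(m1 - m) + o2 * np.exp(m2 - m)
den = s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)
combined = num / den

assert np.allclose(combined, softmax_attn(q, k, v))
```

Because the prefix chunk's keys and values are identical for every request sharing that prefix, its attention can be computed once as a dense batched matmul, which is where the throughput win comes from.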