NVIDIA Dynamo: A High-Throughput, Low-Latency Inference Framework for Generative AI
2025-03-18
NVIDIA introduces Dynamo, a high-throughput, low-latency inference framework for serving generative AI and reasoning models in multi-node distributed environments. Dynamo is inference-engine agnostic, supporting TRT-LLM, vLLM, SGLang, and others. To maximize GPU throughput and minimize latency, it combines disaggregated prefill and decode, dynamic GPU scheduling, LLM-aware request routing, accelerated data transfer, and KV cache offloading. Dynamo is built in Rust for performance and Python for extensibility, and is fully open source.
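To give a feel for what "LLM-aware request routing" means in practice, here is a minimal sketch of a KV-cache-aware router. This is an illustrative toy, not Dynamo's actual API: the class names, block size, and scoring rule are all assumptions. The idea it demonstrates is the general one: track which prompt-prefix blocks each worker already holds in its KV cache, and send each request to the worker with the longest cached prefix, falling back to the least-loaded worker.

```python
# Hypothetical sketch of KV-cache-aware routing (NOT the Dynamo API).
# A router tracks which prompt-prefix blocks each worker has cached and
# routes each request to the worker with the longest cached prefix,
# breaking ties by current load.

from dataclasses import dataclass, field

BLOCK = 16  # tokens per KV cache block (hypothetical size)

def blocks(tokens):
    """Split a token sequence into hashable fixed-size block keys."""
    usable = len(tokens) - len(tokens) % BLOCK
    return [tuple(tokens[i:i + BLOCK]) for i in range(0, usable, BLOCK)]

@dataclass
class Worker:
    name: str
    cached: set = field(default_factory=set)  # block keys resident in KV cache
    load: int = 0                             # in-flight requests

class KVAwareRouter:
    def __init__(self, workers):
        self.workers = workers

    def route(self, tokens):
        keys = blocks(tokens)

        def score(w):
            # Count the longest contiguous run of cached prefix blocks;
            # prefer less-loaded workers on a tie.
            hit = 0
            for k in keys:
                if k not in w.cached:
                    break
                hit += 1
            return (hit, -w.load)

        best = max(self.workers, key=score)
        best.cached.update(keys)  # prefill will materialize these blocks
        best.load += 1
        return best.name
```

For example, two requests sharing a long system prompt would land on the same worker and reuse its prefill work, while an unrelated request would go to the idler worker. A production router (as the feature list above implies) must also handle cache eviction, load rebalancing, and coordination with disaggregated prefill/decode workers.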