llm-d: A Kubernetes-Native Distributed LLM Inference Framework
llm-d is a Kubernetes-native, high-performance distributed Large Language Model (LLM) inference framework offering a streamlined path to serving LLMs at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators. It combines cutting-edge distributed inference optimizations, such as KV-cache-aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW), so users can operationalize generative AI deployments with a modular, high-performance, end-to-end serving solution.

Unlike traditional scaling approaches, llm-d is optimized for the distinctive characteristics of LLM inference: requests are slow, expensive, and highly non-uniform. Through cache-aware routing, task disaggregation, and adaptive scaling, llm-d significantly improves throughput and efficiency, reduces latency, and supports diverse Quality of Service requirements.
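To make the cache-aware routing idea concrete, here is a minimal, hypothetical Python sketch (not llm-d's actual scheduler or IGW API): each replica is scored by how much of the request's token prefix is expected to already sit in its KV cache, weighed against its current load. The names (`Replica`, `pick_replica`) and weights are illustrative assumptions only.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    """A hypothetical model-server replica tracked by the router."""
    name: str
    active_requests: int
    # Hashes of token blocks believed to be resident in this replica's KV cache.
    cached_blocks: set = field(default_factory=set)

def prefix_blocks(tokens, block_size=16):
    """Hash the request prefix into fixed-size blocks, mirroring how a
    paged KV cache is keyed (illustrative only)."""
    return {
        hash(tuple(tokens[i:i + block_size]))
        for i in range(0, len(tokens) - block_size + 1, block_size)
    }

def pick_replica(replicas, tokens, cache_weight=1.0, load_weight=0.5):
    """Choose the replica with the best trade-off between expected
    KV-cache reuse (prefix overlap) and current load."""
    request_blocks = prefix_blocks(tokens)

    def score(r):
        overlap = len(request_blocks & r.cached_blocks)
        return cache_weight * overlap - load_weight * r.active_requests

    return max(replicas, key=score)
```

Routing requests that share a long prompt prefix to the replica that already holds that prefix avoids recomputing it, which is why cache-aware routing improves both latency and throughput compared to load-only balancing.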