llm-d: Kubernetes-Native Distributed Inference at Scale

2025-05-21

llm-d is a Kubernetes-native distributed inference serving stack designed for efficient, cost-effective serving of large language models. It leverages cutting-edge distributed inference optimizations such as KV-cache-aware routing and disaggregated serving, integrated with the Kubernetes operational tooling in the Inference Gateway (IGW). Built on open technologies like vLLM, Kubernetes, and Inference Gateway, llm-d offers customizable scheduling, disaggregated serving and caching, and planned hardware-, workload-, and traffic-aware autoscaling. llm-d installs easily via a Helm chart, and users can also experiment with its individual components.
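To make the KV-cache-aware routing idea concrete, here is a minimal, illustrative sketch in Python of how a router might score replicas: it rewards replicas that likely already hold a request's prompt prefix in their KV cache and penalizes those with deeper queues. The `Replica` fields, the block-hashing scheme, and the weights are assumptions for illustration only, not llm-d's actual scheduler API.

```python
# Illustrative sketch only: a toy KV-cache-aware scorer, not llm-d's scheduler.
# Replica fields, weights, and the block-hash scheme are assumptions.
from dataclasses import dataclass, field
from hashlib import sha256


@dataclass
class Replica:
    name: str
    queue_depth: int                                  # requests currently waiting
    cached_blocks: set = field(default_factory=set)   # prefix-block hashes believed resident in KV cache


def block_hashes(prompt: str, block_size: int = 256) -> list:
    """Hash cumulative prompt prefixes in fixed-size steps so shared prefixes map to shared hashes."""
    return [
        sha256(prompt[: i + block_size].encode()).hexdigest()
        for i in range(0, len(prompt), block_size)
    ]


def score(replica: Replica, prompt: str, cache_weight: float = 1.0, load_weight: float = 0.5) -> float:
    """Higher is better: reward expected prefix-cache hits, penalize queue depth."""
    hits = sum(1 for h in block_hashes(prompt) if h in replica.cached_blocks)
    return cache_weight * hits - load_weight * replica.queue_depth


def pick_replica(replicas: list, prompt: str) -> Replica:
    """Route the request to the highest-scoring replica."""
    return max(replicas, key=lambda r: score(r, prompt))
```

For example, given two otherwise similar replicas where one has already served requests sharing the new prompt's prefix, this scorer favors the warm replica unless its queue is substantially deeper, which is the basic trade-off a KV-cache-aware router balances.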
