llm-d: Kubernetes-Native Distributed Inference at Scale

2025-05-21

llm-d is a Kubernetes-native distributed inference serving stack designed for efficient, cost-effective serving of large language models. It leverages cutting-edge distributed inference optimizations such as KV-cache-aware routing and disaggregated serving, integrated with the Kubernetes operational tooling in the Inference Gateway (IGW). Built on open technologies like vLLM, Kubernetes, and Inference Gateway, llm-d offers customizable scheduling, disaggregated serving and caching, and planned hardware-, workload-, and traffic-aware autoscaling. llm-d installs easily via a Helm chart, and users can also experiment with its individual components.
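To make the KV-cache-aware routing idea concrete, here is a minimal, illustrative sketch in Python of how a router might score replicas: it rewards replicas that likely already hold a request's prompt prefix in their KV cache and penalizes those with deeper queues. The `Replica` fields, the block-hashing scheme, and the weights are assumptions for illustration only, not llm-d's actual scheduler API.

```python
# Illustrative sketch only: a toy KV-cache-aware scorer, not llm-d's scheduler.
# Replica fields, weights, and the block-hash scheme are assumptions.
from dataclasses import dataclass, field
from hashlib import sha256


@dataclass
class Replica:
    name: str
    queue_depth: int                                  # requests currently waiting
    cached_blocks: set = field(default_factory=set)   # prefix-block hashes believed resident in KV cache


def block_hashes(prompt: str, block_size: int = 256) -> list:
    """Hash cumulative prompt prefixes in fixed-size steps so shared prefixes map to shared hashes."""
    return [
        sha256(prompt[: i + block_size].encode()).hexdigest()
        for i in range(0, len(prompt), block_size)
    ]


def score(replica: Replica, prompt: str, cache_weight: float = 1.0, load_weight: float = 0.5) -> float:
    """Higher is better: reward expected prefix-cache hits, penalize queue depth."""
    hits = sum(1 for h in block_hashes(prompt) if h in replica.cached_blocks)
    return cache_weight * hits - load_weight * replica.queue_depth


def pick_replica(replicas: list, prompt: str) -> Replica:
    """Route the request to the highest-scoring replica."""
    return max(replicas, key=lambda r: score(r, prompt))
```

For example, given two otherwise similar replicas where one has already served requests sharing the new prompt's prefix, this scorer favors the warm replica unless its queue is substantially deeper, which is the basic trade-off a KV-cache-aware router balances.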
