Zing Forum

llm-d: A Production-Grade LLM Inference Optimization Stack on Kubernetes

llm-d is a high-performance distributed inference service stack for Kubernetes. It leverages technologies like intelligent scheduling, prefill/decode separation, expert parallelism, and hierarchical KV caching to help users achieve state-of-the-art inference performance for open-source large models on modern accelerators.

Tags: LLM Inference · Kubernetes · vLLM · Distributed Systems · Model Serving · GPU Optimization · MoE · Autoscaling
Published 2026-04-02 21:42 · Recent activity 2026-04-02 21:50 · Estimated read: 6 min
Section 01

Introduction

llm-d is a high-performance distributed inference service stack for Kubernetes. By combining technologies such as intelligent scheduling, prefill/decode separation, expert parallelism, and hierarchical KV caching with model servers like vLLM, it achieves advanced inference performance for open-source large models on modern accelerators, addressing challenges like high concurrency, multi-tenancy, and heterogeneous hardware in production environments.

Section 02

Project Background and Positioning

Model servers like vLLM and SGLang already run large language models efficiently, but production environments add demands such as high concurrency, multi-tenancy, heterogeneous hardware, and cost optimization, which call for a more intelligent orchestration layer above the model server. llm-d does not reinvent the model server; instead, it supplies that orchestration layer, enabling efficient engines like vLLM to handle large-scale real-world traffic.

Section 03

Detailed Explanation of Core Optimization Technologies

Intelligent Inference Scheduling

llm-d deploys an Envoy-based intelligent load balancer that supports prefix-cache-aware routing, utilization-based load balancing, multi-tenant fairness and priority, and predictive latency balancing (experimental).
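The idea behind prefix-cache-aware routing can be sketched in a few lines: score each replica by how much of the incoming prompt's KV cache it already holds, minus a load penalty. This is a minimal illustrative sketch, not the actual llm-d scheduler API; all names, the block size, and the weights are assumptions.

```python
# Illustrative sketch of prefix-cache-aware routing (hypothetical names,
# not the real llm-d API). Each replica tracks hashes of the block-aligned
# prompt prefixes it has cached; the router balances expected cache reuse
# against current load.
from dataclasses import dataclass, field

BLOCK = 16  # tokens per KV-cache block (illustrative value)

@dataclass
class Replica:
    name: str
    load: float                                # fraction of capacity in use, 0..1
    cached: set = field(default_factory=set)   # hashes of cached prefix blocks

def prefix_hashes(tokens):
    """Hash every block-aligned prefix, mimicking block-level prefix caching."""
    return {hash(tuple(tokens[: i + BLOCK])) for i in range(0, len(tokens), BLOCK)}

def pick_replica(replicas, tokens, cache_weight=1.0, load_weight=0.5):
    """Pick the replica with the best cache-hit / load trade-off."""
    wanted = prefix_hashes(tokens)

    def score(r):
        hit_ratio = len(wanted & r.cached) / max(len(wanted), 1)
        return cache_weight * hit_ratio - load_weight * r.load

    return max(replicas, key=score)
```

Under this scoring, a heavily loaded replica can still win a request whose prefix it has fully cached, because skipping prefill often saves more than the load penalty costs; the weights control that trade-off.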

Prefill/Decode Separation

Prefill (prompt processing) and decode (token generation) run on independent instance pools, reducing Time to First Token (TTFT) and Time per Output Token (TPOT). Under a 16×16 B200 topology, llm-d reaches 50k output tokens per second.
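The two metrics above are worth pinning down, since they motivate the separation: prefill work dominates TTFT, while decode throughput determines TPOT. A minimal sketch of how they are computed from token timestamps (function name and signature are illustrative):

```python
def ttft_and_tpot(request_start, token_times):
    """Compute the two core latency metrics from per-token emit timestamps.

    TTFT (Time to First Token) = delay until the first token appears,
    dominated by the prefill phase. TPOT (Time per Output Token) = mean
    gap between subsequent tokens, determined by the decode phase.
    """
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot
```

Because the two metrics are bound by different phases, serving both from one instance forces a compromise; dedicating separate pools lets each be tuned (batch sizes, parallelism) for its own bottleneck.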

Wide Expert Parallelism

For MoE models such as DeepSeek-R1, llm-d combines data parallelism with expert parallelism, achieving decode performance of approximately 3.1k tokens per second on B200.
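At the heart of expert parallelism is top-k gating: each token is routed to its k highest-scoring experts, and each expert (potentially on a different EP rank) processes only the tokens assigned to it. A hedged, dependency-free sketch of that dispatch step (names and shapes are hypothetical, not the llm-d implementation):

```python
def route_tokens(gate_scores, k=2):
    """Top-k expert routing sketch for an MoE layer.

    gate_scores: one list of per-expert scores per token.
    Returns a mapping expert_id -> list of token indices routed to it,
    i.e. the dispatch plan that expert parallelism shards across ranks.
    """
    assignment = {}
    for t, scores in enumerate(gate_scores):
        # Indices of the k highest-scoring experts for this token.
        topk = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        for e in topk:
            assignment.setdefault(e, []).append(t)
    return assignment
```

"Wide" expert parallelism spreads these expert shards across many GPUs, so the all-to-all exchange implied by this dispatch table, rather than the expert math itself, becomes the thing to optimize.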

Hierarchical KV Prefix Caching

KV cache blocks are offloaded to CPU memory, local SSD, and remote file systems, improving cache hit rates and reducing inference cost.
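The lookup path of such a hierarchy can be sketched as a tiered cache that searches fast storage first and promotes hits upward. This is an illustrative model only; tier names, the promotion policy, and the class itself are assumptions, not the llm-d data structures.

```python
class TieredKVCache:
    """Sketch of hierarchical KV prefix caching: check the fastest tier
    first (GPU HBM), then CPU RAM, then SSD/remote; promote hits so hot
    prefixes migrate toward fast memory."""

    def __init__(self):
        self.order = ["gpu", "cpu", "ssd"]            # fastest to slowest
        self.tiers = {name: {} for name in self.order}

    def put(self, key, value, tier="gpu"):
        self.tiers[tier][key] = value

    def get(self, key):
        """Return (value, tier_found_in); promote the entry to the top tier."""
        for tier in self.order:
            if key in self.tiers[tier]:
                value = self.tiers[tier].pop(key)
                self.tiers[self.order[0]][key] = value  # promote to GPU tier
                return value, tier
        return None, None  # full miss: prefill must recompute the KV blocks
```

Even a slow-tier hit is usually a win: reloading cached KV blocks over SSD or the network is typically cheaper than recomputing prefill for a long shared prefix.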

Auto-scaling

Two solutions are provided: the Workload Variant Autoscaler (for heterogeneous hardware and multiple models) and HPA driven by IGW metrics (for independent scaling on homogeneous hardware).
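The HPA-based path follows the standard Kubernetes HPA scaling rule, with the Inference Gateway supplying an inference-specific metric (e.g., queue depth or KV-cache utilization). The formula below is the documented HPA calculation; the function wrapper and clamp bounds are illustrative.

```python
import math

def desired_replicas(current, metric_value, target_value, min_r=1, max_r=64):
    """Standard Kubernetes HPA rule:
        desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
    clamped to [min_r, max_r]. With IGW, currentMetric would be an
    inference-aware signal rather than plain CPU utilization.
    """
    desired = math.ceil(current * metric_value / target_value)
    return max(min_r, min(max_r, desired))
```

For example, 4 replicas observing a metric at 150% of target scale to 6; the same formula scales back down when load subsides, within the configured bounds.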

Section 04

Technical Architecture and Ecosystem Integration

llm-d is deeply integrated with the open-source ecosystem: vLLM is the default model server (the team contributes optimizations upstream); Kubernetes Inference Gateway serves as the control-plane API and load-balancing orchestrator; Envoy proxy provides extensible load-balancing strategies; and NIXL supports fast interconnects (InfiniBand/RoCE RDMA, etc.) for KV-cache transfer.

Section 05

Version Evolution and Performance Data

v0.5 (February 2026): introduced benchmarking, hierarchical KV offloading, cache-aware LoRA routing, and more. Decode performance on B200 is approximately 3.1k tokens per second, and the 16×16 topology reaches 50k output tokens per second.
v0.4 (December 2025): reduced DeepSeek V3.1 latency by 40% on H200, added prefill/decode separation support for Intel XPU and Google TPU, and offloaded prefix caching to vLLM's native CPU memory hierarchy.

Section 06

Quick Start and Production Deployment

llm-d provides detailed quick-start guides and Helm charts; new users can begin by deploying the inference scheduler together with vLLM. Production deployment emphasizes observability and elasticity, offering monitoring metrics, health checks, and multi-replica semantics.

Section 07

Summary and Outlook

llm-d represents the evolution of LLM inference optimization from single-point techniques to a systematic platform. It integrates these techniques into a Kubernetes-native architecture, providing a high-performance foundation for serving large models in production. Teams building or optimizing LLM inference infrastructure would do well to evaluate and trial it in depth.