Zing Forum

llm-d: A Production-Grade LLM Inference Optimization Stack on Kubernetes

llm-d is a high-performance distributed inference service stack for Kubernetes. It leverages technologies like intelligent scheduling, prefill/decode separation, expert parallelism, and hierarchical KV caching to help users achieve state-of-the-art inference performance for open-source large models on modern accelerators.

Tags: LLM Inference · Kubernetes · vLLM · Distributed Systems · Model Serving · GPU Optimization · MoE · Autoscaling
Published 2026-04-02 21:42 · Recent activity 2026-04-02 21:50 · Estimated read: 6 min
Section 01

Introduction

llm-d is a high-performance distributed inference service stack for Kubernetes. By combining technologies such as intelligent scheduling, prefill/decode separation, expert parallelism, and hierarchical KV caching with model servers like vLLM, it achieves advanced inference performance for open-source large models on modern accelerators, addressing challenges like high concurrency, multi-tenancy, and heterogeneous hardware in production environments.

Section 02

Project Background and Positioning

Model servers like vLLM and SGLang already run large language models efficiently, but production environments add demands such as high concurrency, multi-tenancy, heterogeneous hardware, and cost optimization, which call for a more intelligent orchestration layer above the model server. llm-d does not reinvent the model server; instead, it supplies that orchestration layer, enabling efficient engines like vLLM to handle large-scale real-world traffic.

Section 03

Detailed Explanation of Core Optimization Technologies

Intelligent Inference Scheduling

llm-d deploys an Envoy-based intelligent load balancer that supports prefix-cache-aware routing, utilization-based load balancing, multi-tenant fairness and priority, and predictive latency balancing (experimental).
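The idea behind prefix-cache-aware routing can be sketched in a few lines: score each replica by how much of the incoming prompt's KV cache it already holds, minus a load penalty. This is a minimal illustrative sketch, not the actual llm-d scheduler API; all names, the block size, and the weights are assumptions.

```python
# Illustrative sketch of prefix-cache-aware routing (hypothetical names,
# not the real llm-d API). Each replica tracks hashes of the block-aligned
# prompt prefixes it has cached; the router balances expected cache reuse
# against current load.
from dataclasses import dataclass, field

BLOCK = 16  # tokens per KV-cache block (illustrative value)

@dataclass
class Replica:
    name: str
    load: float                                # fraction of capacity in use, 0..1
    cached: set = field(default_factory=set)   # hashes of cached prefix blocks

def prefix_hashes(tokens):
    """Hash every block-aligned prefix, mimicking block-level prefix caching."""
    return {hash(tuple(tokens[: i + BLOCK])) for i in range(0, len(tokens), BLOCK)}

def pick_replica(replicas, tokens, cache_weight=1.0, load_weight=0.5):
    """Pick the replica with the best cache-hit / load trade-off."""
    wanted = prefix_hashes(tokens)

    def score(r):
        hit_ratio = len(wanted & r.cached) / max(len(wanted), 1)
        return cache_weight * hit_ratio - load_weight * r.load

    return max(replicas, key=score)
```

Under this scoring, a heavily loaded replica can still win a request whose prefix it has fully cached, because skipping prefill often saves more than the load penalty costs; the weights control that trade-off.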

Prefill/Decode Separation

Prefill (prompt processing) and decode (token generation) run on independent instance pools, reducing Time to First Token (TTFT) and Time per Output Token (TPOT). Under a 16×16 B200 topology, llm-d reaches 50k output tokens per second.
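The two metrics above are worth pinning down, since they motivate the separation: prefill work dominates TTFT, while decode throughput determines TPOT. A minimal sketch of how they are computed from token timestamps (function name and signature are illustrative):

```python
def ttft_and_tpot(request_start, token_times):
    """Compute the two core latency metrics from per-token emit timestamps.

    TTFT (Time to First Token) = delay until the first token appears,
    dominated by the prefill phase. TPOT (Time per Output Token) = mean
    gap between subsequent tokens, determined by the decode phase.
    """
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot
```

Because the two metrics are bound by different phases, serving both from one instance forces a compromise; dedicating separate pools lets each be tuned (batch sizes, parallelism) for its own bottleneck.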

Wide Expert Parallelism

For MoE models such as DeepSeek-R1, llm-d combines data parallelism with expert parallelism, achieving decode performance of approximately 3.1k tokens per second on B200.
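At the heart of expert parallelism is top-k gating: each token is routed to its k highest-scoring experts, and each expert (potentially on a different EP rank) processes only the tokens assigned to it. A hedged, dependency-free sketch of that dispatch step (names and shapes are hypothetical, not the llm-d implementation):

```python
def route_tokens(gate_scores, k=2):
    """Top-k expert routing sketch for an MoE layer.

    gate_scores: one list of per-expert scores per token.
    Returns a mapping expert_id -> list of token indices routed to it,
    i.e. the dispatch plan that expert parallelism shards across ranks.
    """
    assignment = {}
    for t, scores in enumerate(gate_scores):
        # Indices of the k highest-scoring experts for this token.
        topk = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        for e in topk:
            assignment.setdefault(e, []).append(t)
    return assignment
```

"Wide" expert parallelism spreads these expert shards across many GPUs, so the all-to-all exchange implied by this dispatch table, rather than the expert math itself, becomes the thing to optimize.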

Hierarchical KV Prefix Caching

KV cache blocks are offloaded to CPU memory, local SSD, and remote file systems, improving cache hit rates and reducing inference cost.
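The lookup path of such a hierarchy can be sketched as a tiered cache that searches fast storage first and promotes hits upward. This is an illustrative model only; tier names, the promotion policy, and the class itself are assumptions, not the llm-d data structures.

```python
class TieredKVCache:
    """Sketch of hierarchical KV prefix caching: check the fastest tier
    first (GPU HBM), then CPU RAM, then SSD/remote; promote hits so hot
    prefixes migrate toward fast memory."""

    def __init__(self):
        self.order = ["gpu", "cpu", "ssd"]            # fastest to slowest
        self.tiers = {name: {} for name in self.order}

    def put(self, key, value, tier="gpu"):
        self.tiers[tier][key] = value

    def get(self, key):
        """Return (value, tier_found_in); promote the entry to the top tier."""
        for tier in self.order:
            if key in self.tiers[tier]:
                value = self.tiers[tier].pop(key)
                self.tiers[self.order[0]][key] = value  # promote to GPU tier
                return value, tier
        return None, None  # full miss: prefill must recompute the KV blocks
```

Even a slow-tier hit is usually a win: reloading cached KV blocks over SSD or the network is typically cheaper than recomputing prefill for a long shared prefix.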

Auto-scaling

Two solutions are provided: the Workload Variant Autoscaler (for heterogeneous hardware and multiple models) and HPA driven by IGW metrics (for independent scaling on homogeneous hardware).
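The HPA-based path follows the standard Kubernetes HPA scaling rule, with the Inference Gateway supplying an inference-specific metric (e.g., queue depth or KV-cache utilization). The formula below is the documented HPA calculation; the function wrapper and clamp bounds are illustrative.

```python
import math

def desired_replicas(current, metric_value, target_value, min_r=1, max_r=64):
    """Standard Kubernetes HPA rule:
        desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
    clamped to [min_r, max_r]. With IGW, currentMetric would be an
    inference-aware signal rather than plain CPU utilization.
    """
    desired = math.ceil(current * metric_value / target_value)
    return max(min_r, min(max_r, desired))
```

For example, 4 replicas observing a metric at 150% of target scale to 6; the same formula scales back down when load subsides, within the configured bounds.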

Section 04

Technical Architecture and Ecosystem Integration

llm-d is deeply integrated with the open-source ecosystem: vLLM is the default model server (the team contributes optimizations upstream); Kubernetes Inference Gateway serves as the control-plane API and load-balancing orchestrator; Envoy proxy provides extensible load-balancing strategies; and NIXL supports fast interconnects (InfiniBand/RoCE RDMA, etc.) for KV-cache transfer.

Section 05

Version Evolution and Performance Data

v0.5 (February 2026): introduced benchmarking, hierarchical KV offloading, cache-aware LoRA routing, and more. Decode performance on B200 is approximately 3.1k tokens per second, and the 16×16 topology reaches 50k output tokens per second.
v0.4 (December 2025): reduced DeepSeek V3.1 latency by 40% on H200, added prefill/decode separation support for Intel XPU and Google TPU, and offloaded prefix caching to vLLM's native CPU memory hierarchy.

Section 06

Quick Start and Production Deployment

llm-d provides detailed quick-start guides and Helm charts; new users can begin by deploying the inference scheduler together with vLLM. Production deployment emphasizes observability and elasticity, offering monitoring metrics, health checks, and multi-replica semantics.

Section 07

Summary and Outlook

llm-d represents the evolution of LLM inference optimization from single-point techniques to a systematic platform. It integrates these techniques into a Kubernetes-native architecture, providing a high-performance foundation for serving large models in production. Teams building or optimizing LLM inference infrastructure would do well to evaluate and trial it in depth.