Section 01
llm-d: A Production-Grade LLM Inference Optimization Stack on Kubernetes (Introduction)
llm-d is a high-performance distributed inference stack for Kubernetes. It combines intelligent scheduling, prefill/decode disaggregation, expert parallelism, and hierarchical KV caching with model servers such as vLLM to deliver strong inference performance for open-source large language models on modern accelerators, while addressing production challenges such as high concurrency, multi-tenancy, and heterogeneous hardware.
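One of the techniques named above, prefill/decode disaggregation, splits request handling into two phases with different resource profiles: a compute-bound prefill pass that builds the KV cache from the prompt, and a memory-bandwidth-bound decode loop that generates tokens one at a time against that cache. The toy sketch below illustrates only this two-phase split; it is not llm-d's or vLLM's actual API, and all names in it (`KVCache`, `prefill`, `decode`) are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # One entry per processed token; stands in for per-layer key/value tensors.
    entries: list = field(default_factory=list)

def prefill(prompt_tokens):
    """Prefill phase: process the whole prompt in one batch, building the KV cache.
    In a disaggregated deployment this runs on a dedicated prefill worker."""
    cache = KVCache()
    for tok in prompt_tokens:
        cache.entries.append(f"kv({tok})")
    return cache

def decode(cache, steps):
    """Decode phase: generate tokens one at a time, reusing the (transferred)
    KV cache so no prompt recomputation is needed on the decode worker."""
    out = []
    for _ in range(steps):
        # Toy stand-in for attention: the next token depends on cached context size.
        tok = f"tok{len(cache.entries)}"
        out.append(tok)
        cache.entries.append(f"kv({tok})")
    return out

cache = prefill(["The", "quick", "brown"])  # compute-bound phase
generated = decode(cache, steps=2)          # memory-bandwidth-bound phase
print(generated)  # ['tok3', 'tok4']
```

Because the two phases stress hardware differently, running them on separate worker pools (and shipping the KV cache between them) lets each pool be sized and scheduled independently, which is the motivation for disaggregation in stacks like llm-d.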