Section 01
[Introduction] llm-d Inference Scheduler: A Cloud-Native Intelligent Routing System for LLM Inference
The llm-d Inference Scheduler is an intelligent routing system for large language model (LLM) inference requests, built on the Kubernetes Gateway API. It addresses routing challenges that arise in large-scale LLM deployments, where traditional load balancing cannot exploit KV cache reuse or Prefill/Decode disaggregation, by making routing decisions through pluggable filters, scorers, and fetchers. The scheduler supports multi-model deployment, KV cache locality optimization, and Prefill/Decode disaggregation, providing enterprise-grade scheduling for production LLM inference services.
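The filter-then-score pipeline described above can be sketched in Go as follows. This is a minimal illustration of the idea, not the actual llm-d plugin API: the types `Pod`, `Filter`, `Scorer`, `queueFilter`, and `cacheScorer` and the `Schedule` function are all hypothetical names invented for this example.

```go
package main

import "fmt"

// Pod is a simplified view of an inference endpoint
// (hypothetical type, not the real llm-d data model).
type Pod struct {
	Name         string
	QueueDepth   int
	CacheHitRate float64 // fraction of the prompt already in this pod's KV cache
}

// Filter removes pods that should not serve the request.
type Filter interface {
	Filter(pods []Pod) []Pod
}

// Scorer assigns each remaining pod a score; higher is better.
type Scorer interface {
	Score(p Pod) float64
}

// queueFilter drops pods whose request queue exceeds a threshold.
type queueFilter struct{ max int }

func (f queueFilter) Filter(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.QueueDepth <= f.max {
			out = append(out, p)
		}
	}
	return out
}

// cacheScorer favors pods likely to reuse an existing KV cache.
type cacheScorer struct{}

func (cacheScorer) Score(p Pod) float64 { return p.CacheHitRate }

// Schedule runs all filters, then picks the highest-scoring survivor.
func Schedule(pods []Pod, filters []Filter, scorer Scorer) (Pod, bool) {
	for _, f := range filters {
		pods = f.Filter(pods)
	}
	if len(pods) == 0 {
		return Pod{}, false
	}
	best := pods[0]
	for _, p := range pods[1:] {
		if scorer.Score(p) > scorer.Score(best) {
			best = p
		}
	}
	return best, true
}

func main() {
	pods := []Pod{
		{"pod-a", 10, 0.9}, // warm cache, but overloaded
		{"pod-b", 2, 0.7},
		{"pod-c", 1, 0.1},
	}
	// pod-a is removed by the queue filter; pod-b wins on cache locality.
	best, ok := Schedule(pods, []Filter{queueFilter{max: 5}}, cacheScorer{})
	fmt.Println(ok, best.Name) // prints "true pod-b"
}
```

The design point this sketch captures is composability: hard constraints (capacity, model availability) live in filters, while soft preferences (KV cache locality, load) live in scorers, so new routing policies can be added without touching the core scheduling loop.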