Zing Forum

Kubernetes-Native Solution for LLM Inference Clusters: KV-Aware Routing and Sharding Management

A Kubernetes Operator-based LLM inference cluster management system that implements prefill-decode separation, KV cache-aware routing, auto-scaling, and full observability support.

Tags: Kubernetes · Operator · LLM Inference · KV Cache · Auto-Scaling · Cloud-Native · vLLM · Sharding · Routing · Prefill · Decode
Published 2026-04-21 22:43 · Recent activity 2026-04-21 23:21 · Estimated read: 6 min

Section 01

Core Guide to Kubernetes-Native Solution for LLM Inference Clusters

This article introduces a Kubernetes Operator-based LLM inference cluster management system that addresses production-deployment challenges such as GPU resource management, long-conversation context consistency, and elastic scaling through declarative APIs and a cloud-native architecture. Key features include prefill-decode separation, KV cache-aware routing, auto-scaling, and full observability support.

Section 02

System Background and Dual-Plane Architecture Design

Production-grade deployment of large language models faces challenges such as efficient GPU resource management, long-conversation context consistency, and elastic scaling. This system adopts a dual-plane architecture separating the control plane from the data plane: the control plane centers on CRDs and manages cluster state via an Operator; the data plane separates prefill (processing prompts to build the KV cache) from decode (autoregressive token generation), allowing each stage to be optimized and scaled independently.
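To make the control-plane side concrete, here is a sketch of what such a CRD's spec could carry, modeled as Python dataclasses. Every field and value here (including the model name) is an illustrative assumption, not the project's actual API.

```python
from dataclasses import dataclass

@dataclass
class WorkerPool:
    replicas: int
    gpu_per_pod: int

@dataclass
class InferenceClusterSpec:
    """Hypothetical shape of the inference-cluster CR's spec."""
    model: str
    prefill: WorkerPool   # prompt processing: builds the KV cache
    decode: WorkerPool    # autoregressive generation: consumes the KV cache

# A sample spec the Operator would reconcile into Deployments and a shard map.
spec = InferenceClusterSpec(
    model="example-7b",
    prefill=WorkerPool(replicas=2, gpu_per_pod=1),
    decode=WorkerPool(replicas=4, gpu_per_pod=1),
)
```

The point of separate `prefill` and `decode` pools is that each can declare its own replica count, so the two stages scale independently.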

Section 03

KV-Aware Routing Ensures Long Conversation Consistency

In long-conversation scenarios, requests from the same conversation need to reach the same KV cache. The system achieves session affinity through a shard map (a ShardMap published as a ConfigMap) maintained by the Operator: the Router routes each request to the corresponding Pod based on its conversationId, ensuring that requests from the same conversation always land on the instance holding its KV cache. When Pods change, the Operator updates the shard map and the Router adjusts its routing in real time.
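The routing decision above can be sketched in a few lines of Python. The class and method names are hypothetical, and the deterministic hash fallback for conversations not yet in the map is one plausible policy, not necessarily the project's:

```python
import hashlib

class KVAwareRouter:
    """Sketch of conversationId-based session affinity over a shard map."""

    def __init__(self, shard_map, pods):
        # shard_map: conversationId -> Pod holding that conversation's KV cache
        # (in the real system, parsed from the Operator-published ConfigMap)
        self.shard_map = dict(shard_map)
        self.pods = list(pods)  # decode Pods currently ready

    def route(self, conversation_id):
        """Return the Pod that should serve this conversation."""
        pod = self.shard_map.get(conversation_id)
        if pod in self.pods:
            return pod
        # Unmapped conversation (or its Pod is gone): pick deterministically
        # by hash, so repeated requests keep landing on the same Pod until
        # the Operator republishes the shard map.
        digest = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16)
        pod = self.pods[digest % len(self.pods)]
        self.shard_map[conversation_id] = pod
        return pod

    def update_shard_map(self, new_map, pods):
        """Apply a republished ShardMap after the Operator reacts to Pod changes."""
        self.shard_map = dict(new_map)
        self.pods = list(pods)
```

For example, a router built with `{"demo-1": "decode-0"}` always returns `decode-0` for conversation `demo-1`, while a brand-new conversationId is pinned to one Pod and stays there across calls.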

Section 04

Auto-Scaling to Handle Load Fluctuations

The system supports scaling signals such as queue depth, token throughput, KV cache hit rate, and GPU memory pressure. When a signal exceeds its threshold, the Operator executes the following process: calculate the target replica count → update the Decode Deployment → wait for new Pods to become ready → update the shard map → notify the Router. Currently a ConfigMap is used to simulate signals, with interfaces reserved for integrating real metrics.
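A minimal sketch of the "calculate target replica count" step in the flow above; the proportional policy, the signal names, and the `max_replicas` cap are assumptions for illustration, not the project's actual algorithm:

```python
import math

def target_replicas(current, signals, thresholds, max_replicas=8):
    """Desired decode replica count given the current scaling signals.

    signals / thresholds: dicts keyed by signal name (e.g. queue_depth,
    tokens_per_sec). Signals where lower is worse (such as KV cache hit
    rate) would need to be inverted before being passed in.
    """
    # Scale out proportionally to the worst over-threshold ratio.
    worst = max(signals[k] / thresholds[k] for k in thresholds)
    if worst <= 1.0:
        return current  # all signals within threshold: keep replica count
    return min(max_replicas, math.ceil(current * worst))
```

After computing the target, the Operator would follow the rest of the arrow flow: patch the Decode Deployment's replica count, wait for readiness, republish the shard map, and let the Router pick up the change.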

Section 05

Full-Stack Observability Supports Operation and Maintenance Decisions

The system integrates Prometheus and Grafana to collect multi-dimensional metrics: Router (request latency, success rate, routing distribution), Prefill Worker (batch size, computation latency), and Decode Worker (generation latency, token throughput). Grafana provides dashboards for real-time QPS, GPU utilization, shard distribution, and more, and supports alert configuration.
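As an illustration of the kind of aggregation behind a latency panel, here is a nearest-rank quantile over recorded request latencies. This is only a sketch: a real deployment would export raw histograms through a Prometheus client library and let Prometheus/Grafana compute the quantiles, rather than doing it in the Router.

```python
def quantile(samples, q):
    """Nearest-rank quantile (0 <= q <= 1) over latency samples in seconds."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]

# Illustrative per-request latencies collected at the Router.
latencies = [0.12, 0.08, 0.30, 0.11, 0.95, 0.10, 0.22, 0.15, 0.09, 0.40]
p50 = quantile(latencies, 0.5)
p99 = quantile(latencies, 0.99)
```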

Section 06

Deployment Practice and Scenario Selection Recommendations

The deployment process is concise and relies on the cloud-native toolchain: the environment requires Docker, kubectl, kind, etc. Quick-start steps: create a kind cluster (./hack/kind-create.sh), load images (./hack/kind-load-images.sh), install the Operator (./hack/install-kind.sh), and create an inference cluster (apply the sample CR configuration). To verify the routing function, port-forward to the Router service and send a test request:

  curl -X POST localhost:8080/v1/chat/completions \
    -H 'content-type: application/json' \
    -d '{"conversationId":"demo-1","messages":[{"role":"user","content":"hi"}]}'

Applicable scenarios: multiple coexisting models/versions, long-conversation requirements, fine-grained resource management, and teams with existing Kubernetes operations experience. For simple scenarios, using vLLM or TGI directly is recommended; for complex scenarios, choose this solution.