Zing Forum

Kubernetes-Native Solution for LLM Inference Clusters: KV-Aware Routing and Sharding Management

A Kubernetes Operator-based LLM inference cluster management system that implements prefill-decode separation, KV cache-aware routing, auto-scaling, and full observability support.

Tags: Kubernetes · Operator · LLM Inference · KV Cache · Auto-Scaling · Cloud-Native · vLLM · Sharding · Routing · Prefill · Decode
Published 2026-04-21 22:43 · Recent activity 2026-04-21 23:21 · Estimated read: 6 min

Section 01

Core Guide to Kubernetes-Native Solution for LLM Inference Clusters

This article introduces a Kubernetes Operator-based LLM inference cluster management system that addresses production-deployment challenges such as GPU resource management, long-conversation context consistency, and elastic scaling through declarative APIs and a cloud-native architecture. Key features include prefill-decode separation, KV cache-aware routing, auto-scaling, and full observability support.

Section 02

System Background and Dual-Plane Architecture Design

Production-grade deployment of large language models faces challenges such as efficient GPU resource management, long-conversation context consistency, and elastic scaling. This system adopts a dual-plane architecture separating the control plane from the data plane: the control plane centers on CRDs and manages cluster state via an Operator; the data plane separates prefill (processing prompts to build the KV cache) from decode (autoregressive token generation), allowing each stage to be optimized and scaled independently.
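To make the control-plane side concrete, here is a sketch of what such a CRD's spec could carry, modeled as Python dataclasses. Every field and value here (including the model name) is an illustrative assumption, not the project's actual API.

```python
from dataclasses import dataclass

@dataclass
class WorkerPool:
    replicas: int
    gpu_per_pod: int

@dataclass
class InferenceClusterSpec:
    """Hypothetical shape of the inference-cluster CR's spec."""
    model: str
    prefill: WorkerPool   # prompt processing: builds the KV cache
    decode: WorkerPool    # autoregressive generation: consumes the KV cache

# A sample spec the Operator would reconcile into Deployments and a shard map.
spec = InferenceClusterSpec(
    model="example-7b",
    prefill=WorkerPool(replicas=2, gpu_per_pod=1),
    decode=WorkerPool(replicas=4, gpu_per_pod=1),
)
```

The point of separate `prefill` and `decode` pools is that each can declare its own replica count, so the two stages scale independently.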

Section 03

KV-Aware Routing Ensures Long Conversation Consistency

In long-conversation scenarios, requests from the same conversation need to reach the same KV cache. The system achieves session affinity through a shard map (a ShardMap published as a ConfigMap) maintained by the Operator: the Router routes each request to the corresponding Pod based on its conversationId, ensuring that requests from the same conversation always land on the instance holding its KV cache. When Pods change, the Operator updates the shard map and the Router adjusts its routing in real time.
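The routing decision above can be sketched in a few lines of Python. The class and method names are hypothetical, and the deterministic hash fallback for conversations not yet in the map is one plausible policy, not necessarily the project's:

```python
import hashlib

class KVAwareRouter:
    """Sketch of conversationId-based session affinity over a shard map."""

    def __init__(self, shard_map, pods):
        # shard_map: conversationId -> Pod holding that conversation's KV cache
        # (in the real system, parsed from the Operator-published ConfigMap)
        self.shard_map = dict(shard_map)
        self.pods = list(pods)  # decode Pods currently ready

    def route(self, conversation_id):
        """Return the Pod that should serve this conversation."""
        pod = self.shard_map.get(conversation_id)
        if pod in self.pods:
            return pod
        # Unmapped conversation (or its Pod is gone): pick deterministically
        # by hash, so repeated requests keep landing on the same Pod until
        # the Operator republishes the shard map.
        digest = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16)
        pod = self.pods[digest % len(self.pods)]
        self.shard_map[conversation_id] = pod
        return pod

    def update_shard_map(self, new_map, pods):
        """Apply a republished ShardMap after the Operator reacts to Pod changes."""
        self.shard_map = dict(new_map)
        self.pods = list(pods)
```

For example, a router built with `{"demo-1": "decode-0"}` always returns `decode-0` for conversation `demo-1`, while a brand-new conversationId is pinned to one Pod and stays there across calls.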

Section 04

Auto-Scaling to Handle Load Fluctuations

The system supports scaling signals such as queue depth, token throughput, KV cache hit rate, and GPU memory pressure. When a signal exceeds its threshold, the Operator executes the following process: calculate the target replica count → update the Decode Deployment → wait for new Pods to become ready → update the shard map → notify the Router. Currently a ConfigMap is used to simulate signals, with interfaces reserved for integrating real metrics.
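A minimal sketch of the "calculate target replica count" step in the flow above; the proportional policy, the signal names, and the `max_replicas` cap are assumptions for illustration, not the project's actual algorithm:

```python
import math

def target_replicas(current, signals, thresholds, max_replicas=8):
    """Desired decode replica count given the current scaling signals.

    signals / thresholds: dicts keyed by signal name (e.g. queue_depth,
    tokens_per_sec). Signals where lower is worse (such as KV cache hit
    rate) would need to be inverted before being passed in.
    """
    # Scale out proportionally to the worst over-threshold ratio.
    worst = max(signals[k] / thresholds[k] for k in thresholds)
    if worst <= 1.0:
        return current  # all signals within threshold: keep replica count
    return min(max_replicas, math.ceil(current * worst))
```

After computing the target, the Operator would follow the rest of the arrow flow: patch the Decode Deployment's replica count, wait for readiness, republish the shard map, and let the Router pick up the change.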

Section 05

Full-Stack Observability Supports Operation and Maintenance Decisions

The system integrates Prometheus and Grafana to collect multi-dimensional metrics: Router (request latency, success rate, routing distribution), Prefill Worker (batch size, computation latency), and Decode Worker (generation latency, token throughput). Grafana provides dashboards for real-time QPS, GPU utilization, shard distribution, and more, and supports alert configuration.
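As an illustration of the kind of aggregation behind a latency panel, here is a nearest-rank quantile over recorded request latencies. This is only a sketch: a real deployment would export raw histograms through a Prometheus client library and let Prometheus/Grafana compute the quantiles, rather than doing it in the Router.

```python
def quantile(samples, q):
    """Nearest-rank quantile (0 <= q <= 1) over latency samples in seconds."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]

# Illustrative per-request latencies collected at the Router.
latencies = [0.12, 0.08, 0.30, 0.11, 0.95, 0.10, 0.22, 0.15, 0.09, 0.40]
p50 = quantile(latencies, 0.5)
p99 = quantile(latencies, 0.99)
```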

Section 06

Deployment Practice and Scenario Selection Recommendations

The deployment process is concise and relies on the cloud-native toolchain: the environment requires Docker, kubectl, kind, etc. Quick-start steps: create a kind cluster (./hack/kind-create.sh), load images (./hack/kind-load-images.sh), install the Operator (./hack/install-kind.sh), and create an inference cluster (apply the sample CR configuration). To verify the routing function, port-forward to the Router service and send a test request:

  curl -X POST localhost:8080/v1/chat/completions \
    -H 'content-type: application/json' \
    -d '{"conversationId":"demo-1","messages":[{"role":"user","content":"hi"}]}'

Applicable scenarios: multiple coexisting models/versions, long-conversation requirements, fine-grained resource management, and teams with existing Kubernetes operations experience. For simple scenarios, using vLLM or TGI directly is recommended; for complex scenarios, choose this solution.