Prefix-Aware LLM Inference Gateway: Enabling Intelligent KV Cache Routing for LLM Inference Clusters

An open-source OpenAI-compatible inference gateway that routes requests to GPU nodes holding matching KV caches via prefix-aware routing, achieving a 99% cache hit rate and reducing latency by 53% compared to round-robin strategies

LLM inference, KV cache, load balancing, prefix routing, vLLM, OpenAI gateway, multi-tenant, GPU optimization, cache hit rate

Published 2026-06-02 03:43 | Recent activity 2026-06-02 03:48 | Estimated read 5 min

Section 01

Prefix-Aware LLM Inference Gateway: Core Overview

This is an open-source, OpenAI-compatible inference gateway for LLM clusters. It addresses the statefulness of LLM inference by using prefix-aware routing to direct each request to a GPU node that already holds a matching KV cache. Key results: a 99% cache hit rate (vs 40% with round-robin) and 53% lower TTFT at low load, while keeping load balanced across nodes. It is positioned as a cross-platform alternative to Google GKE Inference Gateway.

Section 02

Background: Stateful Challenge in LLM Inference

LLM inference is stateful: during prefill, an engine such as vLLM stores KV caches in GPU memory and can reuse them for later requests that share the same prefix. Traditional stateless load balancers (such as round-robin) ignore this state and spread requests with a shared prefix across different nodes, which leads to repeated computation of identical prefixes, wasted VRAM, and high tail latency.

Section 03

Core Mechanisms: Prefix Affinity & Load Awareness

The gateway uses two layers of intelligent routing:

Prefix Affinity: A path-compressed radix tree stores token block hashes mapped to backend nodes, enabling longest prefix matching. Each backend maintains an LRU model mirroring its KV cache eviction.
Load Awareness: The gateway estimates TTFT for each candidate node as est_TTFT = queue_delay + prefill(uncached_tokens). When nodes are saturated, it falls back to minimal-load selection combined with round-robin to avoid concentrating traffic. Hot prefixes are automatically replicated to other nodes.
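The two routing layers above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the gateway's actual code: the names BLOCK_SIZE, PrefixIndex, Backend, and pick_backend are hypothetical, the index is a plain trie rather than a path-compressed radix tree, prefill cost is assumed to be linear in the number of uncached tokens, and the LRU mirror and hot-prefix replication are left out.

```python
import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 4  # tokens per block; hypothetical value chosen for the demo


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash full token blocks so each hash identifies the whole prefix up to it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for i in range(0, full, block_size):
        block = ",".join(map(str, tokens[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


class PrefixIndex:
    """Plain trie keyed by block hash (no path compression, for brevity)."""

    def __init__(self):
        self.root = {"children": {}, "nodes": set()}

    def insert(self, hashes, backend):
        node = self.root
        for h in hashes:
            node = node["children"].setdefault(h, {"children": {}, "nodes": set()})
            node["nodes"].add(backend)

    def match(self, hashes):
        """Return {backend name: number of leading blocks it is believed to cache}."""
        depth_by_backend, node = {}, self.root
        for depth, h in enumerate(hashes, start=1):
            node = node["children"].get(h)
            if node is None:
                break
            for backend in node["nodes"]:
                depth_by_backend[backend] = depth
        return depth_by_backend


@dataclass
class Backend:
    name: str
    queue_delay_ms: float
    prefill_ms_per_token: float  # assumed linear prefill cost


def pick_backend(tokens, backends, index):
    """Choose the backend with the lowest est_TTFT = queue_delay + prefill(uncached_tokens)."""
    cached_blocks = index.match(block_hashes(tokens))

    def est_ttft(b):
        uncached = len(tokens) - cached_blocks.get(b.name, 0) * BLOCK_SIZE
        return b.queue_delay_ms + uncached * b.prefill_ms_per_token

    return min(backends, key=est_ttft)


# Usage: gpu-a has already served the shared 16-token prefix.
index = PrefixIndex()
shared_prefix = list(range(16))
index.insert(block_hashes(shared_prefix), "gpu-a")
request = shared_prefix + [100, 101, 102]

chosen = pick_backend(
    request, [Backend("gpu-a", 10.0, 0.5), Backend("gpu-b", 5.0, 0.5)], index
)
chosen_busy = pick_backend(
    request, [Backend("gpu-a", 50.0, 0.5), Backend("gpu-b", 5.0, 0.5)], index
)
print(chosen.name, chosen_busy.name)
```

In the first call the cached prefix outweighs the slightly longer queue on gpu-a, so affinity wins. In the second call gpu-a's queue delay dominates and the request goes to gpu-b, even though it must recompute the prefix there.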

Section 04

Performance Validation: Cache Hit Rate & Latency

Cache Hit Rate: In a 3-node cluster, the gateway achieves a 99.3% hit rate (vs 40.5% for round-robin) for uniform documents, and 98.3% when measured end-to-end over HTTP.

Real GPU Test: With 2× A40 GPUs and Qwen2.5-1.5B-Instruct:

c=1: TTFT reduced by 53% (166.4 ms → 77.4 ms).
c=32: p95 latency improved by 9%, although mean latency rose by 5%.

Python vs Rust: The Rust version has ~52x higher throughput (10,922 req/s vs 209 req/s) and ~33x lower p99 latency (12.2 ms vs 399.5 ms).
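The headline percentages and ratios follow directly from the raw figures quoted above. A quick check, using two hypothetical helper functions (pct_reduction and ratio) and only the numbers reported in this section:

```python
def pct_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


def ratio(larger, smaller):
    """How many times larger one figure is than the other."""
    return larger / smaller


ttft_cut = pct_reduction(166.4, 77.4)  # TTFT at c=1, in ms
throughput_x = ratio(10922, 209)       # Rust vs Python, req/s
p99_x = ratio(399.5, 12.2)             # Python vs Rust p99, in ms

print(f"TTFT reduction: {ttft_cut:.1f}%")
print(f"Throughput ratio: {throughput_x:.1f}x")
print(f"p99 latency ratio: {p99_x:.1f}x")
```

The results round to the 53%, ~52x, and ~33x stated in the text.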

Section 05

Multi-Tenant Fairness & Advanced Features

Multi-Tenant: Per-tenant rate limiting ensures a 100% service rate for polite tenants (vs 16% with shared buckets).

Advanced Features:

Fault tolerance: Circuit breakers, safe failover before first byte.
Authentication: API key support, with keys stored as SHA-256 hashes rather than plaintext.
Observability: JSON logs, Prometheus metrics, OpenTelemetry tracing, Grafana dashboards.
Control plane: Runtime backend management, node maintenance mode, cluster config propagation via Redis/Gossip.
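To illustrate why per-tenant rate limiting protects polite tenants where a shared bucket does not, here is a minimal token-bucket sketch. It is an assumption about how such a limiter could look, not the gateway's implementation: TokenBucket, TenantLimiter, and the rate and capacity values are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    rate: float      # tokens refilled per second
    capacity: float  # maximum burst size
    tokens: float = field(init=False)
    updated: float = 0.0  # timestamp of the last refill

    def __post_init__(self):
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now):
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so a noisy tenant can only drain its own budget."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.buckets = {}

    def allow(self, tenant, now):
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, updated=now)
            self.buckets[tenant] = bucket
        return bucket.allow(now)


# Usage: a noisy tenant bursts 10 requests, then a polite tenant sends one.
limiter = TenantLimiter(rate=1.0, capacity=2.0)
noisy = [limiter.allow("noisy", now=0.0) for _ in range(10)]
polite = limiter.allow("polite", now=0.0)
print(noisy.count(True), polite)
```

With a single shared bucket, the noisy burst would have emptied the budget before the polite tenant's request arrived. With one bucket per tenant, the polite request is admitted regardless.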

Section 06

Architecture & Deployment Options

Architecture:

Data plane: Tenant identification → admission → block hash → radix tree match → TTFT-based selection → streaming.
Control plane: Metrics scraping → state reconciliation.

Deployment: Supports Docker Compose and includes benchmark tools (sim.py, e2e_inproc.py).

Section 07

Practical Value & Key Takeaways

Practical Significance: For teams running LLM clusters, the gateway reduces latency (through KV cache reuse), cuts costs (more requests served per unit of hardware), improves stability (fault tolerance), provides fair scheduling across tenants, and offers the observability and management APIs needed for production use.

Conclusion: The gateway addresses the statefulness of LLM inference with prefix-aware routing, achieving a near-perfect cache hit rate while keeping load balanced. Its mechanisms (radix tree matching, TTFT estimation) are applicable to other stateful services, which makes it a useful reference for LLM infrastructure teams.
