Zing Forum

Prefix-Aware LLM Inference Gateway: Enabling Intelligent KV Cache Routing for LLM Inference Clusters

An open-source OpenAI-compatible inference gateway that routes requests to GPU nodes holding matching KV caches via prefix-aware routing, achieving a 99% cache hit rate and reducing latency by 53% compared to round-robin strategies

LLM inference, KV cache, load balancing, prefix routing, vLLM, OpenAI, gateway, multi-tenancy, GPU optimization, cache hit rate
Published 2026-06-02 03:43 | Recent activity 2026-06-02 03:48 | Estimated read 5 min

Section 01

Prefix-Aware LLM Inference Gateway: Core Overview

This is an open-source, OpenAI-compatible inference gateway for LLM clusters. It addresses the statefulness of LLM inference by using prefix-aware routing to send each request to a GPU node that already holds a matching KV cache. Reported results: a 99% cache hit rate (vs. 40% with round-robin), 53% lower latency (TTFT) at low load, and preserved load balance across nodes. It is positioned as a cross-platform alternative to Google GKE Inference Gateway.

Section 02

Background: Stateful Challenge in LLM Inference

LLM inference is stateful: during prefill, an engine such as vLLM stores KV caches in GPU memory and can reuse them for later requests that share the same prefix. Traditional stateless load balancers (such as round-robin) ignore this state and spread requests with identical prefixes across different nodes, which leads to repeated prefill computation, wasted VRAM, and high tail latency.
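
To make the effect concrete, here is a toy simulation, not a reproduction of the gateway's benchmarks. It assumes unbounded per-node caches with no eviction, ten distinct document prefixes requested ten times each, and a hypothetical `simulate` helper. It only illustrates why a stateless policy recomputes the same prefix on several nodes.

```python
def simulate(route, num_nodes=3, num_docs=10, rounds=10):
    """Count prefix-cache hits for a routing function route(request_index, doc_id)."""
    caches = [set() for _ in range(num_nodes)]  # assumed: unbounded, no eviction
    hits = 0
    for r in range(rounds):
        for doc in range(num_docs):
            i = r * num_docs + doc
            node = route(i, doc) % num_nodes
            if doc in caches[node]:
                hits += 1
            else:
                caches[node].add(doc)  # miss: this node prefills and caches the prefix
    return hits


# Round-robin ignores the prefix; affinity routing pins each prefix to one node.
round_robin_hits = simulate(lambda i, doc: i)
affinity_hits = simulate(lambda i, doc: doc)
print(round_robin_hits, affinity_hits)
```

In this toy setup round-robin pays one miss per prefix per node, while affinity routing pays one miss per prefix in total. Real hit rates depend on cache capacity, eviction, and traffic shape.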

Section 03

Core Mechanisms: Prefix Affinity & Load Awareness

The gateway uses two layers of intelligent routing:

  1. Prefix Affinity: A path-compressed radix tree maps token block hashes to backend nodes, which enables longest-prefix matching. For each backend, the gateway maintains an LRU model that mirrors that backend's KV cache eviction.
  2. Load Awareness: The gateway estimates TTFT per node (est_TTFT = queue_delay + prefill(uncached_tokens)). When nodes are saturated, it falls back to least-load selection with round-robin to avoid concentrating traffic. Hot prefixes are automatically replicated to other nodes.
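
The prefix-affinity lookup above can be sketched as follows. This is a minimal illustration under stated assumptions, not the gateway's actual code: the block size of 16 tokens, the chained SHA-256 block hash, and the names `block_hashes` and `PrefixIndex` are all hypothetical, and a plain trie stands in for the path-compressed radix tree.

```python
import hashlib

BLOCK_SIZE = 16  # assumed tokens per KV block (hypothetical value)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for i in range(0, full, block_size):
        block = tokens[i:i + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


class PrefixIndex:
    """Plain trie keyed by block hash; a radix tree would also compress paths."""

    def __init__(self):
        self.root = {"children": {}, "backends": set()}

    def insert(self, tokens, backend):
        """Record that `backend` holds KV blocks for this token sequence."""
        node = self.root
        for h in block_hashes(tokens):
            node = node["children"].setdefault(h, {"children": {}, "backends": set()})
            node["backends"].add(backend)

    def longest_match(self, tokens):
        """Return {backend: number of leading blocks that backend is believed to hold}."""
        node, depth, best = self.root, 0, {}
        for h in block_hashes(tokens):
            node = node["children"].get(h)
            if node is None:
                break
            depth += 1
            for backend in node["backends"]:
                best[backend] = depth
        return best


index = PrefixIndex()
shared = list(range(48))
index.insert(shared, "gpu-0")
index.insert(shared[:16] + [7] * 16, "gpu-1")
print(index.longest_match(shared[:32] + [999] * 16))
```

The match counts can then be turned into cached-token estimates (blocks × block size) for the load-aware step. Eviction modeling is omitted here.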

Section 04

Performance Validation: Cache Hit Rate & Latency

Cache Hit Rate: In a 3-node cluster, the gateway achieves a 99.3% hit rate (vs. 40.5% with round-robin) for uniform docs, and 98.3% when measured end-to-end over HTTP.

Real GPU Test: With 2× A40 GPUs and Qwen2.5-1.5B-Instruct:

  • c=1: TTFT reduced by 53% (166.4 ms → 77.4 ms).
  • c=32: p95 latency reduced by 9%, although mean latency rose by 5%.

Python vs Rust: The Rust version has ~52x higher throughput (10,922 req/s vs. 209 req/s) and ~33x lower p99 latency (12.2 ms vs. 399.5 ms).
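
As a quick sanity check, the headline ratios can be recomputed from the raw figures quoted above. The helper name `pct_reduction` is just for illustration.

```python
def pct_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


ttft_cut = pct_reduction(166.4, 77.4)  # TTFT at c=1, in ms
throughput_ratio = 10922 / 209         # Rust vs Python, req/s
p99_ratio = 399.5 / 12.2               # Python vs Rust p99, in ms

print(f"{ttft_cut:.1f}% {throughput_ratio:.1f}x {p99_ratio:.1f}x")
# prints: 53.5% 52.3x 32.7x
```

These agree with the rounded values of 53%, ~52x, and ~33x.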

Section 05

Multi-Tenant Fairness & Advanced Features

Multi-Tenant: Per-tenant rate limiting keeps the service rate for polite tenants at 100% (vs. 16% with a shared bucket).

Advanced Features:

  • Fault tolerance: Circuit breakers, safe failover before first byte.
  • Authentication: API keys are stored as SHA-256 hashes (no plaintext).
  • Observability: JSON logs, Prometheus metrics, OpenTelemetry tracing, Grafana dashboards.
  • Control plane: Runtime backend management, node maintenance mode, cluster config propagation via Redis/Gossip.
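
The per-tenant rate limiting described in this section can be sketched with one token bucket per tenant, so a noisy tenant only drains its own budget. This is a minimal sketch, assuming a token-bucket algorithm (the source says only "buckets"); the class names, rate, and capacity are hypothetical, and `now` is passed explicitly only to make the example deterministic.

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`; each request costs one token."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TenantLimiter:
    """Keeps an independent bucket per tenant instead of one shared bucket."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.buckets = {}

    def allow(self, tenant, now=None):
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = self.buckets[tenant] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)


limiter = TenantLimiter(rate=1, capacity=2)
noisy_allowed = sum(limiter.allow("noisy", now=0.0) for _ in range(5))
polite_allowed = limiter.allow("polite", now=0.0)
print(noisy_allowed, polite_allowed)
```

With a single shared bucket, the noisy tenant's burst would have consumed the tokens the polite tenant needed. Separate buckets isolate them.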

Section 06

Architecture & Deployment Options

Architecture:

  • Data plane: Tenant identification → admission → block hash → radix tree match → TTFT-based selection → streaming.
  • Control plane: Metrics scraping → state reconciliation.

Deployment: Supports Docker Compose and ships with benchmark tools (sim.py, e2e_inproc.py).
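
The TTFT-based selection step of the data plane can be sketched from the formula est_TTFT = queue_delay + prefill(uncached_tokens). This is a rough sketch, not the gateway's implementation: the linear prefill cost model, the `Backend` fields, and the function names are assumptions, and the saturation fallback is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    queue_delay_ms: float        # assumed: derived from scraped metrics
    prefill_ms_per_token: float  # assumed: linear prefill cost model


def est_ttft_ms(backend, prompt_tokens, cached_tokens):
    """est_TTFT = queue_delay + prefill(uncached_tokens)."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return backend.queue_delay_ms + uncached * backend.prefill_ms_per_token


def pick_backend(backends, prompt_tokens, cached_by_backend):
    """Choose the backend with the lowest estimated TTFT for this request."""
    return min(
        backends,
        key=lambda b: est_ttft_ms(b, prompt_tokens, cached_by_backend.get(b.name, 0)),
    )


warm = Backend("gpu-0", queue_delay_ms=5.0, prefill_ms_per_token=0.1)
cold = Backend("gpu-1", queue_delay_ms=0.0, prefill_ms_per_token=0.1)
cached = {"gpu-0": 900}  # gpu-0 is believed to hold 900 of the 1000 prompt tokens

print(pick_backend([warm, cold], 1000, cached).name)  # warm cache wins despite its queue

busy = Backend("gpu-0", queue_delay_ms=200.0, prefill_ms_per_token=0.1)
print(pick_backend([busy, cold], 1000, cached).name)  # a long queue outweighs the cache
```

Because queue delay enters the estimate directly, a node with a matching cache stops being preferred once its queue costs more than recomputing the prefix elsewhere.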

Section 07

Practical Value & Key Takeaways

Practical Significance: For teams running LLM clusters, the gateway reduces latency (via KV reuse), cuts costs (more requests served on the same hardware), improves stability (fault tolerance), ensures fair scheduling across tenants, and is production-ready (observability and management APIs).

Conclusion: The gateway addresses the statefulness of LLM inference with prefix-aware routing, achieving a near-perfect cache hit rate while keeping load balanced. Its mechanisms (radix tree, TTFT estimation) are applicable to other stateful services. It is a valuable option for LLM infrastructure teams.