mini-llm-d: Intelligent LLM Inference Routing Based on KV Cache

An experimental project written in Go that implements intelligent LLM inference request routing by analyzing KV cache occupancy patterns, exploring the application of Layer 7 load balancing in AI inference scenarios.

LLM inference, load balancing, KV cache, Go, Layer 7 routing, model serving, inference optimization, Transformer

Published 2026-05-18 11:45 · Recent activity 2026-05-18 11:54 · Estimated read 6 min

Section 01

Core Introduction to the mini-llm-d Project

mini-llm-d is an experimental project written in Go that explores routing LLM inference requests according to KV cache occupancy. It targets a key engineering problem in deploying large language model services, namely deciding which backend instance should serve each request, and tests how Layer 7 load balancing applies to AI inference. Because GPU memory use in LLM inference depends on sequence length and the KV cache accumulates as tokens are generated, the project takes a different approach to routing than traditional web services do.

Section 02

Resource Characteristics of LLM Inference and Limitations of Traditional Routing

LLM inference consumes resources very differently from a traditional web service. Web load balancers rely on uniform metrics such as CPU and memory utilization, whereas the GPU memory footprint of LLM inference has two parts: a static part set by the model's parameter count, and a dynamic part, the KV cache, set by sequence length. The KV cache also keeps growing while a response is generated. Round-robin and least-connections strategies see none of this, so they can overload some GPUs while leaving others idle.

Section 03

Core Ideas and Technology Selection of mini-llm-d

The project's core hypothesis is that throughput can be maximized by routing each request according to its context length and the KV cache state of the backend instances. The author chose Go for its concurrency model, capable standard library, simple deployment, and balance of development efficiency and runtime performance. The project also doubles as the author's exercise in learning Go syntax and Layer 7 routing.
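To make the Layer 7 part concrete, here is a minimal sketch of a content-aware reverse proxy built only from Go's standard library. It is not taken from mini-llm-d: the names `poolFor`, `newRouter`, and `stubBackend`, the two pool names, and the 4096-byte threshold are all hypothetical, and body size is used only as a crude stand-in for prompt length. The backends are local stub servers so the example runs on its own.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// longPromptBytes is an assumed threshold separating "short" from "long" requests.
const longPromptBytes = 4096

// poolFor classifies a request by body size, a crude stand-in for prompt length.
// An unknown length (-1, e.g. chunked uploads) falls into the "short" pool.
func poolFor(contentLength int64) string {
	if contentLength > longPromptBytes {
		return "long"
	}
	return "short"
}

// newRouter returns a handler that forwards each request to the pool chosen by poolFor.
func newRouter(pools map[string]*url.URL) http.Handler {
	proxies := make(map[string]*httputil.ReverseProxy, len(pools))
	for name, u := range pools {
		proxies[name] = httputil.NewSingleHostReverseProxy(u)
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		p, ok := proxies[poolFor(r.ContentLength)]
		if !ok {
			http.Error(w, "no backend pool", http.StatusBadGateway)
			return
		}
		p.ServeHTTP(w, r)
	})
}

// stubBackend starts a local test server that replies with its own pool name.
func stubBackend(name string) (*httptest.Server, *url.URL) {
	s := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(io.Discard, r.Body) // drain the request body before replying
		fmt.Fprint(w, name)
	}))
	u, _ := url.Parse(s.URL)
	return s, u
}

func main() {
	short, su := stubBackend("short")
	defer short.Close()
	long, lu := stubBackend("long")
	defer long.Close()
	front := httptest.NewServer(newRouter(map[string]*url.URL{"short": su, "long": lu}))
	defer front.Close()

	for _, body := range []string{"hi", strings.Repeat("x", 8192)} {
		resp, err := http.Post(front.URL, "text/plain", strings.NewReader(body))
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		served, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%d-byte body -> %s pool\n", len(body), served)
	}
}
```

A real router would count tokens rather than bytes and would consult live backend state instead of a fixed threshold, but the overall shape (inspect the request, choose a target, proxy the call) stays the same.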

Section 04

KV Cache: The Hidden Bottleneck of LLM Inference

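The KV cache stores the keys and values that Transformer self-attention has already computed for earlier tokens, so they are not recomputed at every decoding step. Its size grows linearly with sequence length and model dimensions: 2 × L × H × D × N × sizeof(dtype) bytes per sequence, where L = number of layers, H = number of key/value heads, D = dimension per head, N = sequence length, and the factor 2 covers keys and values. For a model shaped like Llama 3 8B (L = 32, D = 128, FP16), counting all 32 attention heads gives about 4 GiB at an 8K context and 64 GiB at 128K. Llama 3 8B actually uses grouped-query attention with 8 key/value heads, which cuts these figures to about 1 GiB and 16 GiB. Either way, the cache grows linearly with context and at long contexts can rival the memory taken by the model weights (about 16 GB in FP16). This growth is the core problem the project targets.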
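The formula is easy to check with a few lines of Go. The sketch below is illustrative and not part of mini-llm-d; `KVCacheBytes` is a hypothetical helper, and the model shape (32 layers, head dimension 128, 2-byte FP16 elements) is an assumption chosen to resemble Llama 3 8B. It evaluates the formula once with 32 heads and once with 8 heads, the latter assuming grouped-query attention.

```go
package main

import "fmt"

// KVCacheBytes returns the per-sequence KV cache size in bytes:
// 2 (keys and values) x layers x kvHeads x headDim x seqLen x bytesPerElem.
func KVCacheBytes(layers, kvHeads, headDim, seqLen, bytesPerElem int64) int64 {
	return 2 * layers * kvHeads * headDim * seqLen * bytesPerElem
}

func main() {
	const gib = 1 << 30
	// Assumed shape: 32 layers, head dimension 128, FP16 (2 bytes per element).
	for _, heads := range []int64{32, 8} {
		for _, n := range []int64{8 * 1024, 128 * 1024} {
			b := KVCacheBytes(32, heads, 128, n, 2)
			fmt.Printf("kvHeads=%d seqLen=%d -> %.1f GiB\n", heads, n, float64(b)/gib)
		}
	}
}
```

With 32 heads this prints 4.0 GiB and 64.0 GiB for the 8K and 128K contexts; with 8 heads it prints 1.0 GiB and 16.0 GiB. These are per-sequence figures, so an instance serving many concurrent requests multiplies them accordingly.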

Section 05

Design Space of Intelligent Routing Strategies

The project explores several routing strategies: KV cache-based prediction (estimating a request's cache demand from its input length), dynamic load tracking (monitoring each instance's KV cache usage), request feature classification (assigning requests to instance groups by type), and hybrid strategies (combining queue length, predicted demand, and other signals). The author calls them "(un)intelligent", acknowledging that they are heuristics while still distinguishing them from traditional load balancing.
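A hybrid strategy of this kind might look like the following sketch. It is an assumption-laden illustration, not the project's actual algorithm: the `Backend` fields, the `PredictDemand` and `Pick` helpers, and the `queueWeight` constant are all hypothetical. Demand is predicted as prompt tokens plus an assumed output budget, backends without enough free KV capacity are skipped, and the rest are scored by projected cache utilization plus a queue-length penalty.

```go
package main

import "fmt"

// queueWeight is an assumed penalty added to the score per queued request.
const queueWeight = 0.05

// Backend is a hypothetical snapshot of one inference instance.
type Backend struct {
	Name       string
	KVCapacity int // total KV cache capacity, in tokens
	KVUsed     int // capacity currently occupied, in tokens
	QueueLen   int // requests waiting on this instance
}

// PredictDemand estimates the KV tokens a request will need:
// the prompt plus an assumed budget for generated tokens.
func PredictDemand(promptTokens, maxNewTokens int) int {
	return promptTokens + maxNewTokens
}

// Pick returns the index of the backend that can fit the demand and has the
// lowest score (projected utilization plus queue penalty), or -1 if none fits.
func Pick(backends []Backend, demand int) int {
	best, bestScore := -1, 0.0
	for i, b := range backends {
		free := b.KVCapacity - b.KVUsed
		if b.KVCapacity <= 0 || demand > free {
			continue // this instance cannot hold the request's KV cache
		}
		score := float64(b.KVUsed+demand)/float64(b.KVCapacity) +
			queueWeight*float64(b.QueueLen)
		if best == -1 || score < bestScore {
			best, bestScore = i, score
		}
	}
	return best
}

func main() {
	backends := []Backend{
		{Name: "gpu-a", KVCapacity: 10000, KVUsed: 9000, QueueLen: 0},
		{Name: "gpu-b", KVCapacity: 10000, KVUsed: 2000, QueueLen: 1},
		{Name: "gpu-c", KVCapacity: 10000, KVUsed: 1000, QueueLen: 5},
	}
	demand := PredictDemand(1000, 500)
	if i := Pick(backends, demand); i >= 0 {
		fmt.Printf("demand=%d tokens -> %s\n", demand, backends[i].Name)
	} else {
		fmt.Println("no backend has enough free KV cache")
	}
}
```

In this example gpu-a is skipped for lack of headroom, and gpu-b wins over the emptier gpu-c because gpu-c's longer queue outweighs its lower cache usage. A least-connections balancer would instead have picked gpu-a, which has no queue but cannot fit the request.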

Section 06

Project Limitations and Future Optimization Directions

As a learning project, mini-llm-d has several limitations: its KV demand prediction ignores the uncertainty of generation length; delays in state synchronization can make routing decisions stale; the state of a cold-started instance is hard to evaluate; complex scheduling such as priorities and SLAs is not supported; and heterogeneous models are not handled. A production-grade LLM gateway would need to address all of these.

Section 07

Learning Value and Expansion Possibilities

For developers, the project is a way to learn how to build a high-performance proxy in Go, how the KV cache shapes LLM resource usage, and how to implement Layer 7 routing. Possible extensions include integrating inference engines such as vLLM, more sophisticated scheduling algorithms such as shortest job first, Prometheus monitoring, multi-model routing, and request-level caching.
