# mini-llm-d: Intelligent LLM Inference Routing Based on KV Cache

> An experimental project written in Go that implements intelligent LLM inference request routing by analyzing KV cache occupancy patterns, exploring the application of Layer 7 load balancing in AI inference scenarios.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-18T03:45:47.000Z
- Last activity: 2026-05-18T03:54:11.372Z
- Popularity: 150.9
- Keywords: LLM inference, load balancing, KV cache, Go, Layer 7 routing, model serving, inference optimization, Transformer
- Page link: https://www.zingnex.cn/en/forum/thread/mini-llm-d-kvllm
- Canonical: https://www.zingnex.cn/forum/thread/mini-llm-d-kvllm
- Markdown source: floors_fallback

---

## Core Introduction to the mini-llm-d Project

mini-llm-d is an experimental project written in Go that explores routing LLM inference requests based on KV cache occupancy patterns. It targets a key engineering problem in deploying large language model services, namely deciding which backend instance should handle each request, and it explores how Layer 7 load balancing applies to AI inference. LLM inference has unusual resource characteristics: GPU memory usage is closely tied to sequence length, and the KV cache grows cumulatively as tokens are generated. The project therefore takes a routing approach different from that of traditional web services.

## Resource Characteristics of LLM Inference and Limitations of Traditional Routing

LLM inference consumes resources very differently from a traditional web service. Web load balancing relies on uniform metrics such as CPU and memory usage. In LLM inference, resource consumption is determined by the model's parameter count (a static GPU memory cost) and by sequence length (a dynamic KV cache cost), and the KV cache keeps growing while tokens are generated. Round-robin and least-connection strategies see none of this, so they can easily overload some GPUs while leaving others idle.

## Core Ideas and Technology Selection of mini-llm-d

The project's core hypothesis is that throughput can be maximized by routing each request according to its context length and the KV cache status of the backend instances. Go was chosen for its concurrency model, strong standard library, easy deployment, and balance between development efficiency and runtime performance. The project also serves as practice for the author in learning Go syntax and Layer 7 routing.
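The idea of inspecting a request at Layer 7 before choosing a backend can be sketched with Go's standard `net/http/httputil` reverse proxy. This is a minimal illustration, not the project's code: `ChoosePool`, `NewLengthRouter`, the `prompt` JSON field, the pool names `short` and `long`, and the byte-length threshold are all assumptions made for the example, and prompt length in bytes is only a rough stand-in for token count.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// completionRequest models only the field this sketch inspects.
// The "prompt" field name is an assumption, not a confirmed API.
type completionRequest struct {
	Prompt string `json:"prompt"`
}

// ChoosePool decides which (hypothetical) backend pool should serve a
// request, using prompt length in bytes as a crude proxy for context length.
func ChoosePool(body []byte, threshold int) (string, error) {
	var req completionRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", fmt.Errorf("decode request: %w", err)
	}
	if len(req.Prompt) > threshold {
		return "long", nil
	}
	return "short", nil
}

// NewLengthRouter returns a handler that reads the request body, picks a
// pool with ChoosePool, and forwards the request to that pool's target.
func NewLengthRouter(pools map[string]*url.URL, threshold int) http.Handler {
	proxies := make(map[string]*httputil.ReverseProxy, len(pools))
	for name, target := range pools {
		proxies[name] = httputil.NewSingleHostReverseProxy(target)
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		// Restore the body so the proxy forwards it unchanged.
		r.Body = io.NopCloser(bytes.NewReader(body))
		pool, err := ChoosePool(body, threshold)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		proxy, ok := proxies[pool]
		if !ok {
			http.Error(w, "no backend for pool "+pool, http.StatusBadGateway)
			return
		}
		proxy.ServeHTTP(w, r)
	})
}

func main() {
	// Show the routing decision only; serving would look like
	// http.ListenAndServe(addr, NewLengthRouter(pools, 64)) with real targets.
	bodies := []string{
		`{"prompt":"hi"}`,
		`{"prompt":"` + strings.Repeat("x", 200) + `"}`,
	}
	for _, b := range bodies {
		pool, err := ChoosePool([]byte(b), 64)
		fmt.Println(pool, err)
	}
}
```

Keeping the decision (`ChoosePool`) separate from the forwarding (`NewLengthRouter`) makes the routing policy testable without opening any sockets.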

## KV Cache: The Hidden Bottleneck of LLM Inference

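The KV cache stores the key and value tensors of already-processed tokens so that Transformer self-attention does not recompute them at every decoding step. Its size is proportional to sequence length and to the model's dimensions. Per sequence, the size is 2 × L × H × D × N × sizeof(dtype), where L is the number of layers, H the number of key/value heads, D the dimension per head, N the sequence length, and the factor 2 accounts for keys and values. For a model with the shape of Llama 3 8B (L = 32, D = 128, FP16), counting all 32 attention heads gives about 4 GB of cache at an 8K context and about 64 GB at 128K. With the 8 key/value heads that Llama 3 8B actually uses under grouped-query attention, the same formula gives about 1 GB and 16 GB. Either way the cache grows linearly with context length and can rival or exceed the memory taken by the model weights, which is the core problem the project tries to optimize around.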
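The formula above can be evaluated directly. The following is a small sketch in which `ModelShape` and `KVCacheBytes` are hypothetical names; the shape values (32 layers, head dimension 128, 2-byte FP16 elements) are assumptions for an 8B-class model, shown once with 32 key/value heads and once with 8, as under grouped-query attention.

```go
package main

import "fmt"

// ModelShape holds the parameters used by the KV cache size formula.
// The type and field names are illustrative, not taken from the project.
type ModelShape struct {
	Layers       int // L: number of Transformer layers
	KVHeads      int // H: number of key/value heads
	HeadDim      int // D: dimension per head
	BytesPerElem int // sizeof(dtype), e.g. 2 for FP16
}

// KVCacheBytes returns 2 * L * H * D * N * sizeof(dtype) for a single
// sequence of seqLen tokens. The factor 2 covers keys and values.
func KVCacheBytes(m ModelShape, seqLen int) int64 {
	return 2 * int64(m.Layers) * int64(m.KVHeads) * int64(m.HeadDim) *
		int64(seqLen) * int64(m.BytesPerElem)
}

func main() {
	const gib = 1 << 30
	allHeads := ModelShape{Layers: 32, KVHeads: 32, HeadDim: 128, BytesPerElem: 2}
	grouped := ModelShape{Layers: 32, KVHeads: 8, HeadDim: 128, BytesPerElem: 2}
	for _, n := range []int{8 * 1024, 128 * 1024} {
		fmt.Printf("N=%d: H=32 -> %.0f GiB, H=8 -> %.0f GiB\n",
			n,
			float64(KVCacheBytes(allHeads, n))/gib,
			float64(KVCacheBytes(grouped, n))/gib)
	}
}
```

Under these assumptions the program reports 4 GiB and 1 GiB at 8K tokens, and 64 GiB and 16 GiB at 128K tokens, which shows how strongly the number of key/value heads affects the estimate.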

## Design Space of Intelligent Routing Strategies

The project explores several routing strategies: KV cache-based prediction (estimating a request's demand from its input length), dynamic load tracking (monitoring the KV cache usage of each instance), request feature classification (assigning requests to instance groups by type), and hybrid strategies (combining queue length, predicted demand, and other signals). The author calls them "(un)intelligent", acknowledging that they are heuristics while still distinguishing them from traditional load balancing.
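A hybrid strategy of the kind described above might look like the following sketch. It is one possible heuristic, not the project's actual algorithm: the `Backend` fields, `PredictTokens`, `Pick`, the scoring formula, and the `queuePenalty` weight are all assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend is a hypothetical snapshot of one inference instance.
type Backend struct {
	Name       string
	KVCapacity int64 // tokens of KV cache the instance can hold
	KVUsed     int64 // tokens currently cached
	QueueLen   int   // requests waiting on the instance
}

// ErrNoCapacity is returned when no backend can fit the predicted demand.
var ErrNoCapacity = errors.New("no backend has enough free KV cache")

// PredictTokens estimates KV demand as prompt tokens plus the generation
// budget. This is an upper bound: real generations may stop earlier.
func PredictTokens(promptTokens, maxNewTokens int) int64 {
	return int64(promptTokens) + int64(maxNewTokens)
}

// Pick returns the backend with the lowest score among those that can fit
// the demand, where
// score = projected KV utilization + queuePenalty * queue length.
func Pick(backends []Backend, demand int64, queuePenalty float64) (Backend, error) {
	best := -1
	bestScore := 0.0
	for i, b := range backends {
		if b.KVCapacity <= 0 || b.KVCapacity-b.KVUsed < demand {
			continue // skip misconfigured or full instances
		}
		score := float64(b.KVUsed+demand)/float64(b.KVCapacity) +
			queuePenalty*float64(b.QueueLen)
		if best == -1 || score < bestScore {
			best, bestScore = i, score
		}
	}
	if best == -1 {
		return Backend{}, ErrNoCapacity
	}
	return backends[best], nil
}

func main() {
	backends := []Backend{
		{Name: "gpu-0", KVCapacity: 100000, KVUsed: 90000, QueueLen: 0},
		{Name: "gpu-1", KVCapacity: 100000, KVUsed: 20000, QueueLen: 4},
		{Name: "gpu-2", KVCapacity: 100000, KVUsed: 40000, QueueLen: 1},
	}
	demand := PredictTokens(6000, 2000)
	b, err := Pick(backends, demand, 0.05)
	fmt.Println(b.Name, err)
}
```

The filter on free capacity corresponds to KV cache-based prediction, the `KVUsed` snapshot to dynamic load tracking, and the queue term to the hybrid element. The weight between the two terms is a tuning knob with no principled default.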

## Project Limitations and Future Optimization Directions

As a learning project, mini-llm-d has clear limitations: its KV demand prediction ignores the uncertainty of generation length, state synchronization delays can make routing decisions stale, instance state is hard to evaluate at cold start, complex scheduling such as priorities and SLAs is not supported, and heterogeneous models are not handled. These are all challenges that a production-grade LLM gateway needs to address.

## Learning Value and Expansion Possibilities

For developers, the project is useful for learning how to build high-performance proxies in Go, how the KV cache shapes LLM resource usage, and how to implement Layer 7 routing. Possible extensions include integrating inference engines such as vLLM, adding more sophisticated scheduling algorithms (for example shortest job first), Prometheus monitoring, multi-model routing, and request-level caching.
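As an illustration of the shortest-job-first extension, the sketch below orders pending requests by predicted token count using the standard `container/heap` package. The names `Job`, `SJFQueue`, `Enqueue`, and `Dequeue` are hypothetical, and using a predicted token count as the job length is an assumption; ties are broken by arrival order.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Job is a pending request with a predicted cost in tokens (illustrative).
type Job struct {
	ID              string
	PredictedTokens int
	seq             int // arrival order, used to break ties
}

// jobHeap implements heap.Interface as a min-heap on PredictedTokens.
type jobHeap []Job

func (h jobHeap) Len() int { return len(h) }
func (h jobHeap) Less(i, j int) bool {
	if h[i].PredictedTokens != h[j].PredictedTokens {
		return h[i].PredictedTokens < h[j].PredictedTokens
	}
	return h[i].seq < h[j].seq
}
func (h jobHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *jobHeap) Push(x any)   { *h = append(*h, x.(Job)) }
func (h *jobHeap) Pop() any {
	old := *h
	n := len(old)
	j := old[n-1]
	*h = old[:n-1]
	return j
}

// SJFQueue hands out the job with the smallest predicted cost first.
// It is not safe for concurrent use without external locking.
type SJFQueue struct {
	h    jobHeap
	next int
}

// Enqueue adds a job with the given predicted token count.
func (q *SJFQueue) Enqueue(id string, predictedTokens int) {
	heap.Push(&q.h, Job{ID: id, PredictedTokens: predictedTokens, seq: q.next})
	q.next++
}

// Dequeue removes and returns the shortest job, or false if the queue is empty.
func (q *SJFQueue) Dequeue() (Job, bool) {
	if q.h.Len() == 0 {
		return Job{}, false
	}
	return heap.Pop(&q.h).(Job), true
}

func main() {
	q := &SJFQueue{}
	q.Enqueue("a", 900)
	q.Enqueue("b", 100)
	q.Enqueue("c", 100)
	for {
		j, ok := q.Dequeue()
		if !ok {
			break
		}
		fmt.Println(j.ID, j.PredictedTokens)
	}
}
```

Pure shortest-job-first can starve long requests under sustained load, so a real gateway would likely combine it with aging or a maximum wait time.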
