Section 01
Core Introduction to the mini-llm-d Project
mini-llm-d is an experimental project written in Go that explores routing strategies for LLM inference requests based on KV cache occupancy patterns. It aims to address key engineering problems in request routing for large language model serving and examines how Layer 7 load balancing applies to AI inference workloads. The project is built around the resource characteristics specific to LLM inference: GPU memory usage is closely tied to sequence length, and the KV cache grows cumulatively as a request progresses. These characteristics call for routing approaches that differ from those used for traditional web services.
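To make the idea of routing on KV cache occupancy concrete, here is a minimal sketch in Go. It is an illustration under stated assumptions, not mini-llm-d's actual implementation: the `Backend` type, its fields, the `PickBackend` function, and the token-based capacity model are all hypothetical names introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend is a hypothetical view of one inference replica.
// Capacity is modelled in tokens, which is an assumption of this sketch.
type Backend struct {
	Name        string
	KVCacheUsed int // tokens currently held in the KV cache
	KVCacheCap  int // total KV cache capacity in tokens
}

// Occupancy returns the fraction of the KV cache in use.
// A backend with zero capacity is treated as full.
func (b Backend) Occupancy() float64 {
	if b.KVCacheCap == 0 {
		return 1.0
	}
	return float64(b.KVCacheUsed) / float64(b.KVCacheCap)
}

// ErrNoCapacity is returned when no backend can hold the request.
var ErrNoCapacity = errors.New("no backend has enough free KV cache")

// PickBackend returns the backend with the lowest KV cache occupancy
// among those with at least estTokens of free cache. estTokens is the
// caller's estimate of prompt plus generated tokens (hypothetical input).
func PickBackend(backends []Backend, estTokens int) (Backend, error) {
	best := -1
	for i, b := range backends {
		if b.KVCacheCap-b.KVCacheUsed < estTokens {
			continue // not enough room for this sequence
		}
		if best == -1 || b.Occupancy() < backends[best].Occupancy() {
			best = i
		}
	}
	if best == -1 {
		return Backend{}, ErrNoCapacity
	}
	return backends[best], nil
}

func main() {
	backends := []Backend{
		{Name: "replica-a", KVCacheUsed: 7000, KVCacheCap: 8000},
		{Name: "replica-b", KVCacheUsed: 2000, KVCacheCap: 8000},
		{Name: "replica-c", KVCacheUsed: 500, KVCacheCap: 1000},
	}
	b, err := PickBackend(backends, 1500)
	if err != nil {
		fmt.Println("routing failed:", err)
		return
	}
	fmt.Println("routed to", b.Name)
}
```

In this example, replica-a and replica-c are skipped because they have less than 1500 tokens of free cache, so the request goes to replica-b. A round-robin or least-connections balancer would not see this difference, since it has no notion of how much memory each in-flight sequence holds.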