In LLM inference services, the Key-Value (KV) Cache is a core mechanism to improve multi-turn conversation performance. When users interact with the model in multiple turns, the system caches previously computed key-value pairs to avoid re-computing them in subsequent requests, thus significantly reducing the Time to First Token (TTFT) latency.
However, in distributed deployment scenarios, traditional load balancing strategies (such as round-robin, random, least connections) often ignore the locality characteristics of KV cache. When multiple workers process requests in parallel, subsequent requests of the same conversation may be routed to different workers, leading to cache invalidation and the need to re-compute prefixes, resulting in severe waste of computing resources.
llm-router was born to solve this problem. It intelligently identifies the prompt prefix of requests and always routes requests with the same prefix to the worker holding the cache, enabling cross-worker KV cache sharing.