Zing Forum

Dual-Pool Token Budget Routing: A Production-Grade LLM Service Solution Saving 42% GPU Costs

Microsoft proposes the Dual-Pool Token Budget Routing mechanism, which intelligently distributes requests to a short-context high-throughput pool and a long-context high-capacity pool, achieving an annual GPU cost saving of $2.86 million.

Tags: LLM serving · cost optimization · request routing · GPU utilization · token budget · dual-pool architecture · vLLM
Published 2026-04-09 18:47 · Recent activity 2026-04-10 12:49 · Estimated read 5 min
Section 01

[Introduction] Dual-Pool Token Budget Routing: A Cost Optimization Solution for Production-Grade LLM Services

Microsoft proposes the Dual-Pool Token Budget Routing mechanism, which intelligently distributes requests to a short-context high-throughput pool and a long-context high-capacity pool. This solves the resource waste problem caused by "one-size-fits-all" configurations in production LLM services, achieving a 31-42% GPU cost reduction (equivalent to $2.86 million annually) and a significant improvement in reliability.

Section 02

Configuration Dilemma in Production LLM Services

Current inference systems like vLLM use "one-size-fits-all" configurations (provisioned for the worst-case long context), but in reality, 80-95% of requests are short-context (<2K tokens), leading to three types of losses: throughput capacity waste (4-8x), reliability issues (OOM crashes, request preemption), and cost surges.
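The 4-8x throughput loss follows from simple KV-cache arithmetic. A minimal sketch with illustrative numbers (the capacity and context sizes below are assumptions for the calculation, not figures from the paper): if every slot is reserved for the worst-case context, concurrency is capped far below what short requests actually need.

```python
# Hypothetical KV-cache capacity of one GPU pool, in tokens (illustrative).
KV_CACHE_TOKENS = 800_000

def max_concurrency(context_tokens: int) -> int:
    """Concurrent requests the pool can hold when every slot
    is sized for `context_tokens`."""
    return KV_CACHE_TOKENS // context_tokens

worst_case = max_concurrency(16_000)  # provisioned for long contexts -> 50
typical    = max_concurrency(2_000)   # what short requests need      -> 400

print(worst_case, typical, typical / worst_case)  # 50 400 8.0
```

With these (assumed) numbers the one-size-fits-all pool gives up an 8x concurrency factor, at the upper end of the 4-8x range cited above.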

Section 03

Core Idea of Dual-Pool Token Budget Routing

Divide the GPU cluster into two specialized pools: a high-throughput short-context pool (optimized for concurrent processing) and a high-capacity long-context pool (for handling long-context requests). The key lies in accurately estimating the total token budget of a request (input prompt + expected output) to enable intelligent routing.
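The routing decision itself can be sketched in a few lines. This is a minimal illustration of the idea, assuming a fixed short/long threshold; the function names and the 2,048-token boundary are assumptions for the example, not the paper's implementation.

```python
# Boundary between the short-context and long-context pools (assumed value).
SHORT_BUDGET_THRESHOLD = 2_048

def estimate_budget(prompt_tokens: int, expected_output_tokens: int) -> int:
    """Total token budget = input prompt + expected output."""
    return prompt_tokens + expected_output_tokens

def route(prompt_tokens: int, expected_output_tokens: int) -> str:
    """Constant-time routing: compare the budget against the threshold."""
    budget = estimate_budget(prompt_tokens, expected_output_tokens)
    return "short_pool" if budget <= SHORT_BUDGET_THRESHOLD else "long_pool"

print(route(500, 300))      # short_pool
print(route(6_000, 2_000))  # long_pool
```

Because the decision is a single comparison per request, the router adds O(1) overhead regardless of cluster size.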

Section 04

Token Budget Estimation Method Using Online Learning

A tokenizer-free online-learning method is used: (1) byte-based token estimation (analyzing byte-to-token conversion ratios); (2) exponential-moving-average learning (dynamically updating the ratios to adapt to load shifts); (3) category-aware granularity (learning a separate ratio for each request category).
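The three ingredients above combine naturally into one estimator. A sketch, assuming a bytes-per-token ratio maintained per category via an exponential moving average; the class name, default ratio, and smoothing factor are illustrative assumptions, not the paper's values.

```python
class TokenBudgetEstimator:
    """Tokenizer-free token estimation: learn bytes-per-token ratios
    online, one EMA per request category."""

    def __init__(self, alpha: float = 0.1, default_ratio: float = 4.0):
        self.alpha = alpha                  # EMA smoothing factor (assumed)
        self.default_ratio = default_ratio  # prior bytes-per-token (assumed)
        self.ratios: dict[str, float] = {}  # learned ratio per category

    def estimate(self, prompt: str, category: str = "default") -> int:
        """Estimate token count from the UTF-8 byte length, no tokenizer."""
        ratio = self.ratios.get(category, self.default_ratio)
        return max(1, round(len(prompt.encode("utf-8")) / ratio))

    def update(self, prompt: str, true_tokens: int,
               category: str = "default") -> None:
        """After serving, fold the observed ratio into the category's EMA."""
        observed = len(prompt.encode("utf-8")) / max(1, true_tokens)
        prev = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * prev + self.alpha * observed
```

Usage: call `estimate` at routing time, then `update` once the true token count is known, so each category's ratio tracks its live traffic.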

Section 05

Experimental Validation and Benefit Results

Validated on real traces (Azure LLM, LMSYS-Chat-1M): GPU hours reduced by 31-42% (an annual saving of $2.86 million); preemption rate cut 5.4x; P99 time-to-first-token improved by 6%. For a large-scale scenario (Qwen3-235B on MI300X at 10,000 requests/second), the projected annual saving is $15.4 million.

Section 06

Technical Features and Advantages of Dual-Pool Routing

Technical advantages include: O(1) distribution overhead (no bottlenecks), automatic adaptation to heterogeneous workloads, seamless integration with existing optimizations (e.g., PagedAttention), and no need to modify models or frameworks (pure infrastructure optimization).

Section 07

Implications for LLM Service Architecture

Implications include: emphasizing request heterogeneity (to avoid resource waste), the value of online learning (adapting to dynamic loads), layered optimization strategies (global optimization of routing layer + service layer), and cost-conscious design (taking cost-effectiveness as a core consideration).

Section 08

Limitations and Future Directions

The current limitation is the binary short/long division of requests. Future directions include multi-level pool designs, richer prediction models (content-based deep estimation of output length), and adaptation to growing model scales and new hardware platforms.