Zing Forum

Dual-Pool Token Budget Routing: A Production-Grade LLM Service Solution Saving 42% GPU Costs

Microsoft proposes the Dual-Pool Token Budget Routing mechanism, which intelligently distributes requests to a short-context high-throughput pool and a long-context high-capacity pool, achieving an annual GPU cost saving of $2.86 million.

Tags: LLM serving · cost optimization · request routing · GPU utilization · token budget · dual-pool architecture · vLLM
Published 2026-04-09 18:47 · Recent activity 2026-04-10 12:49 · Estimated read 5 min
Section 01

[Introduction] Dual-Pool Token Budget Routing: A Cost Optimization Solution for Production-Grade LLM Services

Microsoft proposes the Dual-Pool Token Budget Routing mechanism, which intelligently distributes requests to a short-context high-throughput pool and a long-context high-capacity pool. This solves the resource waste problem caused by "one-size-fits-all" configurations in production LLM services, achieving a 31-42% GPU cost reduction (equivalent to $2.86 million annually) and a significant improvement in reliability.

Section 02

Configuration Dilemma in Production LLM Services

Current inference systems like vLLM use "one-size-fits-all" configurations (provisioned for the worst-case long context), but in reality, 80-95% of requests are short-context (<2K tokens), leading to three types of losses: throughput capacity waste (4-8x), reliability issues (OOM crashes, request preemption), and cost surges.
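The 4-8x throughput loss follows from simple KV-cache arithmetic. A minimal sketch with illustrative numbers (the capacity and context sizes below are assumptions for the calculation, not figures from the paper): if every slot is reserved for the worst-case context, concurrency is capped far below what short requests actually need.

```python
# Hypothetical KV-cache capacity of one GPU pool, in tokens (illustrative).
KV_CACHE_TOKENS = 800_000

def max_concurrency(context_tokens: int) -> int:
    """Concurrent requests the pool can hold when every slot
    is sized for `context_tokens`."""
    return KV_CACHE_TOKENS // context_tokens

worst_case = max_concurrency(16_000)  # provisioned for long contexts -> 50
typical    = max_concurrency(2_000)   # what short requests need      -> 400

print(worst_case, typical, typical / worst_case)  # 50 400 8.0
```

With these (assumed) numbers the one-size-fits-all pool gives up an 8x concurrency factor, at the upper end of the 4-8x range cited above.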

Section 03

Core Idea of Dual-Pool Token Budget Routing

Divide the GPU cluster into two specialized pools: a high-throughput short-context pool (optimized for concurrent processing) and a high-capacity long-context pool (for handling long-context requests). The key lies in accurately estimating the total token budget of a request (input prompt + expected output) to enable intelligent routing.
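The routing decision itself can be sketched in a few lines. This is a minimal illustration of the idea, assuming a fixed short/long threshold; the function names and the 2,048-token boundary are assumptions for the example, not the paper's implementation.

```python
# Boundary between the short-context and long-context pools (assumed value).
SHORT_BUDGET_THRESHOLD = 2_048

def estimate_budget(prompt_tokens: int, expected_output_tokens: int) -> int:
    """Total token budget = input prompt + expected output."""
    return prompt_tokens + expected_output_tokens

def route(prompt_tokens: int, expected_output_tokens: int) -> str:
    """Constant-time routing: compare the budget against the threshold."""
    budget = estimate_budget(prompt_tokens, expected_output_tokens)
    return "short_pool" if budget <= SHORT_BUDGET_THRESHOLD else "long_pool"

print(route(500, 300))      # short_pool
print(route(6_000, 2_000))  # long_pool
```

Because the decision is a single comparison per request, the router adds O(1) overhead regardless of cluster size.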

Section 04

Token Budget Estimation Method Using Online Learning

A tokenizer-free online-learning method is used: (1) byte-based token estimation (analyzing byte-to-token conversion ratios); (2) exponential-moving-average learning (dynamically updating the ratios to adapt to load shifts); (3) category-aware granularity (learning a separate ratio for each request category).
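The three ingredients above combine naturally into one estimator. A sketch, assuming a bytes-per-token ratio maintained per category via an exponential moving average; the class name, default ratio, and smoothing factor are illustrative assumptions, not the paper's values.

```python
class TokenBudgetEstimator:
    """Tokenizer-free token estimation: learn bytes-per-token ratios
    online, one EMA per request category."""

    def __init__(self, alpha: float = 0.1, default_ratio: float = 4.0):
        self.alpha = alpha                  # EMA smoothing factor (assumed)
        self.default_ratio = default_ratio  # prior bytes-per-token (assumed)
        self.ratios: dict[str, float] = {}  # learned ratio per category

    def estimate(self, prompt: str, category: str = "default") -> int:
        """Estimate token count from the UTF-8 byte length, no tokenizer."""
        ratio = self.ratios.get(category, self.default_ratio)
        return max(1, round(len(prompt.encode("utf-8")) / ratio))

    def update(self, prompt: str, true_tokens: int,
               category: str = "default") -> None:
        """After serving, fold the observed ratio into the category's EMA."""
        observed = len(prompt.encode("utf-8")) / max(1, true_tokens)
        prev = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * prev + self.alpha * observed
```

Usage: call `estimate` at routing time, then `update` once the true token count is known, so each category's ratio tracks its live traffic.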

Section 05

Experimental Validation and Benefit Results

Validated on real traces (Azure LLM, LMSYS-Chat-1M): GPU hours reduced by 31-42% (an annual saving of $2.86 million); preemption rate cut 5.4x; P99 time-to-first-token improved by 6%. For a large-scale scenario (Qwen3-235B on MI300X at 10,000 requests/second), the projected annual saving is $15.4 million.

Section 06

Technical Features and Advantages of Dual-Pool Routing

Technical advantages include: O(1) distribution overhead (no bottlenecks), automatic adaptation to heterogeneous workloads, seamless integration with existing optimizations (e.g., PagedAttention), and no need to modify models or frameworks (pure infrastructure optimization).

Section 07

Implications for LLM Service Architecture

Implications include: emphasizing request heterogeneity (to avoid resource waste), the value of online learning (adapting to dynamic loads), layered optimization strategies (global optimization of routing layer + service layer), and cost-conscious design (taking cost-effectiveness as a core consideration).

Section 08

Limitations and Future Directions

The current limitation is the binary short/long division of requests. Future directions include multi-level pool designs, richer prediction models (content-based deep estimation of output length), and adaptation to growing model scales and new hardware platforms.