# Dual-Pool Token Budget Routing: A Production-Grade LLM Serving Solution That Saves up to 42% of GPU Costs

> Microsoft proposes the Dual-Pool Token Budget Routing mechanism, which intelligently distributes requests to a short-context high-throughput pool and a long-context high-capacity pool, achieving an annual GPU cost saving of $2.86 million.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-09T10:47:20.000Z
- Last activity: 2026-04-10T04:49:11.720Z
- Popularity: 140.0
- Keywords: LLM serving, cost optimization, request routing, GPU utilization, token budget, dual-pool architecture, vLLM
- Page link: https://www.zingnex.cn/en/forum/thread/42-gpullm
- Canonical: https://www.zingnex.cn/forum/thread/42-gpullm
- Markdown source: floors_fallback

---

## [Introduction] Dual-Pool Token Budget Routing: A Cost Optimization Solution for Production-Grade LLM Services

Microsoft's Dual-Pool Token Budget Routing splits incoming requests between a short-context, high-throughput pool and a long-context, high-capacity pool. This avoids the resource waste that "one-size-fits-all" configurations cause in production LLM services, cutting GPU costs by 31-42% (about $2.86 million per year) and reducing the preemption rate by 5.4x.

## Configuration Dilemma in Production LLM Services

Inference systems such as vLLM are typically deployed with a single "one-size-fits-all" configuration provisioned for the worst-case long context. In practice, 80-95% of requests are short-context (<2K tokens), so this provisioning causes three kinds of loss: wasted throughput capacity (4-8x), reliability problems (OOM crashes, request preemption), and inflated costs.

## Core Idea of Dual-Pool Token Budget Routing

The GPU cluster is divided into two specialized pools: a high-throughput short-context pool, optimized for concurrent processing, and a high-capacity long-context pool that handles long-context requests. Routing depends on accurately estimating each request's total token budget (input prompt plus expected output), which determines the pool the request is sent to.
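
The routing decision described above can be sketched in a few lines. This is a minimal illustration, not the mechanism's actual implementation: the names `PoolConfig`, `SHORT_POOL`, `LONG_POOL`, and `route`, as well as the context limits of 8,192 and 131,072 tokens, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolConfig:
    name: str
    max_context_tokens: int  # largest total budget this pool is provisioned for


# Hypothetical pool limits, chosen only for illustration.
SHORT_POOL = PoolConfig("short-context", 8_192)
LONG_POOL = PoolConfig("long-context", 131_072)


def route(estimated_prompt_tokens: int, max_output_tokens: int) -> PoolConfig:
    """Choose a pool from the request's total token budget."""
    # Total budget = input prompt + expected output, as described in the text.
    budget = estimated_prompt_tokens + max_output_tokens
    if budget > LONG_POOL.max_context_tokens:
        raise ValueError(f"budget of {budget} tokens exceeds every pool")
    # A single comparison, so the cost per request is constant.
    if budget <= SHORT_POOL.max_context_tokens:
        return SHORT_POOL
    return LONG_POOL


print(route(1_500, 512).name)  # small budget: fits the short-context pool
print(route(8_000, 512).name)  # 8,512 tokens: exceeds the short pool's limit
```

Because the decision is one comparison against a fixed threshold, the router adds constant work per request regardless of cluster size.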

## Token Budget Estimation Method Using Online Learning

Token budgets are estimated with an online learning method that does not require a tokenizer:

1. Byte-based token estimation: the token count is derived from the request's byte length using a learned bytes-per-token ratio.
2. Exponential moving average learning: the ratio is updated dynamically so that it adapts to changes in load.
3. Category-aware granularity: a separate ratio is learned for each request category.
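
The three ingredients above can be combined in a small estimator. The sketch below is one possible reading of the method, not the authors' code: the class name `ByteTokenEstimator`, the starting ratio of 4 bytes per token, and the smoothing factor `alpha` are assumptions, and it presumes the serving engine reports the true prompt token count once a request has been processed.

```python
from collections import defaultdict


class ByteTokenEstimator:
    """Tokenizer-free token estimate from byte length, learned per category."""

    def __init__(self, initial_bytes_per_token: float = 4.0, alpha: float = 0.1):
        self.alpha = alpha  # EMA smoothing factor (assumed value)
        # One bytes-per-token ratio per request category.
        self.ratios = defaultdict(lambda: initial_bytes_per_token)

    def estimate(self, prompt: str, category: str = "default") -> int:
        """Estimate prompt tokens without running a tokenizer."""
        n_bytes = len(prompt.encode("utf-8"))
        return max(1, round(n_bytes / self.ratios[category]))

    def update(self, prompt: str, actual_tokens: int, category: str = "default") -> None:
        """Fold an observed bytes-per-token ratio into the running average."""
        if actual_tokens <= 0:
            return  # nothing to learn from an empty observation
        observed = len(prompt.encode("utf-8")) / actual_tokens
        old = self.ratios[category]
        self.ratios[category] = (1 - self.alpha) * old + self.alpha * observed


estimator = ByteTokenEstimator(initial_bytes_per_token=4.0, alpha=0.5)
prompt = "x" * 400
print(estimator.estimate(prompt, "code"))  # uses the initial ratio
estimator.update(prompt, actual_tokens=200, category="code")
print(estimator.estimate(prompt, "code"))  # ratio has moved toward 2 bytes/token
```

Keeping one ratio per category lets, for example, source code and natural-language chat converge to different bytes-per-token values without affecting each other.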

## Experimental Validation and Benefit Results

The approach was validated on real datasets (Azure LLM, LMSYS-Chat-1M). GPU hours fell by 31-42%, an annual saving of $2.86 million. The preemption rate decreased by 5.4x, and P99 time to first token improved by 6%. For a large-scale scenario (Qwen3-235B on MI300X at 10,000 requests per second), the projected annual saving is $15.4 million.

## Technical Features and Advantages of Dual-Pool Routing

The technical advantages are O(1) routing overhead per request (so the router does not become a bottleneck), automatic adaptation to heterogeneous workloads, compatibility with existing optimizations such as PagedAttention, and no changes to models or frameworks: it is a pure infrastructure optimization.

## Implications for LLM Service Architecture

The work suggests four lessons for LLM service architecture: account for request heterogeneity to avoid wasting resources; use online learning to adapt to dynamic loads; optimize in layers, treating the routing layer and the serving layer together; and make cost-effectiveness a core design consideration.

## Limitations and Future Directions

The main current limitation is the binary split of requests by length. Future directions include multi-level pool designs, more sophisticated prediction models (content-based deep estimation), and adaptation to growing model scale and new hardware platforms.
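
As an illustration of what a multi-level pool design might look like, the sketch below generalizes the two-way split to an ordered list of tiers. It is speculative and not part of the proposed mechanism: the tier names, the limits in `TIER_LIMITS`, and the function `route_multilevel` are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical tier limits in tokens, sorted ascending (illustrative values).
TIER_LIMITS = [4_096, 32_768, 131_072]
TIER_NAMES = ["short", "medium", "long"]


def route_multilevel(budget: int) -> str:
    """Return the smallest tier whose limit can hold the token budget."""
    # bisect_left finds the first limit that is >= budget.
    idx = bisect_left(TIER_LIMITS, budget)
    if idx == len(TIER_LIMITS):
        raise ValueError(f"budget of {budget} tokens exceeds every tier")
    return TIER_NAMES[idx]


print(route_multilevel(2_000))   # fits the smallest tier
print(route_multilevel(20_000))  # too large for the first tier, fits the second
```

With k tiers the lookup is a binary search over k limits, so the per-request overhead stays negligible for any realistic number of pools.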
