Section 01
[Introduction] Dual-Pool Token Budget Routing: A Cost Optimization Solution for Production-Grade LLM Services
Microsoft proposes Dual-Pool Token Budget Routing, a mechanism that routes each request to one of two pools: a short-context, high-throughput pool or a long-context, high-capacity pool. This addresses the resource waste caused by "one-size-fits-all" configurations in production LLM services, achieving a 31-42% reduction in GPU cost (equivalent to $2.86 million annually) and a significant improvement in reliability.
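As a rough illustration of the routing idea, the sketch below assigns a request to a pool based on its estimated token budget (prompt tokens plus the requested output tokens). It is a minimal sketch under stated assumptions, not the mechanism as Microsoft implements it: the 8192-token threshold, the four-characters-per-token heuristic, and the names `Request`, `estimate_token_budget`, and `route` are all hypothetical and do not come from the source.

```python
from dataclasses import dataclass

# Hypothetical values chosen for illustration; the source does not specify them.
SHORT_POOL_MAX_TOKENS = 8192  # assumed largest budget the short-context pool serves
CHARS_PER_TOKEN = 4           # crude heuristic; a real router would use a tokenizer


@dataclass(frozen=True)
class Request:
    prompt: str
    max_output_tokens: int


def estimate_token_budget(req: Request) -> int:
    """Estimate total tokens needed: prompt tokens plus requested output tokens."""
    prompt_tokens = -(-len(req.prompt) // CHARS_PER_TOKEN)  # ceiling division
    return prompt_tokens + req.max_output_tokens


def route(req: Request, short_pool_max: int = SHORT_POOL_MAX_TOKENS) -> str:
    """Return "short" if the estimated budget fits the short-context pool, else "long"."""
    if estimate_token_budget(req) <= short_pool_max:
        return "short"
    return "long"


if __name__ == "__main__":
    print(route(Request(prompt="Summarize this sentence.", max_output_tokens=256)))
    print(route(Request(prompt="x" * 40_000, max_output_tokens=1024)))
```

In this sketch the threshold is the only tuning knob: raising it sends more traffic to the high-throughput pool, while lowering it reserves that pool for the smallest requests.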