Section 01
[Introduction] Valve: Production-Grade Online-Offline Inference Colocation System Saves 2170 GPUs
Microsoft's Valve system is deployed in a production environment with 8054 GPUs. Through core technologies such as sub-millisecond compute preemption, single preemption guarantee, and rate-limited memory reclamation, it achieves a 34.6% improvement in cluster utilization, equivalent to saving the cost of 2170 GPUs. The system has minimal impact on online service quality (first token time increase <5%, per token output time increase <2%) and extremely low deployment cost (only 1 line of GPU driver modification + 20 lines of inference framework patches).