Valve: Production-Grade Online-Offline Inference Colocation System Saves 2170 GPUs

Microsoft's Valve system is deployed in a production environment with 8054 GPUs. It achieves a 34.6% improvement in cluster utilization through sub-millisecond compute preemption and rate-limited memory reclamation.

Tags: LLM inference · GPU colocation · resource utilization · production deployment · Valve · online-offline · cost optimization
Published 2026-04-09 14:45 · Recent activity 2026-04-10 12:47 · Estimated read: 6 min

Section 01

[Introduction] Valve: Production-Grade Online-Offline Inference Colocation System Saves 2170 GPUs

Microsoft's Valve system is deployed in a production environment with 8054 GPUs. Through core techniques such as sub-millisecond compute preemption, a single-preemption guarantee, and rate-limited memory reclamation, it achieves a 34.6% improvement in cluster utilization, equivalent to saving the cost of 2170 GPUs. The system has minimal impact on online service quality (time to first token increase <5%, time per output token increase <2%) and extremely low deployment cost: only 1 line of GPU driver modification plus 20 lines of inference framework patches.


Section 02

Background: Resource Dilemma of Large Model Inference and Colocation Challenges

Resource Dilemma of Large Model Inference

Large Language Model (LLM) inference services back latency-sensitive applications, but provisioning for peak demand leaves GPUs largely idle during off-peak periods, resulting in low resource utilization. Colocating online and offline workloads is a promising direction, but production deployment faces two challenges:

Dual Challenges of Production Deployment

  • Challenge 1: Online interference. Offline tasks preempting compute resources increase online latency, and existing preemption mechanisms either take too long to preempt or preempt too frequently.
  • Challenge 2: Deployment complexity. Existing approaches require extensive modifications to GPU drivers and inference frameworks, bringing high maintenance costs and significant risk.


Section 03

Core Methods and Technical Architecture of Valve

Valve: A Colocation Solution Prioritizing Practicality

Valve's design philosophy is "maximum benefit with minimal invasiveness", built on three core innovations:

  • Sub-millisecond compute preemption: pause offline tasks in under a millisecond when an online request arrives
  • Single-preemption guarantee: each online request triggers at most one preemption, avoiding frequent switching
  • Rate-limited memory reclamation: reclaim memory progressively to avoid sudden latency spikes
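The first two innovations can be pictured as a small gate between the online and offline workers. Below is a minimal Python sketch under that reading; all class and method names are hypothetical illustrations, not Valve's actual interfaces. The offline worker checks the gate before each step, so a pause takes effect at the next step boundary, and the single-preemption guarantee is a latch that prevents one request from pausing the offline side twice:

```python
import threading


class PreemptionController:
    """Toy sketch of compute preemption with a single-preemption guarantee.

    All names here are illustrative, not Valve's real API. An online
    request clears the gate (pausing offline work); completing the
    request sets it again. The latch in `on_online_request` ensures
    each request triggers at most one preemption.
    """

    def __init__(self):
        self._offline_may_run = threading.Event()
        self._offline_may_run.set()          # offline runs by default
        self._preempted_requests = set()     # requests that already preempted

    def on_online_request(self, request_id: str) -> bool:
        """Pause offline work for this request; return False if already done."""
        if request_id in self._preempted_requests:
            return False                     # single-preemption guarantee
        self._preempted_requests.add(request_id)
        self._offline_may_run.clear()        # gate closed: offline pauses
        return True

    def on_online_request_done(self, request_id: str) -> None:
        self._offline_may_run.set()          # gate open: offline resumes

    def offline_step(self) -> None:
        """Offline worker calls this before each kernel launch."""
        self._offline_may_run.wait()         # blocks while gate is closed
```

In the real system the "gate" is enforced at the GPU channel level rather than in Python, which is what makes the preemption sub-millisecond; this sketch only captures the control logic.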

Technical Implementation Architecture

  • Channel-controlled compute isolation: Hardware-level isolation for microsecond-level preemption
  • Page-fault-free memory reclamation: Pre-allocation pool + incremental strategy to reduce overhead
  • Dynamic memory reservation: Intelligently adjust reservation amount to balance demand and waste
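Rate-limited reclamation can be illustrated with a per-tick quota: rather than freeing all offline memory at once (which would stall the GPU and spike online latency), the reclaimer releases a bounded amount each scheduler tick until the online side's demand is met. A minimal sketch, with an assumed `rate_bytes_per_tick` knob that is not part of Valve's published interface:

```python
class RateLimitedReclaimer:
    """Toy sketch of progressive, rate-limited memory reclamation.

    `rate_bytes_per_tick` caps how much offline memory is reclaimed per
    scheduler tick, smoothing the cost over several ticks instead of
    paying it in one burst. Illustrative only, not Valve's actual API.
    """

    def __init__(self, rate_bytes_per_tick: int):
        self.rate = rate_bytes_per_tick

    def reclaim_schedule(self, needed_bytes: int):
        """Yield the per-tick reclamation quota until the target is met."""
        remaining = needed_bytes
        while remaining > 0:
            step = min(self.rate, remaining)
            remaining -= step
            yield step
```

For example, with a rate of 4 units per tick and 10 units needed, the schedule is 4, 4, 2: the online request gets its memory within a few ticks while no single tick pays the full cost.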

Minimal Deployment Cost

Deployment requires only 1 line of GPU driver modification plus 20 lines of inference framework patches: extremely low invasiveness, easy integration, low maintenance cost, and controllable risk.


Section 04

Production Verification: 34.6% Utilization Improvement and 2170 GPU Savings

Valve was verified in a production environment with 8054 GPUs:

  • 34.6% improvement in cluster utilization, saving the cost of 2170 GPUs
  • Minimal impact on online services: time to first token (TTFT) increase <5%, time per output token (TPOT) increase <2%
  • Stable across workloads: Consistent performance for short/long text tasks and low/peak periods.

Section 05

Conclusion and Industry Insights: Colocation Feasibility and Engineering Pragmatism

Industry Insights

  1. Efficient colocation in production environments is feasible; the key lies in correct technical abstractions (such as sub-millisecond preemption)
  2. Deployability is important: Minimal modification strategy balances technical advancement and engineering practicality
  3. Hardware-software co-design pays off: a deep understanding of GPU architecture enables high performance with minimal code changes

Conclusion

Valve provides an effective solution for LLM inference cost optimization. The savings of 2170 GPUs represent substantial economic and sustainability value, and such colocation techniques will only grow in importance as LLM adoption expands.


Section 06

Limitations and Future Directions: Hardware Adaptation and Emerging Scenario Optimization

Limitations

  • Currently adapted mainly to NVIDIA GPUs; other hardware (AMD, Intel accelerators) requires additional adaptation
  • Online tasks may still be affected under extreme memory pressure

Future Directions

  • Explore more intelligent memory prediction and pre-allocation strategies
  • Specialized optimization for multimodal models and Agent systems
  • Expand to more hardware platforms