# Valve: Production-Grade Online-Offline Inference Colocation System Saves 2170 GPUs

> Microsoft's Valve system is deployed in a production environment with 8054 GPUs. It achieves a 34.6% improvement in cluster utilization through sub-millisecond compute preemption and rate-limited memory reclamation.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-09T06:45:37.000Z
- Last activity: 2026-04-10T04:47:58.486Z
- Popularity: 118.0
- Keywords: LLM inference, GPU colocation, resource utilization, production deployment, Valve, online-offline, cost optimization
- Page link: https://www.zingnex.cn/en/forum/thread/valve-2170gpu
- Canonical: https://www.zingnex.cn/forum/thread/valve-2170gpu
- Markdown source: floors_fallback

---

## Introduction: Production-Grade Online-Offline Inference Colocation Saves 2170 GPUs

Microsoft's Valve system runs in a production environment with 8054 GPUs. It combines three core techniques: sub-millisecond compute preemption, a single-preemption guarantee, and rate-limited memory reclamation. Together they raise cluster utilization by 34.6%, equivalent to saving the cost of 2170 GPUs. The impact on online service quality is small: time to first token (TTFT) increases by less than 5%, and time per output token (TPOT) by less than 2%. Deployment cost is also low: a one-line GPU driver modification plus 20 lines of inference framework patches.

## Background: Resource Dilemma of Large Model Inference and Colocation Challenges

### Resource Dilemma of Large Model Inference

Large language model (LLM) inference services back latency-sensitive applications, so they are typically over-provisioned. The result is low resource utilization and substantial idle GPU capacity during off-peak periods. Online-offline colocation, which runs offline jobs on the capacity that online services leave idle, is one way to address this, but production deployment faces two challenges.

### Dual Challenges of Production Deployment

**Challenge 1: Online interference**

Offline tasks compete with online requests for compute resources, which increases online latency. Existing preemption mechanisms either take too long to preempt or preempt too often.

**Challenge 2: Deployment complexity**

Existing approaches require extensive modifications to GPU drivers and inference frameworks, which makes them costly to maintain and risky to deploy.

## Core Methods and Technical Architecture of Valve

### Valve: A Colocation Solution Prioritizing Practicality

Valve's design philosophy is "maximize benefit with minimal invasiveness". Its core innovations are:

- Sub-millisecond compute preemption: offline tasks are paused in under a millisecond when an online request arrives
- Single-preemption guarantee: each online request triggers at most one preemption, which avoids frequent switching
- Rate-limited memory reclamation: memory is reclaimed progressively to avoid sudden latency spikes
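
The single-preemption guarantee can be illustrated with a toy state machine. The sketch below is not Valve's implementation, which the source does not describe at this level. It assumes a simple policy in which offline work stays paused for as long as any online request is in flight, and the class `ColocatedGpu` and its methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ColocatedGpu:
    """Toy model of one GPU shared by online and offline work (hypothetical)."""

    active_online: int = 0        # online requests currently in flight
    offline_paused: bool = False  # whether offline work is currently paused
    preemptions: int = 0          # how many times offline work was paused

    def online_arrives(self) -> None:
        # Pause offline work only if it is still running. A request that
        # arrives while offline work is already paused costs no preemption.
        if not self.offline_paused:
            self.offline_paused = True
            self.preemptions += 1
        self.active_online += 1

    def online_finishes(self) -> None:
        self.active_online -= 1
        # Resume offline work only once no online request remains in flight,
        # so no request ever sees offline work resume and get paused again.
        if self.active_online == 0:
            self.offline_paused = False


gpu = ColocatedGpu()
gpu.online_arrives()   # request A: pauses offline work
gpu.online_arrives()   # request B overlaps A: no new preemption
gpu.online_finishes()  # A done, B still running: offline stays paused
gpu.online_finishes()  # B done: offline work resumes
```

Under this assumed policy, two overlapping requests cost a single preemption in total, and each individual request is associated with at most one.
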
### Technical Implementation Architecture

- Channel-controlled compute isolation: hardware-level isolation enables microsecond-level preemption
- Page-fault-free memory reclamation: a pre-allocated pool combined with an incremental strategy reduces overhead
- Dynamic memory reservation: the reserved amount is adjusted dynamically to balance demand against waste
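
To make the idea of rate-limited reclamation from a pre-allocated pool concrete, here is a minimal sketch. It is an illustration under stated assumptions, not Valve's code: memory is modeled as fixed-size blocks, `BlockPool` and `max_reclaim_per_step` are hypothetical names, and a "step" stands in for whatever time slice a real system would use.

```python
from collections import deque


class BlockPool:
    """Pre-allocated pool of fixed-size blocks lent to offline work (toy model)."""

    def __init__(self, total_blocks: int, max_reclaim_per_step: int) -> None:
        if max_reclaim_per_step <= 0:
            raise ValueError("max_reclaim_per_step must be positive")
        self.free = deque(range(total_blocks))  # blocks available to online work
        self.lent = deque()                     # blocks held by offline work
        self.max_reclaim_per_step = max_reclaim_per_step

    def lend_to_offline(self, n: int) -> int:
        """Lend up to n free blocks to offline work; return how many were lent."""
        n = min(n, len(self.free))
        for _ in range(n):
            self.lent.append(self.free.popleft())
        return n

    def reclaim_step(self, wanted: int) -> int:
        """Take back at most max_reclaim_per_step blocks in one step."""
        n = min(wanted, self.max_reclaim_per_step, len(self.lent))
        for _ in range(n):
            self.free.append(self.lent.popleft())
        return n

    def reclaim(self, wanted: int) -> list[int]:
        """Reclaim `wanted` blocks over several steps; return per-step counts."""
        steps = []
        while wanted > 0 and self.lent:
            got = self.reclaim_step(wanted)
            steps.append(got)
            wanted -= got
        return steps


pool = BlockPool(total_blocks=10, max_reclaim_per_step=3)
pool.lend_to_offline(8)
steps = pool.reclaim(7)  # spread over several capped steps
```

Because blocks only move between two queues inside an already allocated pool, reclaiming them in this model never triggers a fresh allocation, and the per-step cap bounds how much work any single step can add.
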
### Minimal Deployment Cost

Valve requires only a one-line GPU driver modification plus 20 lines of inference framework patches. This low invasiveness makes it easy to integrate, cheap to maintain, and limited in risk.

## Production Verification: 34.6% Utilization Improvement and 2170 GPU Savings

Valve was validated in a production environment with 8054 GPUs:

- Cluster utilization improved by 34.6%, equivalent to saving the cost of 2170 GPUs
- Minimal impact on online services: time to first token (TTFT) increased by less than 5%, and time per output token (TPOT) by less than 2%
- Stable across workloads: performance was consistent for short- and long-text tasks and across off-peak and peak periods
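
The latency bounds above can be read as budgets on relative increase. The sketch below shows one way to check a pair of measurements against them. The function names are hypothetical, and the millisecond values are made up for illustration; they are not measurements from the Valve deployment.

```python
def relative_increase(baseline_ms: float, colocated_ms: float) -> float:
    """Fractional latency increase of the colocated run over the baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (colocated_ms - baseline_ms) / baseline_ms


# Budgets taken from the reported results: TTFT < 5%, TPOT < 2%.
TTFT_BUDGET = 0.05
TPOT_BUDGET = 0.02


def within_budget(base_ttft: float, colo_ttft: float,
                  base_tpot: float, colo_tpot: float) -> bool:
    """True if both TTFT and TPOT increases stay under their budgets."""
    return (relative_increase(base_ttft, colo_ttft) < TTFT_BUDGET
            and relative_increase(base_tpot, colo_tpot) < TPOT_BUDGET)


# Made-up measurements in milliseconds, for illustration only.
ok = within_budget(200.0, 208.0, 25.0, 25.4)        # +4% TTFT, +1.6% TPOT
too_slow = within_budget(200.0, 215.0, 25.0, 25.4)  # +7.5% TTFT
```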

## Conclusion and Industry Insights: Colocation Feasibility and Engineering Pragmatism

### Industry Insights

1. Efficient colocation in production is feasible; the key is choosing the right technical abstractions, such as sub-millisecond preemption
2. Deployability matters: a minimal-modification strategy balances technical sophistication with engineering practicality
3. Hardware-software co-design pays off: high performance comes from a deep understanding of GPU architecture

### Conclusion

Valve offers an effective approach to reducing LLM inference cost. The 2170 GPUs saved also point to its value for sustainability, and such savings will matter more as LLM adoption grows.

## Limitations and Future Directions: Hardware Adaptation and Emerging Scenario Optimization

### Limitations

- Valve currently targets mainly NVIDIA GPUs; other hardware, such as AMD GPUs and Intel accelerators, requires additional adaptation
- Online tasks may still be affected under extreme memory pressure

### Future Directions

- Explore more intelligent memory prediction and pre-allocation strategies
- Optimize specifically for multimodal models and agent systems
- Extend support to more hardware platforms
