Zing Forum

Meridian: A Phase-Aware vLLM Scheduler for Reasoning Models

Meridian schedules LLM serving more efficiently by distinguishing the 'thinking phase' from the 'output phase' of reasoning models, which improves response speed in the output phase.

vLLM, reasoning models, LLM scheduling, KV cache, DeepSeek-R1, Qwen3, entropy optimization, CUDA
Published 2026-05-21 12:43 | Recent activity 2026-05-21 12:55 | Estimated read 5 min

Section 01

Meridian: Core Guide to the Phase-Aware vLLM Scheduler

Meridian is a vLLM scheduling layer designed for reasoning models. It distinguishes between the 'thinking phase' and the 'output phase' of a request and applies a different serving strategy to each, which improves response speed in the output phase while maintaining throughput in the thinking phase. Its core innovation is the phase-aware scheduling mechanism, which addresses the output latency that arises when a traditional continuous-batching scheduler treats both phases equally.

Section 02

Unique Challenges in Reasoning Model Scheduling and Limitations of Traditional Solutions

With the popularity of reasoning models such as DeepSeek-R1 and Qwen3, LLM serving now exhibits a two-phase structure: user input → thinking phase (internal reasoning tokens that are invisible to users, tolerant of latency, and throughput-oriented) → output phase (tokens that are visible to users, with minimal tolerance for latency, and latency-oriented). Traditional schedulers treat both phases equally, with the same priority queue and the same latency targets, so output-phase tokens wait behind large thinking-phase batches and user-visible latency suffers.
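To make the two-phase structure concrete, here is a minimal sketch of how a request's phase could be tracked from its streamed text. It assumes the model closes its reasoning with a `</think>` marker, as DeepSeek-R1 and Qwen3 do in their default output format; `Phase` and `PhaseTracker` are hypothetical names, and this summary does not say how Meridian itself detects the boundary.

```python
from enum import Enum


class Phase(Enum):
    THINKING = "thinking"
    OUTPUT = "output"


class PhaseTracker:
    """Tracks the phase of one streaming request (illustrative only).

    Assumption: reasoning ends when the text "</think>" has been emitted.
    """

    def __init__(self, end_marker: str = "</think>") -> None:
        self.end_marker = end_marker
        self.phase = Phase.THINKING
        self._tail = ""  # last few characters, in case the marker is split

    def observe(self, text_piece: str) -> Phase:
        """Feed one decoded text piece; return the phase after seeing it."""
        if self.phase is Phase.THINKING:
            window = self._tail + text_piece
            if self.end_marker in window:
                self.phase = Phase.OUTPUT
                self._tail = ""
            else:
                # Keep just enough text to detect a marker split across pieces.
                keep = len(self.end_marker) - 1
                self._tail = window[-keep:] if keep > 0 else ""
        return self.phase


tracker = PhaseTracker()
phases = [tracker.observe(p) for p in ["<think>", "2+2", "=4", "</th", "ink>", "4"]]
```

A scheduler could consult such a tracker once per decoding step and move the request from a throughput-oriented queue to a latency-oriented one as soon as the phase flips.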

Section 03

Core Design of Meridian: Dual Queues, Phase-Aware Cache, and Entropy Optimization

Meridian's core design has three parts: 1. Dual-queue scheduling: an output-phase queue (highest priority, strict TTOT target) and a thinking-phase queue (loose TPOT target, 2.5x batch processing budget); 2. Phase-aware KV cache eviction: blocks are prioritized for eviction in the order ThinkComplete → ThinkActive → OutputCritical; 3. Entropy-based budget control: EAT (convergence detection) and RPDI (reasoning sufficiency judgment) signals are combined to decide when the thinking phase can be terminated.
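The first two design points can be sketched in a few lines. This is not Meridian's code: `DualQueueScheduler`, `Request`, `BlockState`, and `eviction_order` are hypothetical names, and two readings are assumed. First, the "2.5x batch processing budget" is taken to mean that thinking-phase requests may fill a per-step token budget 2.5 times larger than the one applied to output-phase requests. Second, the eviction order is taken to mean that ThinkComplete blocks are evicted first and OutputCritical blocks last.

```python
from collections import deque
from dataclasses import dataclass
from enum import IntEnum


class BlockState(IntEnum):
    # Assumed reading: a lower value is evicted earlier.
    THINK_COMPLETE = 0
    THINK_ACTIVE = 1
    OUTPUT_CRITICAL = 2


@dataclass
class Request:
    req_id: str
    phase: str          # "thinking" or "output"
    tokens_needed: int  # tokens this request wants in the next step


class DualQueueScheduler:
    """Two FIFO queues; the output queue is always drained first."""

    def __init__(self, output_budget: int, thinking_multiplier: float = 2.5):
        self.output_q = deque()
        self.thinking_q = deque()
        self.output_budget = output_budget
        # Assumed meaning of the 2.5x budget: a larger token cap per step.
        self.thinking_budget = int(output_budget * thinking_multiplier)

    def add(self, req: Request) -> None:
        queue = self.output_q if req.phase == "output" else self.thinking_q
        queue.append(req)

    def next_batch(self) -> list:
        batch, used = [], 0
        # Output requests are admitted first, under the tighter budget;
        # thinking requests then fill whatever the larger budget leaves.
        for queue, budget in ((self.output_q, self.output_budget),
                              (self.thinking_q, self.thinking_budget)):
            while queue and used + queue[0].tokens_needed <= budget:
                req = queue.popleft()
                batch.append(req)
                used += req.tokens_needed
        return batch


def eviction_order(blocks: dict) -> list:
    """Return block ids sorted so the first entry is evicted first."""
    return sorted(blocks, key=lambda block_id: blocks[block_id])


sched = DualQueueScheduler(output_budget=4)
for r in (Request("t1", "thinking", 6), Request("o1", "output", 2),
          Request("t2", "thinking", 3), Request("o2", "output", 2)):
    sched.add(r)
first = [r.req_id for r in sched.next_batch()]
```

In this toy run the two output requests are scheduled ahead of the thinking request that arrived before them, which is the isolation property the dual-queue design is after. A real scheduler would also need to prevent starvation of the thinking queue, which this sketch does not attempt.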

Section 04

Highlights of Meridian's Technical Implementation

Meridian's technical highlights: 1. Non-intrusive vLLM plugin: it wraps the existing scheduler via property delegation, so the vLLM source code does not need to be modified and the plugin is quick to trial and roll back; 2. Disaggregated KV transfer support: compatible with frameworks such as NIXL and Mooncake; 3. CUDA optimization: the entropy calculation and EAT kernels run on separate secondary CUDA streams. The core logic is written in Rust, with Python bindings provided by PyO3.
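The wrapping approach in the first point can be illustrated with plain Python attribute delegation. The sketch below is an assumption about the general technique, not Meridian's actual plugin: `FifoScheduler` is a hypothetical stand-in for an existing scheduler (it is not vLLM's scheduler interface), and `PhaseAwareScheduler` overrides one method while forwarding every other attribute to the wrapped object.

```python
class FifoScheduler:
    """Hypothetical stand-in for an existing scheduler."""

    def __init__(self):
        self.waiting = []

    def add_request(self, req: dict) -> None:
        self.waiting.append(req)

    def schedule(self) -> list:
        batch, self.waiting = self.waiting, []
        return batch


class PhaseAwareScheduler:
    """Wraps a scheduler and delegates everything it does not override."""

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Only called when normal lookup fails, so methods defined on this
        # class win and everything else falls through to the wrapped object.
        if name == "_inner":  # guard against recursion before __init__ runs
            raise AttributeError(name)
        return getattr(self._inner, name)

    def schedule(self) -> list:
        batch = self._inner.schedule()
        # Stable sort: output-phase requests move to the front, and the
        # original order is preserved within each phase.
        return sorted(batch, key=lambda req: req["phase"] != "output")


wrapped = PhaseAwareScheduler(FifoScheduler())
wrapped.add_request({"id": 1, "phase": "thinking"})  # delegated call
wrapped.add_request({"id": 2, "phase": "output"})    # delegated call
order = [req["id"] for req in wrapped.schedule()]
```

Because the wrapped object is left untouched, rolling back in this sketch amounts to using the inner scheduler directly again.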

Section 05

Applicable Scenarios and Value of Meridian

Meridian is suitable for: 1. High-concurrency inference services (isolating output latency while handling hundreds of requests); 2. Interactive applications (chatbots and similar applications that need fast responses); 3. Cost optimization (batching aggressively in the thinking phase without affecting user experience).

Section 06

Limitations and Positioning of Meridian

Meridian's explicit non-goals: it is not a throughput optimizer, accuracy guarantor, or complete inference engine. It is an optimization tool focused on the scheduling layer, complementary to vLLM.

Section 07

Conclusion: Future Significance of Phase-Aware Scheduling

Meridian represents one direction in which LLM serving architecture is evolving: away from 'one-size-fits-all' batch processing and toward fine-grained, phase-specific scheduling. As reasoning models become mainstream, such optimizations will matter more. Teams running large-scale reasoning model services should consider evaluating Meridian.