SlidingServe: SLO-Aware Sliding Window Scheduling System for LLM Online Inference

This article introduces SlidingServe, a scheduling system for LLM online inference that combines a lightweight batch latency predictor, dynamic chunking, and multi-level priority sorting. It raises inference throughput by up to 30% while still meeting service-level objectives (SLOs), and it cuts SLO violation rates by 16%-53% under high load.

Tags: LLM inference, scheduling optimization, SLO guarantees, batching, quality of service, dynamic programming
Published 2026-06-04 17:36 | Recent activity 2026-06-05 14:52 | Estimated read 8 min

Section 01

SlidingServe: Guide to SLO-Aware LLM Inference Scheduling System

Title: SlidingServe: SLO-Aware Sliding Window Scheduling System for LLM Online Inference

Original Author/Team: Paper Author Team (arXiv submission)
Source Platform: arXiv
Original Title: Beyond Greedy Chunking: SLO-Aware Sliding-Window Scheduling for LLM Inference
Original Link: http://arxiv.org/abs/2606.05933v1
Release Time: June 4, 2026

Core Insight: SlidingServe combines a lightweight batch latency predictor, dynamic chunking, and multi-level priority sorting. Together these raise LLM inference throughput by up to 30% while still meeting SLOs, and cut SLO violation rates by 16%-53% under high load.


Section 02

Scheduling Dilemmas of LLM Online Services (Background)

As large language models move into interactive applications, inference scheduling faces three major pain points:

  1. Prediction Difficulty: The decoding time of a batch of requests is hard to estimate accurately, so the scheduler cannot make forward-looking decisions;
  2. Rigid Chunking: Traditional greedy chunking strategies cannot adapt to dynamic loads and tend to either waste resources or violate latency targets;
  3. Conflict Between Fairness and Efficiency: Simple priority strategies struggle to balance guarantees for critical requests against overall system efficiency.

Section 03

Core Architecture of SlidingServe (Methodology)

The core innovation of SlidingServe is its sliding window mechanism, which combines information about the current iteration with information about upcoming iterations. The system consists of four major modules:

  1. Lightweight Batch Latency Predictor: Estimates batch execution time with low overhead from multi-dimensional factors such as KV cache, sequence length, and GPU load;
  2. SlidingChunker Dynamic Chunking: Sizes chunks dynamically based on the urgency of current requests, the next batch of new requests, and GPU status;
  3. Multi-Level Priority Sorter: Sorts requests based on urgency (remaining SLO time), service level, resource requirements, and waiting time;
  4. BatchConstructor Dynamic Programming: Solves for the optimal request set within milliseconds, maximizing the number of requests that meet their SLOs.
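
To make the last two modules concrete, here is a minimal Python sketch of a priority sort followed by a dynamic-programming batch selection. It is an illustration under stated assumptions, not the paper's algorithm: the `Request` fields, the order of the sort keys, the per-iteration `token_budget`, and the `tier_weight` table are all hypothetical. The summary only says that requests are sorted on these four factors and that a dynamic program picks the request set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    rid: str
    remaining_slo_ms: float  # time left before the SLO deadline
    service_level: int       # 0 = highest tier (assumed encoding)
    tokens: int              # tokens the request adds to the batch
    waited_ms: float         # time already spent in the queue


def priority_key(r: Request):
    # Assumed ordering: tier, then urgency, then cost, then longest wait.
    return (r.service_level, r.remaining_slo_ms, r.tokens, -r.waited_ms)


def build_batch(requests, token_budget, tier_weight):
    """0/1 knapsack over the token budget.

    Returns (total_weight, request_ids) maximizing the summed tier weight
    of admitted requests without exceeding token_budget.
    """
    # best[b] = best (weight, ids) achievable with at most b tokens.
    best = [(0, ())] * (token_budget + 1)
    for r in sorted(requests, key=priority_key):
        w = tier_weight[r.service_level]
        # Iterate budgets downward so each request is used at most once.
        for b in range(token_budget, r.tokens - 1, -1):
            prev_w, prev_ids = best[b - r.tokens]
            # Strict '>' keeps the earlier (higher-priority) choice on ties.
            if prev_w + w > best[b][0]:
                best[b] = (prev_w + w, prev_ids + (r.rid,))
    return best[token_budget]


# Made-up queue for illustration only.
queue = [
    Request("a", remaining_slo_ms=50, service_level=0, tokens=300, waited_ms=12),
    Request("d", remaining_slo_ms=80, service_level=0, tokens=400, waited_ms=5),
    Request("b", remaining_slo_ms=30, service_level=1, tokens=200, waited_ms=20),
    Request("c", remaining_slo_ms=120, service_level=1, tokens=200, waited_ms=3),
]
weight, chosen = build_batch(queue, token_budget=512, tier_weight={0: 3, 1: 1})
print(weight, chosen)
```

With equal weights for every tier, picking the cheapest requests first would already maximize the count, so the weighted variant is used here to show where a dynamic program earns its keep. The sort only decides which of several equally good sets is returned.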

Section 04

Experimental Evaluation Results (Evidence)

SlidingServe delivers measurable gains across a range of loads:

  1. Throughput Improvement: Compared with state-of-the-art systems, service capacity increases by up to 30%, so the same hardware supports more concurrent users;
  2. Reduced SLO Violation Rate: Under high load, the SLO violation rate decreases by 16%-53%, making it suitable for strict latency scenarios such as real-time dialogue;
  3. Fine-Grained QoS Support: Can provide differentiated latency guarantees for users of different service levels without sacrificing overall efficiency.
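
For readers who want to reproduce this kind of measurement on their own traces, the sketch below shows one way to compute an SLO violation rate and the reduction between two schedulers. The latency samples are made up for illustration, and the sketch assumes the reported 16%-53% is a relative reduction, which this summary does not state explicitly.

```python
def violation_rate(latencies_ms, slo_ms):
    """Fraction of requests whose latency exceeded the SLO."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    return sum(1 for lat in latencies_ms if lat > slo_ms) / len(latencies_ms)


def relative_reduction(baseline_rate, new_rate):
    """Relative drop in violation rate versus a baseline scheduler."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical latency samples in milliseconds, not data from the paper.
baseline = [180, 220, 250, 190, 300, 210, 170, 260, 240, 195]
candidate = [150, 190, 230, 180, 260, 195, 160, 210, 199, 185]

base_rate = violation_rate(baseline, slo_ms=200)
cand_rate = violation_rate(candidate, slo_ms=200)
print(base_rate, cand_rate, relative_reduction(base_rate, cand_rate))
```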

Section 05

Effectiveness of the Sliding Window Mechanism (Technical Insight)

The key to SlidingServe's success is that it no longer makes each scheduling decision in isolation:

  1. Avoid Short-Sighted Decisions: Takes upcoming iterations into account, so a greedy choice now does not sacrifice long-term efficiency;
  2. Smooth Load Fluctuations: Absorbs sudden bursts of LLM inference load and keeps the system stable;
  3. Optimize Resource Matching: Matches computing resources to request characteristics to reduce resource waste.
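
The difference between a greedy decision and a windowed one can be sketched in a few lines of Python. This is a simplified illustration, not SlidingChunker itself: the linear latency model, its coefficients, the two-iteration window, and every function and parameter name are assumptions made for the example. The greedy chunker fills the current iteration up to its latency budget; the windowed chunker also reserves time so that an urgent request expected in the next iteration can still finish before its deadline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearLatencyModel:
    # Assumed model: latency grows linearly with prefill tokens and decode requests.
    base_ms: float
    ms_per_prefill_token: float
    ms_per_decode_req: float

    def predict(self, prefill_tokens, decode_reqs):
        return (self.base_ms
                + self.ms_per_prefill_token * prefill_tokens
                + self.ms_per_decode_req * decode_reqs)

    def max_prefill_tokens(self, budget_ms, decode_reqs):
        # Largest prefill chunk whose predicted latency fits the budget.
        slack = budget_ms - self.base_ms - self.ms_per_decode_req * decode_reqs
        return max(0, int(slack // self.ms_per_prefill_token))


def greedy_chunk(model, pending_tokens, decode_reqs, iter_budget_ms):
    # Looks only at the current iteration.
    return min(pending_tokens, model.max_prefill_tokens(iter_budget_ms, decode_reqs))


def sliding_window_chunk(model, pending_tokens, decode_reqs, iter_budget_ms,
                         next_urgent_tokens, next_decode_reqs, next_deadline_ms):
    # Time the next iteration needs to serve the upcoming urgent request.
    next_iter_ms = model.predict(next_urgent_tokens, next_decode_reqs)
    # The current iteration must end early enough for the next one to
    # finish before the urgent deadline (measured from now).
    budget_now = min(iter_budget_ms, next_deadline_ms - next_iter_ms)
    return min(pending_tokens, model.max_prefill_tokens(budget_now, decode_reqs))


model = LinearLatencyModel(base_ms=4.0, ms_per_prefill_token=0.0625,
                           ms_per_decode_req=0.5)
print(greedy_chunk(model, 1000, 16, 50.0))
print(sliding_window_chunk(model, 1000, 16, 50.0,
                           next_urgent_tokens=256, next_decode_reqs=16,
                           next_deadline_ms=60.0))
```

With these made-up numbers the greedy chunker schedules 608 prefill tokens and the windowed one 320, trading some immediate progress for the upcoming deadline.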

Section 06

Deployment Practice Insights (Recommendations)

Points to consider when deploying SlidingServe:

  1. Predictor Calibration: The predictor needs to be calibrated based on model, hardware, and load characteristics; the lightweight design supports continuous runtime calibration;
  2. Flexibility in SLO Definition: SLOs can be expressed as end-to-end latency or as per-phase targets; defining multi-level SLOs is recommended to take advantage of the differentiated service capabilities;
  3. System Integration: The modular design allows gradual integration into existing LLM service frameworks, with components that can be introduced independently.
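
As an illustration of what continuous runtime calibration could look like, the following Python sketch fits a one-feature linear latency model from streamed measurements using running sums, so each update is constant time. It is a deliberately reduced stand-in: the actual predictor is described as using several factors (KV cache, sequence length, GPU load), and the class name, the single `batch_tokens` feature, and the `decay` parameter are assumptions of this example.

```python
class OnlineLatencyCalibrator:
    """Fits latency_ms ~ intercept + slope * batch_tokens from a stream.

    decay < 1.0 down-weights old samples so the fit can follow drift in
    hardware or load; decay = 1.0 is ordinary least squares.
    """

    def __init__(self, decay=1.0):
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def observe(self, batch_tokens, latency_ms):
        d = self.decay
        # Exponentially weighted running sums for simple linear regression.
        self.n = d * self.n + 1.0
        self.sx = d * self.sx + batch_tokens
        self.sy = d * self.sy + latency_ms
        self.sxx = d * self.sxx + batch_tokens * batch_tokens
        self.sxy = d * self.sxy + batch_tokens * latency_ms

    def coefficients(self):
        denom = self.n * self.sxx - self.sx * self.sx
        if abs(denom) < 1e-9:
            raise ValueError("need samples with at least two distinct batch sizes")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope

    def predict(self, batch_tokens):
        intercept, slope = self.coefficients()
        return intercept + slope * batch_tokens


calibrator = OnlineLatencyCalibrator()
# Hypothetical (batch_tokens, measured_latency_ms) pairs, not real measurements.
for tokens, latency in [(128, 12.0), (256, 20.0), (512, 36.0), (1024, 68.0)]:
    calibrator.observe(tokens, latency)
print(calibrator.coefficients(), calibrator.predict(2048))
```

In a real deployment the measured latency of each executed batch would be fed back through `observe`, and a separate calibrator would likely be kept per model and hardware type.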

Section 07

Limitations and Future Directions

Directions SlidingServe has yet to explore:

  1. Heterogeneous Hardware Support: Extend to CPU+GPU hybrid architectures or dedicated inference accelerators;
  2. Multi-Model Services: Address the scheduling complexity of serving multiple models of different scales simultaneously;
  3. Online Learning Optimization: Continuously optimize predictors and sorting strategies through online learning to adapt to load changes.

Section 08

Summary

SlidingServe is an important advance in LLM inference scheduling. By combining current and upcoming iteration information through a sliding window mechanism, it raises throughput by up to 30% and lowers SLO violation rates by 16%-53% under high load while keeping strict SLO guarantees. It offers a useful technical reference for teams running LLM services at scale and supports the large-scale deployment of AI infrastructure.