# SlidingServe: SLO-Aware Sliding Window Scheduling System for LLM Online Inference

> This article introduces SlidingServe, a system that combines a lightweight batch latency predictor, dynamic chunking, and multi-level priority sorting to raise LLM inference throughput by up to 30% while preserving service quality, and to cut SLO violation rates by 16%-53% under high load.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-04T09:36:40.000Z
- Last activity: 2026-06-05T06:52:33.915Z
- Popularity: 134.7
- Keywords: LLM inference, scheduling optimization, SLO guarantees, batching, quality of service, dynamic programming
- Page link: https://www.zingnex.cn/en/forum/thread/slidingserve-llmslo
- Canonical: https://www.zingnex.cn/forum/thread/slidingserve-llmslo

---

## SlidingServe: A Guide to the SLO-Aware LLM Inference Scheduling System

- Original Author/Team: Paper Author Team (arXiv submission)
- Source Platform: arXiv
- Original Title: Beyond Greedy Chunking: SLO-Aware Sliding-Window Scheduling for LLM Inference
- Original Link: <http://arxiv.org/abs/2606.05933v1>
- Release Time: June 4, 2026

## Scheduling Dilemmas of LLM Online Services (Background)

As large language models spread across interactive applications, inference scheduling faces three major pain points:
1. **Prediction Difficulty**: The decoding time of a batch of requests is hard to estimate accurately, so scheduling decisions lack foresight;
2. **Rigid Chunking**: Traditional greedy chunking strategies cannot adapt to dynamic loads and easily cause either wasted resources or latency violations;
3. **Conflict Between Fairness and Efficiency**: Simple priority strategies struggle to protect critical requests while keeping overall system efficiency high.

## Core Architecture of SlidingServe (Methodology)

The core innovation of SlidingServe is its sliding window mechanism, which combines information from the current iteration and from upcoming iterations. The system consists of four major modules:
1. **Lightweight Batch Latency Predictor**: Considers multi-dimensional factors such as KV cache, sequence length, and GPU load to estimate batch execution time with low overhead;
2. **SlidingChunker Dynamic Chunking**: Combines the urgency of current requests, the next batch of new requests, and GPU status to achieve dynamic chunking;
3. **Multi-Level Priority Sorter**: Sorts requests based on urgency (remaining SLO time), service level, resource requirements, and waiting time;
4. **BatchConstructor Dynamic Programming**: Solves for the optimal request set within milliseconds, maximizing the number of requests that meet their SLOs.
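
To make modules 3 and 4 more concrete, here is a minimal sketch of how a multi-level sort key and a dynamic-programming batch selection might fit together. It is not the paper's algorithm: the `Request` fields, the tier weights, the ordering of the sort key, and the token-budget (0/1 knapsack) formulation are all assumptions made for illustration, and the predicted batch latency is treated as a fixed input rather than something that depends on the chosen batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    rid: str
    slack_ms: float   # time left before the SLO deadline (hypothetical field)
    tier: int         # service level, 0 = highest (hypothetical field)
    tokens: int       # tokens this request would add to the batch
    waited_ms: float  # time already spent in the queue


def priority_key(r: Request):
    # Lexicographic order: least slack first, then higher tier,
    # then cheaper requests, then the longest wait.
    return (r.slack_ms, r.tier, r.tokens, -r.waited_ms)


def build_batch(requests, token_budget, predicted_ms, tier_weight=(4, 2, 1)):
    """0/1 knapsack over a token budget.

    Maximizes the tier-weighted count of requests whose remaining slack
    still covers the predicted batch latency. Returns (value, chosen_ids).
    """
    feasible = [
        r for r in sorted(requests, key=priority_key)
        if r.slack_ms >= predicted_ms and r.tokens <= token_budget
    ]
    # best[b] holds the best (value, chosen ids) using at most b tokens.
    best = [(0, ())] * (token_budget + 1)
    for r in feasible:
        weight = tier_weight[r.tier]
        # Iterate budgets downward so each request is used at most once.
        for b in range(token_budget, r.tokens - 1, -1):
            prev_value, prev_ids = best[b - r.tokens]
            if prev_value + weight > best[b][0]:
                best[b] = (prev_value + weight, prev_ids + (r.rid,))
    return best[token_budget]
```

With uniform weights, picking the cheapest requests first would already maximize the count, so the dynamic program only earns its keep once requests carry different values, which is why this sketch weights them by tier.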

## Experimental Evaluation Results (Evidence)

SlidingServe delivers significant gains across a range of loads:
1. **Throughput Improvement**: Compared with advanced existing systems, service capacity increases by up to 30%, so the same hardware supports more concurrent users;
2. **Reduced SLO Violation Rate**: Under high load, the SLO violation rate decreases by 16%-53%, making it suitable for strict latency scenarios such as real-time dialogue;
3. **Fine-Grained QoS Support**: Can provide differentiated latency guarantees for users of different service levels without sacrificing overall efficiency.
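
For readers who want to track the same kind of metric on their own deployment, the following sketch shows one way to compute an SLO violation rate per service level. The tier names, the time-to-first-token and time-per-output-token targets, and the `Completed` record are hypothetical placeholders, not values or definitions from the paper. The source also does not say how the 16%-53% figure is measured, so `relative_reduction` is just one possible reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    ttft_ms: float  # time-to-first-token target (assumed metric)
    tpot_ms: float  # time-per-output-token target (assumed metric)


# Hypothetical service tiers; the numbers are placeholders.
TIERS = {"premium": SLO(300, 40), "standard": SLO(1000, 80)}


@dataclass(frozen=True)
class Completed:
    tier: str
    ttft_ms: float
    tpot_ms: float


def violated(c: Completed) -> bool:
    # A request violates its SLO if either phase misses its target.
    slo = TIERS[c.tier]
    return c.ttft_ms > slo.ttft_ms or c.tpot_ms > slo.tpot_ms


def violation_rate(done, tier=None) -> float:
    # Fraction of finished requests (optionally within one tier) that violated.
    pool = [c for c in done if tier is None or c.tier == tier]
    if not pool:
        return 0.0
    return sum(violated(c) for c in pool) / len(pool)


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    # Reduction expressed as a fraction of the baseline violation rate.
    return (baseline_rate - new_rate) / baseline_rate
```

Reporting the rate per tier as well as overall makes it easier to check that a gain for one service level was not paid for by another.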

## Effectiveness of the Sliding Window Mechanism (Technical Insight)

The key to SlidingServe's success is that it moves beyond single-point decision-making, where each iteration is scheduled in isolation:
1. **Avoid Short-Sighted Decisions**: Integrates future information to prevent greedy strategies from sacrificing long-term efficiency;
2. **Smooth Load Fluctuations**: Effectively absorbs sudden loads in LLM inference and maintains system stability;
3. **Optimize Resource Matching**: Precisely matches computing resources with request characteristics to reduce resource waste.
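
The contrast between a greedy choice and a window-aware one can be illustrated with a toy chunk-size selector. This is a sketch under strong assumptions, not SlidingChunker itself: iteration latency is modeled as linear in batched tokens, the window is approximated by assuming the same chunk size is reused until the prompt finishes, and the names `tbt_slack_ms` and `ttft_slack_ms` are invented for the example.

```python
import math


def iter_ms(tokens, base_ms=5.0, per_token_ms=0.02):
    # Stand-in for the latency predictor: linear in batched tokens (assumption).
    return base_ms + per_token_ms * tokens


def greedy_chunk(prompt_left, decode_tokens, budget):
    # Greedy baseline: fill whatever token budget is left in this iteration.
    return min(prompt_left, budget - decode_tokens)


def windowed_chunk(prompt_left, decode_tokens, budget,
                   tbt_slack_ms, ttft_slack_ms, step=64):
    """Smallest chunk that keeps every iteration within the decode slack
    while still finishing the prompt before its first-token deadline."""
    max_chunk = min(prompt_left, budget - decode_tokens)
    if max_chunk <= 0:
        return 0
    for chunk in range(step, max_chunk + step, step):
        chunk = min(chunk, max_chunk)
        per_iter = iter_ms(decode_tokens + chunk)
        iters = math.ceil(prompt_left / chunk)  # iterations left at this size
        if per_iter <= tbt_slack_ms and iters * per_iter <= ttft_slack_ms:
            return chunk
    # No chunk satisfies both constraints; this sketch falls back to greedy.
    return max_chunk
```

For example, with a 2048-token prompt, 64 decode tokens, and a 2048-token budget, the greedy choice takes 1984 prompt tokens at once, while the windowed choice with `tbt_slack_ms=25` and `ttft_slack_ms=150` settles on a 128-token chunk that keeps each iteration short for the requests already decoding.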

## Deployment Practice Insights (Recommendations)

Notes for applying SlidingServe:
1. **Predictor Calibration**: The predictor needs to be calibrated based on model, hardware, and load characteristics; the lightweight design supports continuous runtime calibration;
2. **Flexibility in SLO Definition**: SLOs can be defined on end-to-end latency or on per-phase targets; defining multi-level SLOs is recommended in order to take advantage of the differentiated service capabilities;
3. **System Integration**: The modular design allows gradual integration into existing LLM service frameworks, with components that can be introduced independently.
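
As an illustration of what continuous runtime calibration could look like, the sketch below refits a small linear latency model over a sliding window of recent measurements. The feature set (a constant, batched tokens, and KV-cache tokens), the linear form, and the `LatencyPredictor` class are assumptions chosen for brevity; the source does not describe the predictor's actual model.

```python
from collections import deque


class LatencyPredictor:
    """Least-squares fit of latency_ms ~ w0 + w1 * batch_k + w2 * kv_k over
    the most recent samples, with token counts scaled to thousands."""

    def __init__(self, window=256):
        self.samples = deque(maxlen=window)  # old samples fall out automatically
        self.w = [0.0, 0.0, 0.0]

    @staticmethod
    def _features(batch_tokens, kv_tokens):
        # Scaling keeps the normal equations reasonably well conditioned.
        return (1.0, batch_tokens / 1000.0, kv_tokens / 1000.0)

    def observe(self, batch_tokens, kv_tokens, measured_ms):
        self.samples.append((self._features(batch_tokens, kv_tokens), measured_ms))

    def refit(self) -> bool:
        n = 3
        # Augmented normal equations [X^T X | X^T y].
        a = [
            [sum(x[i] * x[j] for x, _ in self.samples) for j in range(n)]
            + [sum(x[i] * y for x, y in self.samples)]
            for i in range(n)
        ]
        # Gauss-Jordan elimination with partial pivoting.
        for col in range(n):
            pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
            if abs(a[pivot][col]) < 1e-12:
                return False  # not enough independent samples yet
            a[col], a[pivot] = a[pivot], a[col]
            for r in range(n):
                if r != col:
                    f = a[r][col] / a[col][col]
                    a[r] = [v - f * p for v, p in zip(a[r], a[col])]
        self.w = [a[i][n] / a[i][i] for i in range(n)]
        return True

    def predict(self, batch_tokens, kv_tokens) -> float:
        x = self._features(batch_tokens, kv_tokens)
        return sum(wi * xi for wi, xi in zip(self.w, x))
```

A serving loop would call `observe` after each iteration and `refit` periodically. The bounded window is what lets the model follow changes in load or hardware behavior instead of averaging over all history.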

## Limitations and Future Directions

Directions that SlidingServe still needs to explore:
1. **Heterogeneous Hardware Support**: Extend to CPU+GPU hybrid architectures or dedicated inference accelerators;
2. **Multi-Model Services**: Address the scheduling complexity of serving multiple models of different scales simultaneously;
3. **Online Learning Optimization**: Continuously optimize predictors and sorting strategies through online learning to adapt to load changes.

## Summary

SlidingServe is an important advance in LLM inference scheduling. By combining current and future information through its sliding window mechanism, it achieves significant improvements in throughput and efficiency while holding to strict SLO guarantees. It offers a useful technical reference for teams operating LLM services at scale and supports the large-scale deployment of AI infrastructure.