SlidingServe: SLO-Aware Sliding Window Scheduling System for LLM Online Inference

This article introduces SlidingServe, a system that combines a lightweight batch latency predictor, dynamic chunking, and multi-level priority sorting. It is reported to raise LLM inference throughput by up to 30% while maintaining service quality, and to reduce SLO violation rates by 16%-53% under high load.

Tags: LLM Inference, Scheduling Optimization, SLO Guarantee, Batch Processing, Quality of Service, Dynamic Programming

Published 2026-06-04 17:36 | Recent activity 2026-06-05 14:52 | Estimated read 8 min

Section 01

SlidingServe: Guide to SLO-Aware LLM Inference Scheduling System

Original Author/Team: Paper Author Team (arXiv submission)
Source Platform: arXiv
Original Title: Beyond Greedy Chunking: SLO-Aware Sliding-Window Scheduling for LLM Inference
Original Link: http://arxiv.org/abs/2606.05933v1
Release Time: June 4, 2026

Core Insight: SlidingServe uses a lightweight batch latency predictor, dynamic chunking, and multi-level priority sorting to increase LLM inference throughput by up to 30% while ensuring service quality, and reduce SLO violation rates by 16%-53% under high load.

Section 02

Scheduling Dilemmas of LLM Online Services (Background)

With the popularity of large language models in interactive applications, inference scheduling faces three major pain points:

Prediction Difficulty: It is hard to accurately estimate the decoding time of batch requests, leading to a lack of forward-looking scheduling decisions;
Rigid Chunking: Traditional greedy chunking strategies cannot adapt to dynamic loads, easily causing resource waste or latency violations;
Conflict Between Fairness and Efficiency: Simple priority strategies struggle to balance the guarantee of critical requests and overall system efficiency.

Section 03

Core Architecture of SlidingServe (Methodology)

The core innovation of SlidingServe is its sliding window mechanism, which combines information from the current iteration with information about upcoming iterations. The system comprises four modules:

Lightweight Batch Latency Predictor: Considers multi-dimensional factors such as KV cache, sequence length, and GPU load to estimate batch execution time with low overhead;
SlidingChunker Dynamic Chunking: Combines the urgency of current requests, the next batch of new requests, and GPU status to achieve dynamic chunking;
Multi-Level Priority Sorter: Sorts requests based on urgency (remaining SLO time), service level, resource requirements, and waiting time;
BatchConstructor Dynamic Programming: Solves for the optimal request set within milliseconds, maximizing the number of requests that meet their SLOs.
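
The sorter and the batch constructor described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Request` fields, the lexicographic sort key, the `TIER_VALUE` weights, and the framing of batch construction as a 0/1 knapsack over a token budget are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical value of serving a request of each service tier (0 = highest).
TIER_VALUE = {0: 3, 1: 2, 2: 1}


@dataclass(frozen=True)
class Request:
    rid: str
    tokens: int        # tokens the request adds to the batch (resource demand)
    slack_ms: float    # time remaining before its SLO deadline
    tier: int          # service level, 0 = highest
    waited_ms: float   # time already spent in the queue


def priority_key(r):
    # Assumed lexicographic order: least slack first, then higher tier,
    # then smaller resource demand, then longest wait.
    return (r.slack_ms, r.tier, r.tokens, -r.waited_ms)


def build_batch(requests, token_budget):
    """0/1 knapsack: choose requests that maximize total tier value
    without exceeding token_budget. Returns (value, chosen ids)."""
    # best[t] holds the best (value, ids) achievable with at most t tokens.
    best = [(0, ())] * (token_budget + 1)
    for r in requests:
        # Iterate capacities downward so each request is used at most once.
        for t in range(token_budget, r.tokens - 1, -1):
            value = best[t - r.tokens][0] + TIER_VALUE[r.tier]
            if value > best[t][0]:
                best[t] = (value, best[t - r.tokens][1] + (r.rid,))
    return best[token_budget]


# Made-up queue contents for demonstration.
queue = [
    Request("a", tokens=6, slack_ms=50, tier=0, waited_ms=10),
    Request("b", tokens=4, slack_ms=20, tier=1, waited_ms=30),
    Request("c", tokens=4, slack_ms=20, tier=1, waited_ms=40),
    Request("d", tokens=3, slack_ms=80, tier=2, waited_ms=5),
]
ordered = sorted(queue, key=priority_key)
value, chosen = build_batch(ordered, token_budget=8)
print([r.rid for r in ordered], value, chosen)
```

With an 8-token budget, a greedy pass that admits the highest-tier request first would pick `a` alone (value 3), whereas the dynamic program picks `b` and `c` (value 4). The table has one entry per token of budget, so the cost grows with the number of requests times the budget.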

Section 04

Experimental Evaluation Results (Evidence)

SlidingServe delivers significant gains under a range of loads:

Throughput Improvement: Compared with state-of-the-art systems, service capacity increases by up to 30%, so the same hardware can support more concurrent users;
Reduced SLO Violation Rate: Under high load, the SLO violation rate decreases by 16%-53%, making it suitable for strict latency scenarios such as real-time dialogue;
Fine-Grained QoS Support: Can provide differentiated latency guarantees for users of different service levels without sacrificing overall efficiency.
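
For readers who want to reproduce this kind of measurement, the sketch below shows one common way to compute an SLO violation rate and the relative reduction between two schedulers. The summary does not state whether the 16%-53% figure is a relative or an absolute reduction, so the helper names, the 100 ms SLO, and the latency samples here are illustrative assumptions only.

```python
def violation_rate(latencies_ms, slo_ms):
    """Fraction of requests whose latency exceeds the SLO."""
    if not latencies_ms:
        return 0.0
    return sum(1 for x in latencies_ms if x > slo_ms) / len(latencies_ms)


def relative_reduction(baseline_rate, new_rate):
    """Relative drop in violation rate; 0.5 means violations were halved."""
    if baseline_rate == 0:
        return 0.0
    return (baseline_rate - new_rate) / baseline_rate


# Made-up per-request latencies (ms) for two schedulers under a 100 ms SLO.
baseline = [80, 120, 95, 130, 101]
candidate = [80, 99, 95, 130, 90]
base_rate = violation_rate(baseline, 100)
new_rate = violation_rate(candidate, 100)
print(base_rate, new_rate, relative_reduction(base_rate, new_rate))
```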

Section 05

Effectiveness of the Sliding Window Mechanism (Technical Insight)

The key to SlidingServe's success is moving beyond single-point decision-making, in which each iteration is scheduled without regard to the iterations that follow:

Avoid Short-Sighted Decisions: Integrates future information to prevent greedy strategies from sacrificing long-term efficiency;
Smooth Load Fluctuations: Effectively absorbs sudden loads in LLM inference and maintains system stability;
Optimize Resource Matching: Precisely matches computing resources with request characteristics to reduce resource waste.
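
The contrast between greedy chunking and a look-ahead decision can be sketched as follows. This is an illustrative assumption rather than the SlidingChunker algorithm itself: `predict_ms` is a made-up linear latency model, and `window_slacks_ms` stands in for the remaining SLO slack of requests due in the current and next few iterations.

```python
def predict_ms(prefill_tokens, decode_reqs):
    # Hypothetical linear latency model; the coefficients are made up.
    return 5.0 + 0.05 * prefill_tokens + 0.5 * decode_reqs


def greedy_chunk(pending_prefill, max_chunk):
    # Greedy chunking: always fill the iteration's token budget.
    return min(pending_prefill, max_chunk)


def window_chunk(pending_prefill, max_chunk, decode_reqs,
                 window_slacks_ms, step=64):
    """Shrink the greedy chunk in `step`-sized decrements until the predicted
    iteration latency fits the tightest slack in the look-ahead window."""
    chunk = greedy_chunk(pending_prefill, max_chunk)
    tightest = min(window_slacks_ms, default=float("inf"))
    while chunk > 0 and predict_ms(chunk, decode_reqs) > tightest:
        chunk = max(chunk - step, 0)
    return chunk


# 2000 prefill tokens pending, 10 running decode requests,
# and a request with only 30 ms of slack inside the window.
print(greedy_chunk(2000, 512))                            # greedy size
print(window_chunk(2000, 512, 10, [45.0, 30.0, 60.0]))    # look-ahead size
```

In this example the greedy chunk of 512 tokens would be predicted to take about 35.6 ms and miss the 30 ms slack, while the look-ahead version settles on 384 tokens (about 29.2 ms under the assumed model).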

Section 06

Deployment Practice Insights (Recommendations)

Notes for applying SlidingServe:

Predictor Calibration: The predictor needs to be calibrated based on model, hardware, and load characteristics; the lightweight design supports continuous runtime calibration;
Flexibility in SLO Definition: Supports end-to-end latency or phased goals; it is recommended to define multi-level SLOs to leverage differentiated service capabilities;
System Integration: The modular design allows gradual integration into existing LLM service frameworks, with components that can be introduced independently.
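
One simple way to picture continuous runtime calibration is a correction factor that tracks the ratio of measured to predicted batch latency. The sketch below is an assumption for illustration, not the predictor described in the paper: the class name `CalibratedPredictor`, the base model coefficients, and the exponentially weighted update are all hypothetical.

```python
class CalibratedPredictor:
    """Wraps a base latency model with a multiplicative correction factor
    that is updated from observed batch latencies (exponential average)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha   # weight given to the newest observation
        self.scale = 1.0     # correction applied to the base estimate

    def base_ms(self, batch_tokens, kv_tokens):
        # Hypothetical offline-fitted model; coefficients are placeholders.
        return 2.0 + 0.01 * batch_tokens + 0.001 * kv_tokens

    def predict_ms(self, batch_tokens, kv_tokens):
        return self.scale * self.base_ms(batch_tokens, kv_tokens)

    def observe(self, batch_tokens, kv_tokens, measured_ms):
        # Move the scale toward the measured / base-predicted ratio.
        ratio = measured_ms / self.base_ms(batch_tokens, kv_tokens)
        self.scale = (1 - self.alpha) * self.scale + self.alpha * ratio


predictor = CalibratedPredictor()
before = predictor.predict_ms(800, 2000)
# Pretend the hardware is consistently twice as slow as the base model says.
for _ in range(10):
    predictor.observe(800, 2000, measured_ms=24.0)
after = predictor.predict_ms(800, 2000)
print(before, after)
```

A smaller `alpha` reacts more slowly but is less sensitive to a single outlier batch; the right value would depend on how noisy the measured latencies are on the target hardware.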

Section 07

Limitations and Future Directions

Directions that SlidingServe still needs to explore:

Heterogeneous Hardware Support: Extend to CPU+GPU hybrid architectures or dedicated inference accelerators;
Multi-Model Services: Address the scheduling complexity of serving multiple models of different scales simultaneously;
Online Learning Optimization: Continuously optimize predictors and sorting strategies through online learning to adapt to load changes.

Section 08

Summary

SlidingServe advances LLM inference scheduling by using a sliding window mechanism to combine information from the current and upcoming iterations. While keeping strict SLO guarantees, it is reported to raise throughput by up to 30% and to reduce SLO violation rates by 16%-53% under high load. It offers a useful technical reference for teams that operate LLM services at scale.
