Kairos: An SLO-Aware Scheduling System for Disaggregated LLM Inference

This article introduces Kairos, a scheduling system for disaggregated LLM inference. It tackles the SLO violations caused by the long-tail distribution of request lengths, using urgency-first scheduling and slack-guided adaptive batching.

LLM inference, disaggregated architecture, SLO scheduling, TTFT, TPOT, request scheduling, continuous batching, long-tail distribution

Published 2026-05-04 16:29 | Recent activity 2026-05-05 11:22 | Estimated read 5 min

Section 01

Kairos: Guide to the SLO-Aware Scheduling System for Disaggregated LLM Inference

Kairos targets SLO attainment in disaggregated LLM inference, where the long-tail distribution of request lengths makes latency targets hard to meet. It optimizes two key SLO metrics, TTFT (Time to First Token) and TPOT (Time per Output Token), through urgency-first scheduling in the prefill phase and slack-guided adaptive batching in the decoding phase. The reported outcome is a significantly higher SLO attainment rate and higher system throughput.

Section 02

Scheduling Challenges for LLM Inference in Production Environments

Meeting strict SLOs is a core challenge when deploying LLMs in production. Request lengths follow a long-tail distribution: most requests are short, but a few are very long. In a disaggregated architecture, where prefill and decoding run on separate instances, this causes two problems. In the prefill phase, long requests cause head-of-line blocking for the short requests queued behind them. In the decoding phase, slow requests leave resources underutilized. The usual defaults, FCFS for prefill and continuous batching for decoding, do not adapt to these LLM-specific workloads, so SLO attainment suffers and throughput stays suboptimal.
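
To make head-of-line blocking concrete, here is a minimal sketch of a single prefill worker that serves requests strictly in arrival order. The prefill times, the 500 ms TTFT SLO, and the assumption that all requests arrive at the same moment are illustrative values, not figures from the Kairos evaluation.

```python
from itertools import accumulate


def fcfs_ttft(prefill_ms):
    """TTFT of each request when all arrive at t=0 and one worker
    prefills them one at a time in the given order."""
    return list(accumulate(prefill_ms))


def attainment(ttfts, slo_ms):
    """Fraction of requests whose TTFT is within the SLO."""
    return sum(t <= slo_ms for t in ttfts) / len(ttfts)


# Hypothetical queue: one long prompt arrives just ahead of three short ones.
queue = [900, 50, 60, 40]

in_arrival_order = fcfs_ttft(queue)        # [900, 950, 1010, 1050]
short_first = fcfs_ttft(sorted(queue))     # [40, 90, 150, 1050]

print(attainment(in_arrival_order, 500))   # 0.0
print(attainment(short_first, 500))        # 0.75
```

The long request misses its deadline in both orders, but under FCFS it also drags every short request past the SLO.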

Section 03

Prefill Phase: Prediction-Driven Urgency Scheduling

Kairos uses urgency-first scheduling in the prefill phase. Under traditional FCFS, long requests block the short ones behind them. Kairos instead predicts each request's prefill completion time and prioritizes requests that can still finish within the TTFT SLO deadline, which maximizes the TTFT SLO attainment rate. The prediction comes from a cost model that estimates prefill time from request features such as input length and model configuration. The predictions do not need to be perfectly accurate to improve scheduling significantly.
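
The idea can be sketched as follows. This is an illustration under stated assumptions, not the Kairos implementation: the linear cost model `predict_prefill_ms`, its coefficients, the tie-breaking rules in `pick_next`, and the sample requests are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival_ms: float
    input_len: int


def predict_prefill_ms(input_len, base_ms=5.0, per_token_ms=0.125):
    # Hypothetical linear cost model; a real one would be fitted per model
    # and hardware configuration.
    return base_ms + per_token_ms * input_len


def pick_next(waiting, now_ms, ttft_slo_ms):
    """Pick the next request to prefill.

    Slack = deadline - predicted completion time if started now.
    Assumed policy: among requests that can still meet the deadline, run the
    one with the least slack; if none can, run the cheapest to drain the queue.
    """
    def slack(r):
        deadline = r.arrival_ms + ttft_slo_ms
        return deadline - (now_ms + predict_prefill_ms(r.input_len))

    feasible = [r for r in waiting if slack(r) >= 0]
    if feasible:
        return min(feasible, key=slack)
    return min(waiting, key=lambda r: predict_prefill_ms(r.input_len))


def schedule(requests, ttft_slo_ms):
    """Simulate one prefill worker; all requests are assumed already queued.
    Returns (rid, completion time) pairs in execution order."""
    waiting, now, order = list(requests), 0.0, []
    while waiting:
        r = pick_next(waiting, now, ttft_slo_ms)
        waiting.remove(r)
        now += predict_prefill_ms(r.input_len)
        order.append((r.rid, now))
    return order


queue = [
    Request("long", 0.0, 8000),  # predicted 1005 ms, cannot meet a 500 ms SLO
    Request("a", 0.0, 400),
    Request("b", 0.0, 600),
    Request("c", 0.0, 500),
]
order = schedule(queue, ttft_slo_ms=500.0)
print(order)
```

In this example the three requests that can still meet the deadline run first, most urgent first, and the long request that would miss it anyway runs last. In arrival order, all four would have missed the 500 ms target.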

Section 04

Decoding Phase: Slack Time-Guided Adaptive Batching

In the decoding phase, Kairos proposes slack-guided adaptive batching. With continuous batching, slow requests slow down the entire batch. Kairos tracks each request's slack time, the margin between its current progress and its SLO deadline, and batches requests that have ample slack together with additional short requests. This maximizes batch size while still meeting SLOs, which improves GPU utilization and throughput. To do so, the system must continuously monitor request progress and adjust batch composition dynamically.
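
One way to read this is that the tightest slack in a batch bounds how long a decoding step may take, and therefore how large the batch may grow. The sketch below illustrates that reading only. The slack definition, the linear step-time model `step_ms`, and the greedy rule in `build_batch` are assumptions, not the algorithm described by the Kairos authors.

```python
from dataclasses import dataclass


@dataclass
class DecodeReq:
    rid: str
    tokens_out: int    # tokens generated so far
    elapsed_ms: float  # decode time spent so far


def slack_ms(r, tpot_slo_ms):
    # Assumed definition: time left to emit the next token while keeping
    # the request's average TPOT within the SLO.
    return tpot_slo_ms * (r.tokens_out + 1) - r.elapsed_ms


def step_ms(batch_size, base_ms=10.0, per_req_ms=2.0):
    # Hypothetical cost model: one decoding step gets slower as the batch grows.
    return base_ms + per_req_ms * batch_size


def build_batch(reqs, tpot_slo_ms, max_batch=64):
    """Grow the batch, most urgent request first, while the predicted step
    time still fits within the tightest slack in the batch."""
    if not reqs:
        return []
    by_urgency = sorted(reqs, key=lambda r: slack_ms(r, tpot_slo_ms))
    tightest = slack_ms(by_urgency[0], tpot_slo_ms)
    size = 1  # the most urgent request always runs
    while size < min(max_batch, len(by_urgency)) and step_ms(size + 1) <= tightest:
        size += 1
    return by_urgency[:size]


reqs = [
    DecodeReq("a", 10, 534.0),  # slack 16 ms: close to violating the SLO
    DecodeReq("b", 10, 300.0),  # slack 250 ms
    DecodeReq("c", 5, 100.0),   # slack 200 ms
    DecodeReq("d", 20, 500.0),  # slack 550 ms
    DecodeReq("e", 1, 20.0),    # slack 80 ms
]
print([r.rid for r in build_batch(reqs, tpot_slo_ms=50.0)])      # ['a', 'e', 'c']
print([r.rid for r in build_batch(reqs[1:], tpot_slo_ms=50.0)])  # ['e', 'c', 'b', 'd']
```

While request `a` is nearly out of slack, the batch is capped at three requests. Once it is gone, the remaining requests all have ample slack and the batch grows to include all of them.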

Section 05

Experimental Evaluation: Significant Performance Improvements

Experiments on online service datasets and advanced LLMs show significant improvements with Kairos: the TTFT SLO attainment rate rises by up to 23.9%, the TPOT SLO attainment rate by up to 27.1%, the end-to-end SLO attainment rate by up to 33.8%, and decoding throughput by up to 19.3%. These results indicate that smarter scheduling can improve service quality without additional hardware.
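
For readers who want to track the same three attainment rates on their own traces, the sketch below uses common definitions: TTFT is first-token time minus arrival time, TPOT is the average gap between output tokens after the first, and a request attains the end-to-end SLO only if it meets both. These definitions, the `Trace` fields, and the sample data are assumptions; the summary does not state how the Kairos evaluation computes them.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    arrival_ms: float
    first_token_ms: float
    finish_ms: float
    output_tokens: int


def ttft(t):
    return t.first_token_ms - t.arrival_ms


def tpot(t):
    # Average time per output token after the first; 0 for single-token outputs.
    if t.output_tokens <= 1:
        return 0.0
    return (t.finish_ms - t.first_token_ms) / (t.output_tokens - 1)


def attainment(traces, ttft_slo_ms, tpot_slo_ms):
    """Fraction of requests meeting the TTFT SLO, the TPOT SLO, and both."""
    n = len(traces)
    ttft_ok = [ttft(t) <= ttft_slo_ms for t in traces]
    tpot_ok = [tpot(t) <= tpot_slo_ms for t in traces]
    both_ok = [a and b for a, b in zip(ttft_ok, tpot_ok)]
    return {
        "ttft": sum(ttft_ok) / n,
        "tpot": sum(tpot_ok) / n,
        "e2e": sum(both_ok) / n,
    }


# Hypothetical traces, not data from the Kairos experiments.
traces = [
    Trace(0.0, 200.0, 600.0, 11),    # TTFT 200, TPOT 40: meets both
    Trace(0.0, 700.0, 1100.0, 11),   # TTFT 700: misses TTFT
    Trace(100.0, 300.0, 900.0, 11),  # TPOT 60: misses TPOT
    Trace(100.0, 400.0, 800.0, 21),  # TTFT 300, TPOT 20: meets both
]
print(attainment(traces, ttft_slo_ms=500.0, tpot_slo_ms=50.0))
# {'ttft': 0.75, 'tpot': 0.75, 'e2e': 0.5}
```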

Section 06

Technical Insights and Industry Significance

Kairos shows that general-purpose scheduling strategies perform suboptimally under LLM workloads and need to be optimized for the long-tail distribution of request lengths. It also reflects a broader shift in AI infrastructure from simple resource management to intelligent workload scheduling. For teams deploying LLMs, its two core ideas, prediction-driven scheduling and SLO-aware resource allocation, are a useful reference and can carry over to other architectures.
