# Kairos: An SLO-Aware Scheduling System for Disaggregated LLM Inference

> This article introduces the Kairos scheduling system, which addresses the SLO attainment issue caused by the long-tail distribution of request lengths in disaggregated LLM inference architectures through urgency-first scheduling and slack-guided adaptive batching mechanisms.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-04T08:29:47.000Z
- Last activity: 2026-05-05T03:22:42.595Z
- Heat: 123.1
- Keywords: LLM inference, disaggregated architecture, SLO scheduling, TTFT, TPOT, request scheduling, continuous batching, long-tail distribution
- Page link: https://www.zingnex.cn/en/forum/thread/kairos-llmslo
- Canonical: https://www.zingnex.cn/forum/thread/kairos-llmslo

---

## Kairos: Guide to the SLO-Aware Scheduling System for Disaggregated LLM Inference

This article introduces the Kairos scheduling system, which targets the SLO attainment problem caused by the long-tail distribution of request lengths in disaggregated LLM inference architectures. It optimizes two key SLO metrics—TTFT (Time to First Token) and TPOT (Time per Output Token)—through urgency-first scheduling in the prefill phase and slack-guided adaptive batching in the decoding phase, significantly improving SLO attainment rate and system throughput.

## Scheduling Challenges for LLM Inference in Production Environments

In production LLM deployments, meeting strict SLOs is a core challenge. LLM inference request lengths follow a long-tail distribution, and in a disaggregated (prefill/decode) architecture this hurts both phases: long requests cause head-of-line blocking in the prefill phase, while slow requests leave resources underutilized in the decoding phase. The prevailing strategies, FCFS for prefill and continuous batching for decoding, do not adapt to these LLM-specific workload characteristics, so both SLO attainment and throughput suffer.
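To make the blocking effect concrete, here is a toy FCFS simulation; the request IDs, service times, and the 1-second TTFT SLO are illustrative numbers, not figures from Kairos:

```python
# Made-up numbers illustrating head-of-line blocking under FCFS: one long
# prefill delays every short request queued behind it past a 1s TTFT SLO.

TTFT_SLO = 1.0  # seconds; hypothetical deadline for every request

# (request_id, prefill_seconds), in arrival order
queue = [("long-1", 5.0), ("short-1", 0.2), ("short-2", 0.2), ("short-3", 0.2)]

clock = 0.0
for req_id, prefill_s in queue:      # FCFS: run strictly in arrival order
    clock += prefill_s               # first token arrives when prefill ends
    verdict = "meets" if clock <= TTFT_SLO else "misses"
    print(f"{req_id}: TTFT={clock:.1f}s -> {verdict} the {TTFT_SLO:.0f}s SLO")

# All four requests miss. Running the three short requests first would let
# each of them finish within 0.6s, sacrificing only the long request that
# was going to miss its deadline anyway.
```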

## Prefill Phase: Prediction-Driven Urgency Scheduling

Kairos adopts an urgency-first scheduling strategy in the prefill phase. Where traditional FCFS lets long requests block short ones, Kairos predicts each request's prefill completion time and prioritizes requests that can still finish within their TTFT SLO deadline, maximizing the TTFT SLO attainment rate. The prediction comes from a cost model over request features (input length, model configuration, and so on) that estimates prefill time; even when predictions are imperfect, they markedly improve scheduling decisions.
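A minimal sketch of what such a scheduler could look like; the linear cost-model coefficients, the EDF-style least-slack ordering, and the deferral of unattainable requests are assumptions for illustration, not details confirmed by the paper:

```python
from dataclasses import dataclass

# Assumed linear cost model fit offline:
# predicted_prefill = A_PER_TOKEN * input_len + B_CONST.
A_PER_TOKEN = 0.0005   # assumed seconds per prompt token
B_CONST = 0.01         # assumed fixed per-request overhead in seconds

@dataclass
class Request:
    req_id: str
    input_len: int       # prompt length in tokens
    deadline: float      # absolute TTFT deadline (arrival time + TTFT SLO)

def predicted_prefill(req: Request) -> float:
    return A_PER_TOKEN * req.input_len + B_CONST

def pick_next(queue: list[Request], now: float) -> Request:
    """Urgency-first: among requests that can still finish prefill before
    their TTFT deadline, run the one with the least slack (EDF-style).
    Requests already predicted to miss are deferred so they cannot block
    attainable ones, and run only when nothing attainable is waiting."""
    def slack(r: Request) -> float:
        return r.deadline - (now + predicted_prefill(r))

    attainable = [r for r in queue if slack(r) >= 0.0]
    pool = attainable if attainable else queue
    chosen = min(pool, key=slack)
    queue.remove(chosen)
    return chosen

# With a 1s TTFT SLO, the 16k-token prompt is already predicted to miss
# (8.01s prefill), so the short request runs first.
q = [Request("long", 16000, deadline=1.0), Request("short", 400, deadline=1.0)]
print(pick_next(q, now=0.0).req_id)   # -> short
```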

## Decoding Phase: Slack Time-Guided Adaptive Batching

In the decoding phase, Kairos proposes a slack-guided adaptive batching strategy. Under plain continuous batching, the slowest request drags down the whole batch. Kairos instead tracks each request's SLO slack time, the margin between its current progress and its SLO deadline, and packs requests that have ample slack together with more short requests. This maximizes batch size while still meeting SLOs, improving GPU utilization and throughput. To do so, the system continuously monitors request progress and dynamically adjusts batch composition.
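One way such a policy could be realized; the linear step-latency model, the banked-slack accounting, and the 50-step amortization horizon are all assumptions made for this sketch, not Kairos's exact policy:

```python
from dataclasses import dataclass

HORIZON_STEPS = 50       # assumed smoothing horizon for banked slack

@dataclass
class DecodeRequest:
    req_id: str
    tpot_slo_s: float    # per-output-token budget from the TPOT SLO
    avg_tpot_s: float    # observed average seconds per token so far
    tokens_out: int      # tokens generated so far

def banked_slack(r: DecodeRequest) -> float:
    # Accumulated margin: seconds ahead of (negative: behind) the budget.
    return (r.tpot_slo_s - r.avg_tpot_s) * r.tokens_out

def per_step_budget(r: DecodeRequest) -> float:
    # A request with banked slack can tolerate slower steps for a while;
    # amortize its margin over an assumed horizon of future steps.
    return r.tpot_slo_s + banked_slack(r) / HORIZON_STEPS

def step_latency(batch_size: int) -> float:
    # Assumed profiled cost model for one decode step at this batch size.
    return 0.012 + 0.0008 * batch_size

def pack_batch(running: list[DecodeRequest],
               waiting: list[DecodeRequest],
               max_batch: int) -> list[DecodeRequest]:
    """Grow the batch only while every member, old and new, can still pay
    the enlarged per-step latency out of its per-token budget plus slack."""
    batch = list(running)
    for cand in waiting:
        if len(batch) >= max_batch:
            break
        new_latency = step_latency(len(batch) + 1)
        if all(new_latency <= per_step_budget(r) for r in [*batch, cand]):
            batch.append(cand)
    return batch
```

The point of the slack term is the adaptivity the article describes: a request that has been decoding faster than its TPOT budget accumulates slack and so tolerates larger batches, while a request running close to its deadline shrinks the admissible batch around it.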

## Experimental Evaluation: Significant Performance Improvements

Experiments on datasets from online services with recent LLMs show that Kairos delivers substantial gains: TTFT SLO attainment improves by up to 23.9%, TPOT SLO attainment by up to 27.1%, end-to-end SLO attainment by up to 33.8%, and decoding throughput by up to 19.3%. In other words, smarter scheduling improves service quality without any additional hardware.

## Technical Insights and Industry Significance

Kairos shows that general-purpose scheduling strategies underperform on specific workloads and must be tailored to the long-tail distribution of LLM request lengths. It reflects a broader shift in AI infrastructure from simple resource management toward intelligent workload scheduling. For LLM deployment teams, Kairos's ideas of prediction-driven scheduling and SLO-aware resource allocation transfer to a wide range of architectures and are a useful reference.
