Section 01
Kairos: Guide to the SLO-Aware Scheduling System for Disaggregated LLM Inference
This article introduces Kairos, a scheduling system for disaggregated LLM inference architectures. It targets the poor SLO attainment caused by the long-tail distribution of request lengths. Kairos optimizes two key SLO metrics, TTFT (Time to First Token) and TPOT (Time per Output Token), through urgency-first scheduling in the prefill phase and slack-guided adaptive batching in the decoding phase, significantly improving both SLO attainment rate and system throughput.
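To make the two ideas concrete, here is a minimal Python sketch. It is not Kairos's actual algorithm or API. It only illustrates the general shape of urgency-first ordering and slack-guided batch sizing under stated assumptions. The `Request` fields, the linear cost models (`secs_per_prompt_token`, `base_step`, `per_request_step`), and all function names are hypothetical, introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; field names are assumptions, not Kairos's.
    req_id: str
    arrival: float        # arrival time, in seconds
    prompt_tokens: int    # prompt length, drives prefill cost
    ttft_slo: float       # allowed seconds until the first token
    tpot_slo: float       # allowed seconds per output token


def prefill_slack(req, now, secs_per_prompt_token):
    """Time left before the TTFT deadline once the estimated prefill cost is paid.

    Assumes prefill time grows linearly with prompt length.
    """
    deadline = req.arrival + req.ttft_slo
    return deadline - now - req.prompt_tokens * secs_per_prompt_token


def urgency_first_order(requests, now, secs_per_prompt_token):
    """Order prefill work so the request with the least slack runs first."""
    return sorted(requests, key=lambda r: prefill_slack(r, now, secs_per_prompt_token))


def pick_decode_batch_size(requests, base_step, per_request_step, max_batch):
    """Largest batch size whose estimated decode step fits the tightest TPOT SLO.

    Assumes one decode step costs base_step + per_request_step * batch_size.
    Returns 0 if there are no requests or even a batch of one would miss the SLO.
    """
    if not requests:
        return 0
    tightest = min(r.tpot_slo for r in requests)
    size = 0
    for b in range(1, min(max_batch, len(requests)) + 1):
        if base_step + per_request_step * b <= tightest:
            size = b
        else:
            break
    return size


# Usage: a long prompt with a loose deadline vs. a short prompt with a tight one.
long_req = Request("long", arrival=0.0, prompt_tokens=1000, ttft_slo=1.0, tpot_slo=0.052)
short_req = Request("short", arrival=0.0, prompt_tokens=100, ttft_slo=0.5, tpot_slo=0.1)
order = urgency_first_order([long_req, short_req], now=0.2, secs_per_prompt_token=0.0005)
print([r.req_id for r in order])
```

In this sketch the short request is scheduled first because its remaining slack is smaller, even though both arrived at the same time. The batch-size helper uses the tightest TPOT SLO among all candidates as a simplification; a real scheduler would likely track slack per request.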