Zing Forum

TIE Scheduler: Optimizing LLM Inference Scheduling with Uncertainty-Aware Prediction

In LLM inference scheduling, traditional methods use point estimation to predict output length, ignoring the randomness in the decoding process. Studies have found that output length follows a heavy-tailed distribution, which can be fitted with a log-t distribution. Based on this, the proposed TIE metric estimates the risk of long outputs by adjusting tail probabilities, achieving a 2.31x reduction in per-token latency for online inference and a 1.42x increase in throughput for offline batch processing.

Tags: LLM inference scheduling optimization, uncertainty prediction, shortest-job-first, heavy-tailed distribution, log-t distribution, tail inflated expectation, throughput optimization
Published 2026-04-01 13:31 · Recent activity 2026-04-02 09:53 · Estimated read: 5 min

Section 01

[Main Floor] TIE Scheduler: Core Guide to Uncertainty-Aware Optimization of LLM Inference Scheduling

The TIE scheduler addresses the problem that traditional point estimation in LLM inference scheduling ignores the randomness of output length. Analysis shows that output length follows a heavy-tailed distribution (fittable with a log-t distribution), and the Tail Inflated Expectation (TIE) metric is proposed to inflate length estimates by the risk of long outputs. Experiments show a 2.31x reduction in per-token latency for online inference and a 1.42x increase in throughput for offline batch processing.

Section 02

[Background] Core Challenges of LLM Inference Scheduling and Limitations of the SJF Strategy

LLM inference services face latency and throughput bottlenecks. Request processing is divided into two stages, prefill and decode, and scheduling must balance latency against throughput. Because output lengths vary greatly across requests, FIFO is prone to head-of-line blocking. The shortest-job-first (SJF) strategy prioritizes short jobs, but existing methods predict output length with a single point estimate, which cannot capture the randomness of the generation process.
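The head-of-line blocking described above can be reproduced with a toy single-worker queue; the job lengths below are illustrative, not from the paper:

```python
def avg_completion(jobs, policy):
    """Mean completion time on one worker; all jobs arrive at t=0.

    policy="fifo" serves in arrival order; policy="sjf" serves
    shortest (here: perfectly predicted) jobs first.
    """
    order = jobs if policy == "fifo" else sorted(jobs)
    t, total = 0, 0
    for length in order:
        t += length   # worker finishes this job at time t
        total += t    # accumulate its completion time
    return total / len(order)

# One long request at the head of the queue stalls every short one.
jobs = [1000, 5, 5, 5, 5]
print(avg_completion(jobs, "fifo"))  # 1010.0
print(avg_completion(jobs, "sjf"))   # 214.0
```

Under FIFO the four short requests all wait behind the 1000-token job; SJF drains them first, cutting mean completion time by nearly 5x in this toy case. The catch, as the section notes, is that real schedulers only have noisy length predictions.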

Section 03

[Methodology] Output Length Distribution Characteristics and TIE Metric Design

Studies have found that output length exhibits a heavy-tailed distribution, which can be fitted with a t-distribution after a log transformation. Based on this, the TIE metric is proposed: it combines the distribution's expectation with its tail probability, inflating the estimate to account for the risk of long outputs. TIE is compatible with the SJF framework, computationally efficient, and usable online in real time.
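The paper's exact TIE formula isn't reproduced here; the sketch below only illustrates the idea of inflating a typical-length estimate by the log-t tail probability. The functional form and the `tail_len` and `lam` knobs are assumptions for illustration, not the paper's definitions.

```python
import math

def t_pdf(x, nu):
    """Standard Student-t density with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1.0 + x * x / nu) ** (-(nu + 1) / 2)

def t_sf(x, nu, steps=20000, upper=60.0):
    """P(T > x) via trapezoidal integration -- adequate for a sketch."""
    h = (upper - x) / steps
    s = 0.5 * (t_pdf(x, nu) + t_pdf(upper, nu))
    for i in range(1, steps):
        s += t_pdf(x + i * h, nu)
    return s * h

def tie_score(mu, sigma, nu, tail_len=2048, lam=4.0):
    """Hypothetical TIE: median length inflated by long-output risk.

    Assumes log(length) ~ mu + sigma * t_nu, per the log-t fit.
    """
    typical = math.exp(mu)                  # median of the log-t law
    z = (math.log(tail_len) - mu) / sigma   # standardized tail point
    return typical * (1.0 + lam * t_sf(z, nu))
```

Two requests with the same median prediction but different tail heaviness then get different scores: with `mu=4, sigma=1`, a heavy tail (`nu=3`) yields a larger TIE than a near-lognormal fit (`nu=30`), so the riskier request is ranked as "longer" by SJF even though its point estimate is identical.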

Section 04

[Implementation] Key Engineering Points for TIE Scheduler Deployment

1. Online prediction: a lightweight model outputs log-t distribution parameters from prompt features.
2. Dynamic adjustment: output-length estimates are updated as generation progresses.
3. Batch optimization: requests with similar TIE values are grouped together to reduce load imbalance.
4. Strategy combination: TIE works alongside priority scheduling and preemption mechanisms.
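Point 3, grouping requests with similar TIE values, can be sketched as follows; `make_batches` and the example scores are hypothetical, not the paper's API:

```python
def make_batches(scores, batch_size=4):
    """Chunk requests into batches of neighbors in TIE order.

    scores: dict mapping request id -> TIE score. Sorting first keeps
    each batch's predicted lengths close, so the requests in a batch
    finish decoding at similar times and stragglers waste fewer slots.
    """
    ranked = sorted(scores, key=scores.get)
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]

scores = {"r1": 120, "r2": 1900, "r3": 130, "r4": 2100, "r5": 115, "r6": 1800}
print(make_batches(scores, batch_size=3))
# [['r5', 'r1', 'r3'], ['r6', 'r2', 'r4']]
```

Short and long requests land in separate batches, so no batch is held open by a single long-running straggler.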

Section 05

[Experiments] Performance Improvement Verification of the TIE Scheduler

Online inference: per-token latency (TPOT) reduced by 2.31x, with less head-of-line blocking.
Offline batch processing: throughput increased by 1.42x via better batch composition.
Against the baselines (FIFO, point-estimate SJF, quantile-based SJF), TIE performs best and generalizes well across tasks such as dialogue and code generation.

Section 06

[Conclusion] Technical Contributions of the TIE Scheduler

1. Reveals the inherently random nature of output length in LLM inference.
2. Demonstrates the value of heavy-tailed distribution modeling for scheduling optimization.
3. Provides an efficient uncertainty quantification method, offering a new angle on system optimization.

Section 07

[Outlook] Limitations and Future Directions of the TIE Scheduler

Limitations: TIE relies on the log-t distribution assumption, and prediction accuracy degrades with prompt complexity.
Future directions: more flexible distribution models, stronger predictors, multi-dimensional optimization (combining input length and priority), and hardware-aware scheduling.

Section 08

[Applications] Practical Value of the TIE Scheduler

Cloud service providers: higher resource utilization and lower serving cost.
Enterprise users: smoother interaction and support for high concurrency.
Researchers: a reference point for uncertainty-aware scheduling problems.
As LLM applications grow, uncertainty-aware optimization will become a key direction.