# TIE: Optimization of LLM Inference Scheduling Based on Uncertainty-Aware Output Length Prediction

> TIE is an open-source project from an ICML 2026 paper. It optimizes LLM inference scheduling by predicting each request's output length together with the uncertainty of that prediction, with the aim of reducing GPU idle time and improving inference throughput.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-26T10:14:01.000Z
- Last activity: 2026-05-26T10:21:24.070Z
- Popularity: 141.9
- Keywords: LLM inference, scheduling optimization, ICML, vLLM, output length prediction, uncertainty, GPU optimization, batch processing
- Page link: https://www.zingnex.cn/en/forum/thread/tie-llm-3cd5c165
- Canonical: https://www.zingnex.cn/forum/thread/tie-llm-3cd5c165

---

## [Introduction] TIE: Uncertainty-Aware Output Length Prediction for Optimizing LLM Inference Scheduling

TIE is an open-source project from an ICML 2026 paper. It targets the GPU idling that arises in batched LLM inference when requests in the same batch produce outputs of very different lengths. Its approach is to predict each request's output length together with the uncertainty of that prediction, and to use both when ordering requests, with the aim of reducing GPU idle time and improving throughput. The project is built on the vLLM framework, and the source code is available at https://github.com/Hyzheng-code/TIE.

## Problem Background: The Challenge of Resource Wastage in LLM Inference Batch Processing

When an LLM inference service processes requests in batches, output lengths vary widely between requests, from dozens of tokens to thousands. Traditional scheduling strategies either assume that output lengths are similar or process requests sequentially. As a result, short requests wait for long ones to finish, GPU compute sits idle, and users wait longer for their responses.
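
The cost of mixing short and long requests can be illustrated with a deliberately simplified model. The sketch below assumes a static batch in which every request holds one slot until the longest request finishes; the function names and the example lengths are made up for illustration and are not taken from TIE or vLLM.

```python
def idle_slot_steps(output_lengths):
    """Count decode steps during which a batch slot produces nothing.

    Simplifying assumption: the batch is static, so each request holds
    its slot until the longest request in the batch has finished.
    """
    longest = max(output_lengths)
    return sum(longest - n for n in output_lengths)


def utilization(output_lengths):
    """Fraction of slot-steps in the batch that actually emit a token."""
    total_slot_steps = max(output_lengths) * len(output_lengths)
    return sum(output_lengths) / total_slot_steps


# Illustrative output lengths in tokens (not measured data).
mixed = [50, 80, 120, 2000]
similar = [50, 60, 70, 80]

print(idle_slot_steps(mixed), utilization(mixed))      # 5750 0.28125
print(idle_slot_steps(similar), utilization(similar))  # 60 0.8125
```

In this toy model, a single 2000-token request leaves the other three slots idle for most of the batch, which is the kind of waste that length-aware scheduling tries to avoid.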

## Core Idea and Technical Architecture of TIE

The core of TIE is uncertainty-aware output length prediction: it predicts not only the expected output length but also the uncertainty of that prediction, modeling the output length as a log-normal distribution. The architecture is built on the vLLM framework and has four main components:
1. TIE Predictor (ua_predictor.py): Uses a DeBERTa encoder to predict the mean and standard deviation of the log output length;
2. Score Calculator (ua_score_calculator.py): Computes a scheduling score from the predicted length and its uncertainty, downweighting requests whose predictions are highly uncertain;
3. Request Queue (request_queue.py): Gives priority to requests whose predictions are more certain and whose scheduling makes efficient use of the GPU;
4. Scheduler (scheduler.py): Extends the core scheduling logic of vLLM v1 and integrates the components above.
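
To make the idea of scoring with uncertainty concrete, here is a minimal sketch. It is not the formula used in ua_score_calculator.py, which this summary does not give. It assumes the predictor returns `mu` and `sigma` for the log output length, and it uses a hypothetical score `exp(mu + risk_weight * sigma)`, where a lower score is scheduled first. All names (`ScoredRequest`, `pessimistic_score`, `schedule_order`, `risk_weight`) are invented for this illustration.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass(order=True)
class ScoredRequest:
    score: float                           # only the score is compared
    request_id: str = field(compare=False)


def expected_length(mu, sigma):
    """Mean of a log-normal whose log has mean mu and std sigma."""
    return math.exp(mu + 0.5 * sigma ** 2)


def pessimistic_score(mu, sigma, risk_weight=1.0):
    """Hypothetical score: a length estimate inflated by uncertainty.

    Lower is better, so a larger sigma pushes a request back in the queue.
    """
    return math.exp(mu + risk_weight * sigma)


def schedule_order(predictions, risk_weight=1.0):
    """Return request ids ordered from lowest to highest score."""
    heap = [
        ScoredRequest(pessimistic_score(mu, sigma, risk_weight), rid)
        for rid, (mu, sigma) in predictions.items()
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap).request_id for _ in range(len(heap))]


# Illustrative predictions: (mean, std) of the log output length.
predictions = {
    "a": (math.log(100), 0.1),  # short, confident
    "b": (math.log(100), 1.0),  # short on average, very uncertain
    "c": (math.log(400), 0.1),  # long, confident
}
print(schedule_order(predictions))                   # ['a', 'b', 'c']
print(schedule_order(predictions, risk_weight=2.0))  # ['a', 'c', 'b']
```

With a larger `risk_weight`, the uncertain request "b" drops behind the longer but more predictable request "c", which is one way to read "downweighting requests with high uncertainty".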

## Implementation and Deployment Details

- Implementation: The project is written in Python. Training code lives in the train directory, and the inference-side changes are integrated into the vllm/v1/core/sched directory. Configuration requires the predictor checkpoint, the pre-trained encoder, and the path to the training data CSV, which has three columns: prompt, logt_mu, and logt_sigma.
- Deployment: The server is started with start-server.sh, which takes the scheduling strategy, the GPU devices, and the model path. When using the ua strategy, reserve one GPU for the predictor and use the remaining GPUs for tensor-parallel inference, so that prediction overhead does not affect inference performance.
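
Before training, it can help to check that the CSV matches the three-column layout described above. The sketch below is a hypothetical validator, not part of the TIE repository. It assumes that logt_mu and logt_sigma hold the mean and standard deviation of the log output length, as the column names suggest, and the sample rows are invented.

```python
import csv
import io
import math

REQUIRED_COLUMNS = ("prompt", "logt_mu", "logt_sigma")


def load_training_rows(file_obj):
    """Parse (prompt, logt_mu, logt_sigma) rows and reject invalid targets."""
    reader = csv.DictReader(file_obj)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for row_no, row in enumerate(reader, start=1):
        mu = float(row["logt_mu"])
        sigma = float(row["logt_sigma"])
        # A standard deviation must be finite and non-negative.
        if not math.isfinite(mu) or not math.isfinite(sigma) or sigma < 0:
            raise ValueError(f"invalid target in data row {row_no}")
        rows.append((row["prompt"], mu, sigma))
    return rows


# Invented sample data, used only to exercise the parser.
sample = io.StringIO(
    "prompt,logt_mu,logt_sigma\n"
    '"Summarize this article.",4.6,0.35\n'
    '"Write a short story about a robot.",6.2,0.80\n'
)
rows = load_training_rows(sample)
print(len(rows))
```

For a real file, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.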

## Academic Contributions and Practical Significance

- Academic: TIE was accepted at ICML 2026. It brings a learned, uncertainty-aware prediction mechanism into the scheduler, a step toward more intelligent scheduling in LLM inference systems.
- Practical: Beyond scheduling, output length prediction can also support dynamic batch size adjustment, KV cache pre-allocation, estimation of user waiting time, and resource quota planning.
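
As one example of the practical uses listed above, a log-normal prediction gives a direct way to size a KV cache reservation from a quantile rather than from the mean. This is a sketch under stated assumptions and does not describe TIE or vLLM behavior: the function names are hypothetical, and `block_size=16` and `q=0.9` are illustrative defaults.

```python
import math
from statistics import NormalDist


def length_quantile(mu, sigma, q):
    """q-th quantile of a log-normal with log-mean mu and log-std sigma."""
    if sigma == 0:
        # Degenerate case: the prediction is a single point.
        return math.exp(mu)
    return math.exp(NormalDist(mu, sigma).inv_cdf(q))


def kv_blocks_to_reserve(mu, sigma, q=0.9, block_size=16):
    """Number of cache blocks that covers the q-th quantile of output tokens."""
    return math.ceil(length_quantile(mu, sigma, q) / block_size)


# Two requests with the same median length (100 tokens) but different uncertainty.
confident = kv_blocks_to_reserve(math.log(100), 0.1)
uncertain = kv_blocks_to_reserve(math.log(100), 1.0)
print(confident, uncertain)
```

The uncertain request reserves more blocks for the same median length, so uncertainty translates directly into memory headroom.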

## Usage Recommendations and Summary

- Usage Recommendations: TIE is best suited to teams that have enough historical request data to train the predictor. Before deploying, measure the extra compute the predictor consumes to confirm that the scheduling gains outweigh that cost.
- Summary: TIE brings uncertainty quantification into LLM inference scheduling to address the resource waste caused by variation in output length, and because it is built on vLLM it is straightforward to integrate. As LLM applications expand, research on system-level bottlenecks of this kind will become more important.
