# DriftSched: An Adaptive QoS-Aware Scheduling Framework for Multi-Tenant LLM Inference

> DriftSched is a scheduling framework that targets the token drift problem in multi-tenant large language model (LLM) inference. It uses an adaptive, QoS-aware mechanism to improve inference performance and resource utilization.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-02T21:38:08.000Z
- Last activity: 2026-06-02T21:51:28.613Z
- Popularity: 139.8
- Keywords: LLM inference, multi-tenant scheduling, QoS-aware, token drift, adaptive scheduling, GPU optimization, inference serving
- Page link: https://www.zingnex.cn/en/forum/thread/driftsched-llmqos
- Canonical: https://www.zingnex.cn/forum/thread/driftsched-llmqos

---

This article covers the problem background, the core architecture, the QoS-aware scheduling strategy, the evaluation setup, and where the framework may be useful.

## Problem Background: Challenges Posed by Token Drift

In multi-tenant LLM inference services, token drift is the phenomenon in which a request's actual token consumption deviates significantly from what was expected. Because LLMs generate output autoregressively, output length is hard to predict in advance. The result is inaccurate resource estimation, fluctuating GPU utilization, and degraded performance for high-priority tenants. Traditional FCFS (First-Come-First-Served) and simple priority strategies struggle to handle this dynamic uncertainty.
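
To make the definition concrete, the sketch below measures drift as the relative gap between expected and actual token counts and smooths it per tenant with an exponentially weighted moving average. The names `token_drift` and `DriftTracker`, the relative-error formula, and the smoothing factor `alpha` are all assumptions made for illustration; the source does not specify how DriftSched quantifies drift.

```python
def token_drift(expected_tokens: int, actual_tokens: int) -> float:
    """Relative deviation of actual token consumption from the expectation.

    Positive values mean the request used more tokens than expected.
    This formula is an illustrative assumption, not DriftSched's definition.
    """
    if expected_tokens <= 0:
        raise ValueError("expected_tokens must be positive")
    return (actual_tokens - expected_tokens) / expected_tokens


class DriftTracker:
    """Hypothetical per-tenant tracker of the smoothed absolute drift."""

    def __init__(self, alpha: float = 0.2) -> None:
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.rates = {}  # tenant name -> smoothed absolute drift

    def observe(self, tenant: str, expected_tokens: int, actual_tokens: int) -> float:
        """Fold one finished request into the tenant's drift rate."""
        drift = abs(token_drift(expected_tokens, actual_tokens))
        previous = self.rates.get(tenant)
        if previous is None:
            updated = drift  # first observation seeds the average
        else:
            updated = self.alpha * drift + (1 - self.alpha) * previous
        self.rates[tenant] = updated
        return updated
```

A scheduler could call `observe` whenever a request finishes and treat a rising value as a signal that the tenant's estimates are becoming unreliable.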

## Core Architecture Components of DriftSched

DriftSched consists of four key components:
1. Adaptive Token Estimator: Dynamically predicts token consumption by combining historical requests, system load, and prompt features;
2. Multi-level Priority Scheduling Queue: Supports a priority-aging mechanism to balance fairness against QoS guarantees;
3. GPU Inference Worker: Encapsulates the inference execution logic and optimizes memory management and batch processing;
4. API Gateway: Provides a unified entry point for requests and handles traffic shaping, authentication, and routing.
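
The priority-aging idea in the second component can be sketched as follows. This is a minimal illustration under stated assumptions, not DriftSched's implementation: the class `AgingPriorityQueue`, the rule "promote one level per `aging_interval` seconds of waiting", and the linear scan in `pop` are all hypothetical choices made to keep the example short.

```python
import itertools
from dataclasses import dataclass


@dataclass
class QueuedRequest:
    request_id: str
    level: int          # 0 is the highest priority level
    enqueued_at: float  # enqueue timestamp in seconds
    sequence: int       # arrival order, used to break ties


class AgingPriorityQueue:
    """Multi-level queue in which waiting requests are gradually promoted."""

    def __init__(self, aging_interval: float = 10.0) -> None:
        if aging_interval <= 0:
            raise ValueError("aging_interval must be positive")
        self.aging_interval = aging_interval
        self._items = []
        self._counter = itertools.count()

    def push(self, request_id: str, level: int, now: float) -> None:
        self._items.append(
            QueuedRequest(request_id, level, now, next(self._counter))
        )

    def effective_level(self, item: QueuedRequest, now: float) -> int:
        """Base level minus one promotion per full aging interval waited."""
        promotions = int((now - item.enqueued_at) // self.aging_interval)
        return max(0, item.level - promotions)

    def pop(self, now: float) -> str:
        """Return the request with the best effective level; ties go to the earliest arrival."""
        if not self._items:
            raise IndexError("pop from an empty queue")
        best = min(
            self._items,
            key=lambda item: (self.effective_level(item, now), item.sequence),
        )
        self._items.remove(best)
        return best.request_id
```

With `aging_interval=10`, a level-2 request that has waited 25 seconds is treated as level 0, so it is no longer starved by a steady stream of high-priority arrivals. The linear scan keeps the sketch simple; a production queue would likely use per-level deques or a heap that is re-keyed periodically.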

## Innovations of QoS-Aware Scheduling Strategy

DriftSched feeds QoS metrics (latency, SLA achievement rate, and resource usage patterns) into its scheduling decisions and adjusts scheduling weights dynamically. When a tenant's token drift rate rises abnormally, the scheduler automatically allocates more elastic resources to that tenant or adjusts its queue position to compensate for the added latency. Unlike static isolation schemes, it achieves fine-grained resource management through a feedback-driven adaptive strategy.
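
One way to picture this feedback loop is a per-tenant weight that is boosted while the tenant misses its SLA target or shows high drift, and that relaxes toward a baseline of 1.0 otherwise. The function `adjust_weight`, its thresholds, and the update rule are assumptions chosen for illustration; the source does not describe the actual control law.

```python
def adjust_weight(
    weight: float,
    sla_attainment: float,
    drift_rate: float,
    *,
    sla_target: float = 0.95,
    drift_threshold: float = 0.5,
    step: float = 0.1,
    min_weight: float = 0.5,
    max_weight: float = 4.0,
) -> float:
    """Return an updated scheduling weight for one tenant (hypothetical rule).

    sla_attainment: fraction of recent requests that met their SLA (0..1).
    drift_rate: smoothed relative token drift observed for the tenant.
    """
    if sla_attainment < sla_target or drift_rate > drift_threshold:
        # Tenant is under-served or unpredictable: boost its share.
        weight *= 1 + step
    else:
        # Tenant is healthy: move the weight a fraction of the way back to 1.0.
        weight += step * (1.0 - weight)
    # Clamp so that a single tenant can neither starve nor monopolize the GPU.
    return min(max_weight, max(min_weight, weight))
```

Calling this once per control interval for each tenant yields weights that a weighted-fair or priority scheduler could consume. The clamp bounds are a design choice that limits how far compensation for one tenant can affect the others.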

## Experiment and Evaluation Framework

The project provides a `run_experiment.sh` script for comparing the performance of different scheduling strategies, and a `prompts_dataset.py` module that generates or loads test prompt datasets so that evaluations can reflect realistic workload characteristics.
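
The kind of comparison such an experiment performs can be illustrated with a toy single-worker simulation. This is not the project's script and does not use its dataset module: the job fields `estimated_tokens` and `actual_tokens`, the two strategies, and the use of token counts as a proxy for service time are all assumptions.

```python
def mean_completion_time(service_times):
    """Mean completion time when jobs run back to back in the given order."""
    if not service_times:
        raise ValueError("need at least one job")
    elapsed = 0.0
    total = 0.0
    for service in service_times:
        elapsed += service  # this job finishes after all earlier ones
        total += elapsed
    return total / len(service_times)


def fcfs(jobs):
    """First-come-first-served: keep the arrival order."""
    return list(jobs)


def shortest_estimate_first(jobs):
    """Run the job with the smallest estimated token count first."""
    return sorted(jobs, key=lambda job: job["estimated_tokens"])


def evaluate(jobs, strategy):
    """Order jobs by estimate, then charge each one its actual token count."""
    ordered = strategy(jobs)
    return mean_completion_time([job["actual_tokens"] for job in ordered])


# Synthetic workload in which estimates and actual counts disagree (token drift).
jobs = [
    {"estimated_tokens": 400, "actual_tokens": 500},
    {"estimated_tokens": 100, "actual_tokens": 80},
    {"estimated_tokens": 200, "actual_tokens": 300},
]

for name, strategy in [("fcfs", fcfs), ("shortest_estimate_first", shortest_estimate_first)]:
    print(name, round(evaluate(jobs, strategy), 2))
```

Because ordering uses the estimate while cost uses the actual count, larger drift makes estimate-based strategies less reliable, which is the effect a real evaluation would need to measure on representative prompts.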

## Technical Value and Application Prospects

DriftSched offers a systematic approach to resource scheduling for LLM inference and can serve as a reference implementation for the scheduling layer of an enterprise's internal LLM platform. Its QoS-aware approach could also be extended to heterogeneous computing environments. As enterprise LLM deployments grow, frameworks of this kind become increasingly important for guaranteeing multi-tenant QoS, and they reflect a shift in LLM infrastructure from "being able to run" to "running well".
