Zing Forum

Reading

Clairvoyant: Mitigating Head-of-Line Blocking in Serial LLM Backends via Predictive SJF Scheduling

Clairvoyant is a plug-and-play proxy for serial LLM backends. It implements predictive Shortest Job First (SJF) scheduling by predicting response lengths using an XGBoost classifier, reducing latency for short requests by 70-76% under high load.

LLM推理调度队首阻塞最短作业优先响应长度预测边缘部署
Published 2026-06-05 21:19Recent activity 2026-06-08 11:30Estimated read 6 min
Clairvoyant: Mitigating Head-of-Line Blocking in Serial LLM Backends via Predictive SJF Scheduling
1

Section 01

Clairvoyant: Mitigating Head-of-Line Blocking in Serial LLM Backends via Predictive SJF Scheduling (Introduction)

Clairvoyant: Mitigating Head-of-Line Blocking in Serial LLM Backends via Predictive SJF Scheduling (Introduction)

Clairvoyant is a plug-and-play proxy for serial LLM backends (such as Ollama, llama.cpp). It implements predictive Shortest Job First (SJF) scheduling by predicting response lengths using an XGBoost classifier, solving the head-of-line blocking problem under high load and reducing latency for short requests by 70-76% in high-load scenarios. Original Author/Maintainer: Clairvoyant Research Team Source: arXiv (published on June 5, 2026, link: http://arxiv.org/abs/2606.07248v1)

2

Section 02

Problem Background and Limitations of Existing Solutions

Problem Background and Limitations of Existing Solutions

Serial LLM backends (like Ollama, llama.cpp) use First-Come-First-Served (FCFS) scheduling, which works well under light load but causes severe head-of-line blocking in high-load mixed workloads: short requests have to wait for long text generation tasks. Limitations of existing solutions:

  • Continuous batching (e.g., vLLM) requires large VRAM to store KV caches, making it unsuitable for edge devices;
  • Preemptive scheduling needs complex context saving and restoration mechanisms;
  • Heuristic classification (e.g., estimating output from input length) lacks sufficient accuracy.
3

Section 03

Core Methods of Clairvoyant

Core Methods of Clairvoyant

  1. Predictive SJF Scheduling: Predict output length based on input features and prioritize short tasks (relies on relative ordering, no need for absolute precision);
  2. Lightweight Feature Extraction: 19 lexical features (input length statistics, language features, template structure, etc.), extracted in microseconds;
  3. XGBoost Classifier: Efficient gradient boosting tree, exported to ONNX format, with a prediction latency of only 0.029 milliseconds (negligible).
4

Section 04

Key Finding: Importance of Natural Conversation Data

Key Finding: Importance of Natural Conversation Data

  • Instruction Dataset Degradation: Due to conciseness constraints, the proportion of long responses is extremely low (<0.02%), and class imbalance prevents the model from effectively distinguishing between long and short requests;
  • Value of Natural Conversation Logs: Real user conversation records have a balanced distribution of long and short requests, making them an effective training data source.
5

Section 05

Experimental Evaluation Results

Experimental Evaluation Results

  • Prediction Accuracy: 62-96% accuracy on in-distribution test sets, 52-66% on cross-distribution test sets (demonstrates generalization ability);
  • End-to-End Performance: On RTX4090, P50 latency for short requests is reduced by 70-76% under 100 concurrent requests, and by 17% under steady load (ρ=0.74).
6

Section 06

Deployment and Usage Features

Deployment and Usage Features

  • Plug-and-Play: Independent proxy service, no need to modify the underlying inference backend, compatible with OpenAI API;
  • Open-Source: Supports free use, modification, and extension;
  • Low Resource Requirements: Lightweight prediction model, can run on the same machine as the backend or on lightweight instances.
7

Section 07

Summary and Future Directions

Summary and Future Directions

Summary: Clairvoyant implements SJF scheduling via lightweight response length prediction, effectively mitigating head-of-line blocking in serial LLM backends and having important practical value for edge deployment; Limitations: Prediction relies on lexical features (difficult to capture complex semantics), scheduling strategy is simple (no consideration of priorities/user levels), and multi-backend support is limited; Future Directions: Optimize the prediction model, explore complex scheduling strategies, and expand the scope of backend support.