# Clairvoyant: Mitigating Head-of-Line Blocking in Serial LLM Backends via Predictive SJF Scheduling

> Clairvoyant is a plug-and-play proxy for serial LLM backends. It implements predictive Shortest Job First (SJF) scheduling by predicting response lengths using an XGBoost classifier, reducing latency for short requests by 70-76% under high load.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-05T13:19:05.000Z
- 最近活动: 2026-06-08T03:30:38.047Z
- 热度: 82.8
- 关键词: LLM推理调度, 队首阻塞, 最短作业优先, 响应长度预测, 边缘部署
- 页面链接: https://www.zingnex.cn/en/forum/thread/clairvoyant-sjfllm
- Canonical: https://www.zingnex.cn/forum/thread/clairvoyant-sjfllm
- Markdown 来源: floors_fallback

---

## Clairvoyant: Mitigating Head-of-Line Blocking in Serial LLM Backends via Predictive SJF Scheduling (Introduction)

# Clairvoyant: Mitigating Head-of-Line Blocking in Serial LLM Backends via Predictive SJF Scheduling (Introduction)
Clairvoyant is a plug-and-play proxy for serial LLM backends (such as Ollama, llama.cpp). It implements predictive Shortest Job First (SJF) scheduling by predicting response lengths using an XGBoost classifier, solving the head-of-line blocking problem under high load and reducing latency for short requests by 70-76% in high-load scenarios.
**Original Author/Maintainer**: Clairvoyant Research Team
**Source**: arXiv (published on June 5, 2026, link: http://arxiv.org/abs/2606.07248v1)

## Problem Background and Limitations of Existing Solutions

## Problem Background and Limitations of Existing Solutions
Serial LLM backends (like Ollama, llama.cpp) use First-Come-First-Served (FCFS) scheduling, which works well under light load but causes severe head-of-line blocking in high-load mixed workloads: short requests have to wait for long text generation tasks.
Limitations of existing solutions:
- Continuous batching (e.g., vLLM) requires large VRAM to store KV caches, making it unsuitable for edge devices;
- Preemptive scheduling needs complex context saving and restoration mechanisms;
- Heuristic classification (e.g., estimating output from input length) lacks sufficient accuracy.

## Core Methods of Clairvoyant

## Core Methods of Clairvoyant
1. **Predictive SJF Scheduling**: Predict output length based on input features and prioritize short tasks (relies on relative ordering, no need for absolute precision);
2. **Lightweight Feature Extraction**: 19 lexical features (input length statistics, language features, template structure, etc.), extracted in microseconds;
3. **XGBoost Classifier**: Efficient gradient boosting tree, exported to ONNX format, with a prediction latency of only 0.029 milliseconds (negligible).

## Key Finding: Importance of Natural Conversation Data

## Key Finding: Importance of Natural Conversation Data
- **Instruction Dataset Degradation**: Due to conciseness constraints, the proportion of long responses is extremely low (<0.02%), and class imbalance prevents the model from effectively distinguishing between long and short requests;
- **Value of Natural Conversation Logs**: Real user conversation records have a balanced distribution of long and short requests, making them an effective training data source.

## Experimental Evaluation Results

## Experimental Evaluation Results
- **Prediction Accuracy**: 62-96% accuracy on in-distribution test sets, 52-66% on cross-distribution test sets (demonstrates generalization ability);
- **End-to-End Performance**: On RTX4090, P50 latency for short requests is reduced by 70-76% under 100 concurrent requests, and by 17% under steady load (ρ=0.74).

## Deployment and Usage Features

## Deployment and Usage Features
- **Plug-and-Play**: Independent proxy service, no need to modify the underlying inference backend, compatible with OpenAI API;
- **Open-Source**: Supports free use, modification, and extension;
- **Low Resource Requirements**: Lightweight prediction model, can run on the same machine as the backend or on lightweight instances.

## Summary and Future Directions

## Summary and Future Directions
**Summary**: Clairvoyant implements SJF scheduling via lightweight response length prediction, effectively mitigating head-of-line blocking in serial LLM backends and having important practical value for edge deployment;
**Limitations**: Prediction relies on lexical features (difficult to capture complex semantics), scheduling strategy is simple (no consideration of priorities/user levels), and multi-backend support is limited;
**Future Directions**: Optimize the prediction model, explore complex scheduling strategies, and expand the scope of backend support.