# Taming the Unschedulable: A New Paradigm for Client-Side Scheduling of Black-Box LLM Inference

> This paper proposes a three-layer client-side scheduling architecture that schedules black-box LLM API requests using coarse-grained token prediction. It reports a 100% completion rate and a 100% deadline satisfaction rate in balanced and high-congestion scenarios, without requiring any knowledge of the provider's internal mechanisms.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-08T11:41:21.000Z
- Last activity: 2026-04-09T02:46:40.036Z
- Popularity: 144.9
- Keywords: large language models, inference scheduling, black-box API, token prediction, load balancing, SLO, client-side optimization, system architecture
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-06970v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-06970v1

---

## Introduction: Taming Black-Box LLM Inference with a New Client-Side Scheduling Paradigm

This paper tackles the problem of scheduling requests to black-box LLM APIs from the client side. It proposes a three-layer scheduling architecture that uses coarse-grained token prediction to make "semi-omniscient" decisions, balancing fairness and robustness without any knowledge of the provider's internal mechanisms. In balanced and high-congestion scenarios, it reports a 100% completion rate and a 100% deadline satisfaction rate.

## Background: Scheduling Challenges of Black-Box LLM APIs

As LLM API services have become widespread, users increasingly rely on third-party inference services whose internal mechanisms are completely invisible to them. Traditional scheduling depends on internal system state such as queue length and resource utilization, and none of that state is observable in a black-box setting. The core question is therefore: can efficient scheduling be achieved without knowing the system's internal details?

## Semi-Omniscient Scheduling: A Breakthrough Enabled by Token Prediction

Recent studies have found that the number of output tokens can be predicted at submission time, which makes client-side scheduling "semi-omniscient". Even without knowing the provider's internal queue status, the client can make scheduling decisions from a coarse-grained prior on each request's token count, which serves as a proxy for its workload. A restaurant offers a useful analogy: although you cannot see the kitchen, you can estimate the preparation time of each dish and order the service accordingly.
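
As a rough illustration of what a coarse-grained token prior might look like, the sketch below maps a raw output-length prediction to a bucket and then to an approximate service time. The bucket boundaries, the representative token counts, the decode rate, and the fixed overhead are all hypothetical values chosen for the example; none of them come from the paper.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries (output tokens) and labels; not from the paper.
BUCKET_EDGES = [128, 512, 2048]
BUCKET_LABELS = ["short", "medium", "long", "xlong"]
# Hypothetical representative token count used for each bucket.
BUCKET_TOKENS = {"short": 64, "medium": 256, "long": 1024, "xlong": 4096}


def to_bucket(predicted_tokens: int) -> str:
    """Map a raw output-length prediction to a coarse bucket label."""
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES, predicted_tokens)]


def estimated_seconds(bucket: str, tokens_per_second: float = 50.0,
                      overhead_s: float = 0.3) -> float:
    """Rough service-time estimate: fixed overhead plus decode time.

    tokens_per_second and overhead_s are assumed values that a client
    would have to measure against its own provider.
    """
    return overhead_s + BUCKET_TOKENS[bucket] / tokens_per_second


if __name__ == "__main__":
    for guess in (40, 300, 5000):
        bucket = to_bucket(guess)
        print(guess, bucket, round(estimated_seconds(bucket), 2))
```

The scheduler only ever sees the bucket, so a prediction that is off by a modest factor often lands in the same bucket and leaves the decision unchanged.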

## Three-Layer Architecture: The Key to Decoupling Scheduling Problems

The authors decompose scheduling into three independent optimization layers:
1. **Allocation Layer**: Uses an adaptive Deficit Round Robin (DRR) algorithm to share capacity among request categories, adjusting each category's quota dynamically.
2. **Sorting Layer**: Uses feasible-set scoring within each category to prioritize the requests most likely to meet their SLOs, much like emergency triage.
3. **Overload Control Layer**: Makes admit, delay, or reject decisions based on a cost ladder, so that accepting new work does not worsen an overload.
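
The first two layers can be sketched together as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the quantum is fixed rather than adaptive, the DRR cost is the predicted token count, and the "feasible-set scoring" is approximated by a hypothetical least-slack-first rule that places requests unable to meet their deadline last. The names `Request`, `ClassQueue`, `pick_feasible`, and `drr_round` are invented for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    rid: str
    est_tokens: int      # coarse predicted output tokens (the DRR cost)
    est_seconds: float   # estimated service time
    deadline: float      # absolute deadline, in seconds


@dataclass
class ClassQueue:
    quantum: int                     # tokens credited per round (assumed value)
    deficit: int = 0                 # unused credit carried between rounds
    items: deque = field(default_factory=deque)


def pick_feasible(items, now):
    """Sorting layer (assumed rule): feasible requests first, least slack first."""
    def key(r):
        slack = r.deadline - (now + r.est_seconds)
        return (slack < 0, slack)    # infeasible requests sort last
    return min(items, key=key)


def drr_round(classes, now):
    """Allocation layer: one Deficit Round Robin pass over all classes.

    Returns the ids of the requests dispatched in this round, in order.
    """
    dispatched = []
    for q in classes.values():
        if not q.items:
            q.deficit = 0            # idle classes do not bank credit
            continue
        q.deficit += q.quantum
        while q.items:
            r = pick_feasible(q.items, now)
            if r.est_tokens > q.deficit:
                break                # not enough credit yet; wait for next round
            q.items.remove(r)
            q.deficit -= r.est_tokens
            dispatched.append(r.rid)
        if not q.items:
            q.deficit = 0
    return dispatched
```

Because a long request must accumulate credit over several rounds, short requests in another category keep flowing while it waits. An adaptive variant would adjust `quantum` per category over time; that part is omitted here.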

## Experimental Evidence: Coarse-Grained Prediction and Performance

- **Information Ladder Experiment**: Removing token volume information inflates the P95 latency of short requests by 5.8x and causes the deadline satisfaction rate to drop significantly. Category labels alone are not enough; volume information is the key signal.
- **Performance**: In balanced and high-congestion scenarios, the scheduler achieves a 100% completion rate and a 100% deadline satisfaction rate, with an effective throughput of 4.2 ± 1.6 SLO-satisfying requests per second. The P95 latency of short requests differs from that of quota-layered isolation by only tens of milliseconds.
- **Robustness**: Even with a 60% multiplicative prediction error, the system degrades gracefully, which indicates that it is not highly sensitive to prediction quality.
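
One intuition for this robustness is that multiplicative noise rarely changes the relative order of requests whose lengths differ by a large factor. The sketch below illustrates this with a hypothetical error model, a factor drawn uniformly from [1 - e, 1 + e]; the paper's exact error model is not described in this summary, so this model is an assumption.

```python
import random


def perturb(true_tokens: int, max_error: float, rng: random.Random) -> int:
    """Apply a multiplicative error drawn uniformly from [1 - e, 1 + e]."""
    factor = 1.0 + rng.uniform(-max_error, max_error)
    return max(1, round(true_tokens * factor))


def ordering_preserved(short_tokens: int, long_tokens: int, max_error: float,
                       trials: int = 1000, seed: int = 0) -> float:
    """Fraction of trials in which the noisy predictions keep short < long."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(trials):
        s = perturb(short_tokens, max_error, rng)
        l = perturb(long_tokens, max_error, rng)
        kept += s < l
    return kept / trials


if __name__ == "__main__":
    print(ordering_preserved(64, 1024, max_error=0.6))
    print(ordering_preserved(800, 1024, max_error=0.6))
```

With true lengths of 64 and 1024 tokens and a 60% error bound, the worst case is roughly 102 versus 410 tokens, so the order never flips. Requests of similar length can swap, but swapping two similar requests costs the schedule little.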

## Fairness and Heavy Load Scenario Analysis

- **Fairness Trade-off**: Fair Queuing improves the P90 latency of short requests by 32% at the cost of a 17% increase in overhead for long requests. This is a more balanced trade-off than Short-Priority, which improves short requests by 27% but increases overhead for long requests by 116%. The allocation layer can be adapted to different fairness goals.
- **Heavy Load Scenarios**: Strategies differ along three dimensions: completion rate, tail latency, and explainable throttling. The cost ladder method makes overload decisions explainable, so users can understand why a request was delayed or rejected.
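
To make the idea of explainable overload decisions concrete, here is a minimal sketch in which every decision carries a human-readable reason. The rungs of the ladder, the `max_defer_s` budget, and the names `Decision` and `overload_decision` are hypothetical; the summary does not specify how the paper's cost ladder is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str   # "admit", "defer", or "reject"
    reason: str   # human-readable explanation returned to the caller


def overload_decision(backlog_s: float, est_seconds: float,
                      slack_s: float, max_defer_s: float = 30.0) -> Decision:
    """Walk a three-rung ladder from the cheapest action to the costliest.

    backlog_s   : estimated seconds of work already queued on the client
    est_seconds : estimated service time of the new request
    slack_s     : seconds remaining until the request's deadline
    max_defer_s : assumed tolerance for lateness before rejecting
    """
    finish_in = backlog_s + est_seconds
    late_by = finish_in - slack_s
    if late_by <= 0:
        return Decision(
            "admit",
            f"estimated finish in {finish_in:.1f}s fits the {slack_s:.1f}s deadline",
        )
    if late_by <= max_defer_s:
        return Decision(
            "defer",
            f"estimated {late_by:.1f}s late; within the {max_defer_s:.1f}s deferral budget",
        )
    return Decision(
        "reject",
        f"estimated {late_by:.1f}s late; exceeds the {max_defer_s:.1f}s deferral budget",
    )


if __name__ == "__main__":
    for backlog in (2.0, 10.0, 100.0):
        print(overload_decision(backlog, est_seconds=1.0, slack_s=5.0))
```

Returning the reason alongside the action is what lets a caller see which rung triggered the throttle, rather than receiving an opaque failure.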

## Practical Insights: Key Principles for Client-Side Scheduling

1. **Invest in Token Prediction First**: Even rough output-length predictions can yield significant scheduling benefits.
2. **Value of a Layered Architecture**: Separating allocation, sorting, and overload control makes the system easier to understand and tune.
3. **Configurable Fairness**: Choose the allocation strategy based on business needs rather than a one-size-fits-all approach.
4. **Robustness First**: Account for prediction errors at design time so that the system degrades gracefully under imperfect conditions.

## Conclusion: The Shift from Unschedulable to Controllable

This study shows that efficient, fair, and explainable scheduling is achievable in a black-box LLM API environment by combining semi-omniscient token prediction with a three-layer architecture. Beyond the technical result, it reflects a shift in mindset: use the information that is available to find good solutions under constraints. As LLM services become more widespread, client-side scheduling intelligence will become increasingly important.
