Zing Forum


Taming the Unschedulable: A New Paradigm for Client-Side Scheduling of Black-Box LLM Inference

This paper proposes a three-layer client-side scheduling architecture for black-box LLM APIs. Using only coarse-grained predictions of output token counts, and no knowledge of the provider's internal mechanisms, it reports a 100% completion rate and a 100% deadline satisfaction rate in the balanced and high-congestion scenarios tested.

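Tags: Large language models, Inference scheduling, Black-box API, Token prediction, Load balancing, SLO, Client-side optimization, System architecture
Published 2026-04-08 19:41 | Recent activity 2026-04-09 10:46 | Estimated read 7 min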

Section 01

Introduction: Taming Black-Box LLM Inference with a New Client-Side Scheduling Paradigm

This paper addresses the difficulty of scheduling requests to black-box LLM APIs by proposing a three-layer client-side scheduling architecture. Coarse-grained token prediction makes the client "semi-omniscient": it cannot see the provider's internals, but it can estimate the size of each request. In the reported experiments the scheduler reaches a 100% completion rate and a 100% deadline satisfaction rate without knowing the provider's internal mechanisms, while also balancing fairness and robustness.


Section 02

Background: Scheduling Challenges of Black-Box LLM APIs

As LLM API services have become widespread, users increasingly rely on third-party inference services whose internal mechanisms are completely invisible to them. Traditional scheduling depends on internal system state such as queue length and resource utilization, and none of that is available in the black-box setting. The core question: can scheduling be made efficient without knowing the system's internal details?


Section 03

Semi-Omniscient Scheduling: A Breakthrough Enabled by Token Prediction

Recent studies have found that the number of output tokens can be predicted at submission time, which moves client-side scheduling into a semi-omniscient regime. Even without knowing the provider's internal queue state, the client can make scheduling decisions from a coarse-grained token prior, that is, a rough estimate of each request's workload. A restaurant analogy: you cannot see the kitchen, but if you can predict how long each dish takes to prepare, you can still optimize the order of service.
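
As a rough illustration of how a coarse token prior could feed a scheduling decision, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the bucket boundaries, the two calibration constants, and the names `bucket_of`, `estimate_service_seconds`, and `shortest_first`.

```python
from dataclasses import dataclass

# Hypothetical coarse buckets: (label, upper bound on predicted output tokens).
BUCKETS = [("short", 128), ("medium", 512), ("long", 2048)]

# Assumed calibration constants; a real client would measure these itself.
SECONDS_PER_OUTPUT_TOKEN = 0.02
BASE_OVERHEAD_SECONDS = 0.3


@dataclass
class Request:
    request_id: str
    predicted_output_tokens: int


def bucket_of(predicted_tokens: int) -> str:
    """Map a predicted output length to a coarse bucket label."""
    for label, upper in BUCKETS:
        if predicted_tokens <= upper:
            return label
    return "xlong"


def estimate_service_seconds(predicted_tokens: int) -> float:
    """Estimate service time as fixed overhead plus a per-token decode cost."""
    return BASE_OVERHEAD_SECONDS + SECONDS_PER_OUTPUT_TOKEN * predicted_tokens


def shortest_first(requests):
    """Order requests by estimated service time, shortest first."""
    return sorted(
        requests,
        key=lambda r: estimate_service_seconds(r.predicted_output_tokens),
    )
```

The point of the sketch is only that the client never inspects the provider: the ordering is driven entirely by the predicted output length attached to each request.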


Section 04

Three-Layer Architecture: The Key to Decoupling Scheduling Problems

The authors decompose the scheduling problem into three independently optimized layers:

  1. Allocation Layer: Uses an adaptive Deficit Round Robin (DRR) algorithm to handle resource sharing among different request categories and dynamically adjust quotas.
  2. Sorting Layer: Uses a feasible set scoring method within categories to prioritize requests most likely to meet SLOs (similar to emergency triage).
  3. Overload Control Layer: Makes admission/delay/rejection decisions based on cost ladders to avoid worsening system overload.
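
To make the allocation layer more concrete, the following is a minimal sketch of plain Deficit Round Robin in Python, with each request's cost taken to be its predicted output token count. It is an illustration under stated assumptions, not the paper's implementation: the adaptive quota adjustment is omitted, and the class name `DeficitRoundRobin`, its methods, and the quantum values are hypothetical.

```python
from collections import deque


class DeficitRoundRobin:
    """Plain DRR over request classes; cost is the predicted output token count."""

    def __init__(self, quanta):
        # quanta: class name -> token credit added per round (assumed values).
        self.quanta = dict(quanta)
        self.queues = {name: deque() for name in quanta}
        self.deficit = {name: 0 for name in quanta}

    def submit(self, class_name, request_id, predicted_tokens):
        """Enqueue a request in its class queue with its predicted cost."""
        self.queues[class_name].append((request_id, predicted_tokens))

    def next_round(self):
        """Run one DRR round and return the released request ids in order."""
        released = []
        for name, queue in self.queues.items():
            if not queue:
                self.deficit[name] = 0  # idle classes do not bank credit
                continue
            self.deficit[name] += self.quanta[name]
            # Release head-of-line requests while the class can afford them.
            while queue and queue[0][1] <= self.deficit[name]:
                request_id, cost = queue.popleft()
                self.deficit[name] -= cost
                released.append(request_id)
            if not queue:
                self.deficit[name] = 0
        return released
```

With equal quanta of 200 tokens, three 100-token short requests and one 500-token long request are released over three rounds: two short requests first, then the third, then the long request once its class has accumulated enough credit. An adaptive variant would adjust `quanta` between rounds; how the paper does that is not described here.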

Section 05

Experimental Evidence: Coarse-Grained Prediction and Performance

  • Information Ladder Experiment: Removing token volume information inflates the P95 latency of short requests by 5.8x and significantly lowers the deadline satisfaction rate. Category labels alone are not enough; the volume information is what matters.
  • Performance: In balanced and high-congestion scenarios, the scheduler achieves a 100% completion rate and a 100% deadline satisfaction rate, with an effective throughput of 4.2±1.6 SLO-meeting requests per second. The P95 latency of short requests differs from that of quota-layered isolation by only tens of milliseconds.
  • Robustness: Even with a 60% multiplicative prediction error, the system degrades gracefully rather than failing abruptly, which indicates low sensitivity to prediction quality.
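
A robustness check of this kind can be sketched in a few lines of Python. The error model below is an assumption, since the summary does not define "60% multiplicative error": each prediction is the true length scaled by a factor drawn uniformly from [1 - error, 1 + error]. The function names `perturb_prediction` and `ordering_agreement` are hypothetical.

```python
import random


def perturb_prediction(true_tokens, error=0.6, rng=None):
    """Scale true_tokens by a factor drawn uniformly from [1 - error, 1 + error]."""
    rng = rng or random
    factor = rng.uniform(1.0 - error, 1.0 + error)
    return max(1, round(true_tokens * factor))


def ordering_agreement(true_lengths, error=0.6, seed=0):
    """Fraction of request pairs whose relative order survives the noise."""
    rng = random.Random(seed)
    noisy = [perturb_prediction(t, error, rng) for t in true_lengths]
    pairs = 0
    agree = 0
    for i in range(len(true_lengths)):
        for j in range(i + 1, len(true_lengths)):
            ti, tj = true_lengths[i], true_lengths[j]
            if ti == tj:
                continue  # pairs with equal true length carry no ordering
            pairs += 1
            if (ti < tj and noisy[i] < noisy[j]) or (ti > tj and noisy[i] > noisy[j]):
                agree += 1
    return agree / pairs if pairs else 1.0
```

Under this assumed model, a 50-token request and a 2000-token request can never swap order at 60% error, because their noisy ranges (20 to 80 and 800 to 3200) do not overlap. This is offered as an illustration of how such an experiment might be set up, not as the paper's explanation of its result.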

Section 06

Fairness and Heavy Load Scenario Analysis

  • Fairness Trade-off: Fair Queuing improves the P90 latency of short requests by 32% at the cost of a 17% increase in overhead for long requests. Short-Priority improves short requests by 27% but increases overhead for long requests by 116%, so Fair Queuing is the more balanced of the two. The allocation layer can be configured for different fairness goals.
  • Heavy Load Scenarios: Strategies differ along three dimensions: completion rate, tail latency, and explainable throttling. The cost ladder method makes overload decisions explainable, so users can see why a request was delayed or rejected.
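
One way to picture an explainable overload decision is the Python sketch below. The summary does not specify how the cost ladder is constructed, so the structure here is an assumption: a list of load thresholds, each with a maximum predicted-token cost that is still admitted at that load, plus a hard reject threshold. `LADDER`, `REJECT_LOAD`, `Decision`, and `decide` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str  # "admit", "delay", or "reject"
    reason: str  # human-readable explanation returned to the caller


# Hypothetical ladder: (load threshold, max predicted tokens admitted at that load).
# Load is the client's own estimate in [0, 1], e.g. in-flight tokens / token budget.
LADDER = [(0.0, 4096), (0.7, 1024), (0.9, 256)]
REJECT_LOAD = 0.98


def decide(load, predicted_tokens):
    """Return an admit/delay/reject decision together with its reason."""
    if load >= REJECT_LOAD:
        return Decision(
            "reject", f"load {load:.2f} >= reject threshold {REJECT_LOAD:.2f}"
        )
    # Find the highest rung whose threshold the current load has reached.
    rung, limit = LADDER[0]
    for threshold, max_tokens in LADDER:
        if load >= threshold:
            rung, limit = threshold, max_tokens
    if predicted_tokens <= limit:
        return Decision(
            "admit", f"cost {predicted_tokens} <= limit {limit} at rung {rung:.2f}"
        )
    return Decision(
        "delay", f"cost {predicted_tokens} > limit {limit} at rung {rung:.2f}"
    )
```

Because every decision carries the rung and limit that triggered it, a caller whose 2000-token request is delayed at load 0.8 can see that the limit at that rung is 1024 tokens, which is the kind of explanation the heavy-load analysis refers to.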

Section 07

Practical Insights: Key Principles for Client-Side Scheduling

  1. Core Investment in Token Prediction: Even rough output length predictions can bring significant scheduling benefits.
  2. Value of Layered Architecture: Separating allocation, sorting, and overload control makes the scheduler easier to understand and tune.
  3. Configurable Fairness: Choose allocation strategies based on business needs instead of a one-size-fits-all approach.
  4. Robustness First: Account for prediction errors at design time so that the system degrades gracefully under imperfect conditions.

Section 08

Conclusion: The Shift from Unschedulable to Controllable

This study shows that efficient, fair, and explainable scheduling is achievable in a black-box LLM API environment by combining semi-omniscient token prediction with a three-layer architecture. Beyond the technical result, it suggests a shift in mindset: use whatever information is available to find good solutions under constraints. As LLM services become more widely used, client-side scheduling intelligence is likely to matter more.