Section 01
Introduction: Taming Black-Box LLM Inference with a New Client-Side Scheduling Paradigm
This paper addresses the challenge of scheduling requests to black-box LLM APIs, where the client cannot observe the provider's internal mechanisms. It proposes a three-layer client-side scheduling architecture that achieves semi-omniscient scheduling through coarse-grained token prediction. Without any knowledge of the provider's internals, the approach is reported to attain a 100% completion rate and deadline satisfaction rate while balancing fairness and robustness.
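To make the idea of scheduling from coarse-grained token predictions concrete, the following is a minimal sketch and not the paper's architecture. It assumes a toy predictor that sorts prompts into three hypothetical buckets by word count, an assumed provider throughput, and a least-slack-first ordering chosen here purely for illustration. The names `BUCKET_TOKENS`, `ASSUMED_TOKENS_PER_SECOND`, `predict_bucket`, `estimated_slack`, and `dispatch_order` are all hypothetical.

```python
import heapq

# Assumed values for illustration only; none of these come from the paper.
BUCKET_TOKENS = {"short": 128, "medium": 512, "long": 2048}
ASSUMED_TOKENS_PER_SECOND = 64.0


def predict_bucket(prompt):
    """Toy coarse-grained predictor: bucket a request by prompt word count."""
    words = len(prompt.split())
    if words < 20:
        return "short"
    if words < 200:
        return "medium"
    return "long"


def estimated_slack(prompt, deadline_s, now_s=0.0):
    """Time left before the deadline minus the estimated service time."""
    est_service_s = BUCKET_TOKENS[predict_bucket(prompt)] / ASSUMED_TOKENS_PER_SECOND
    return (deadline_s - now_s) - est_service_s


def dispatch_order(requests, now_s=0.0):
    """Order requests least-slack-first.

    requests: list of (req_id, prompt, deadline_s) tuples.
    Ties are broken by arrival position in the input list.
    """
    heap = [
        (estimated_slack(prompt, deadline_s, now_s), position, req_id)
        for position, (req_id, prompt, deadline_s) in enumerate(requests)
    ]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[2])
    return order


if __name__ == "__main__":
    demo = [
        ("a", "hi", 10.0),
        ("b", " ".join(["w"] * 300), 40.0),
        ("c", " ".join(["w"] * 50), 9.0),
    ]
    print(dispatch_order(demo))
```

The point of the sketch is that the client never needs an exact output length: a bucket-level estimate is enough to rank requests by how close they are to missing their deadlines, which is the kind of partial knowledge the term "semi-omniscient" suggests.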