Section 01
[Introduction] Anytime LLM Inference: An Inference Optimization Framework for LLMs Under Real-Time Constraints
This article presents the Anytime LLM Inference framework, which addresses the unpredictable latency of conventional LLM inference by introducing a confidence threshold mechanism and KV cache scheduling at the intermediate Transformer layers. The framework maximizes output quality while guaranteeing that hard real-time deadlines are met, making it suitable for latency-critical scenarios such as clinical decision support and autonomous driving.
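To make the anytime idea concrete, the following is a minimal sketch (not the framework's actual implementation) of a deadline-aware inference loop: computation proceeds layer by layer, stops early once a confidence threshold is reached, and returns the best result so far if the hard deadline expires first. The `make_layer` helper and all numeric values are hypothetical stand-ins for real Transformer layers and confidence estimates.

```python
import time

def anytime_infer(layers, x, deadline_s, conf_threshold):
    """Run layers until confidence passes the threshold or the
    deadline expires; always return the best-so-far output."""
    start = time.monotonic()
    output, confidence = x, 0.0
    for layer in layers:
        if time.monotonic() - start >= deadline_s:
            break  # hard deadline hit: return current partial result
        output, confidence = layer(output)
        if confidence >= conf_threshold:
            break  # confident enough: exit early to save latency
    return output, confidence

def make_layer(gain):
    """Hypothetical layer: refines the value and raises confidence."""
    def layer(x):
        new = x + 1
        return new, min(1.0, gain * new)
    return layer

layers = [make_layer(0.2) for _ in range(10)]
out, conf = anytime_infer(layers, x=0, deadline_s=0.05, conf_threshold=0.8)
# Exits after 4 of 10 layers, once confidence reaches the 0.8 threshold.
```

The key property this sketch illustrates is graceful degradation: unlike a fixed-depth forward pass, the loop can be interrupted at any layer boundary and still produce a usable answer, trading quality for latency.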