Zing Forum


Anytime LLM Inference: A Real-Time Scheduling Framework for Constraining Inference Latency via Predictive Early Exit Mechanism

This article introduces an Anytime algorithm framework designed for large language model (LLM) inference. By incorporating a confidence threshold mechanism in the middle layers of the Transformer, it maximizes output quality while ensuring hard real-time deadlines are met.

Tags: LLM inference · Real-time systems · Anytime algorithms · Early exit mechanism · KV cache · Latency optimization · Transformer · TinyLlama · Schedulability analysis · Confidence threshold
Published 2026-04-23 09:09 · Recent activity 2026-04-23 09:19 · Estimated read: 6 min

Section 01

[Introduction] Anytime LLM Inference: An LLM Inference Optimization Framework Under Real-Time Constraints

This article presents the Anytime LLM Inference framework, which addresses the problem of uncertain latency in traditional LLM inference by introducing a confidence threshold mechanism and KV cache scheduling in the middle layers of the Transformer. It maximizes output quality while ensuring hard real-time deadlines are met, making it suitable for real-time scenarios such as clinical decision-making and autonomous driving.


Section 02

Background: Latency Dilemma in Real-Time AI Inference

In interactive AI applications (e.g., clinical decision support, human-computer interaction, cyber-physical control systems), latency is a core metric. Traditional autoregressive LLM inference passes every token through all Transformer layers, so worst-case execution time is unbounded; latency surges with long contexts or long responses, violating real-time constraints. Providing predictable latency bounds without sacrificing output quality is therefore a core challenge for real-time AI systems.


Section 03

Methodology: Core Mechanisms of the Anytime Framework

The Anytime framework exploits predictive signals in the hidden states of the Transformer's middle layers. For TinyLlama-1.1B-Chat, the token predicted from the layer-16 hidden state (of 22 layers) matches the full-depth output 32% of the time overall, rising to 64.7% when the layer-16 confidence is ≥ 0.5. On top of this signal, a KV cache scheduler triggers early exit whenever the confidence exceeds a threshold, keeping per-token generation within the 45 ms deadline. Layer-wise ablation experiments selected layer 16 as the default early-exit point, the best balance of quality and efficiency. Two scheduling strategies are provided: stateless dynamic scheduling (a two-stage decision, suited to short sequences) and KV-cache single-stage scheduling (a single forward pass with a fixed threshold of 0.55 and stable latency).
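The confidence-gated exit decision can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the function names and the example logits are assumptions, with the 0.55 threshold taken from the KV-cache scheduler described above.

```python
import math

CONF_THRESHOLD = 0.55  # fixed threshold of the KV-cache scheduler

def softmax_confidence(logits):
    """Max softmax probability over the vocabulary: the exit signal."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

def decide_exit(mid_layer_logits):
    """Exit early iff the mid-layer prediction is confident enough."""
    conf = softmax_confidence(mid_layer_logits)
    return conf >= CONF_THRESHOLD, conf

# One dominant logit: confident, so exit at the middle layer
print(decide_exit([8.0, 1.0, 0.5, 0.2])[0])  # True

# Flat logits: uncertain, so run the remaining layers
print(decide_exit([1.0, 1.0, 1.0, 1.0])[0])  # False
```

With a real model, `mid_layer_logits` would come from projecting the layer-16 hidden state through the language-model head after RMSNorm, as Section 05 describes.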


Section 04

Evidence: Real-Time Performance and Validation

Real-time analysis uses the schedulability criterion P99_TPOT ≤ D. On PubMedQA, the KV cache scheduler achieved an average TPOT (time per output token) of 20 ms, a P99 TPOT of 22 ms, a utilization of 0.488, and a zero deadline-miss rate; the stateless scheduler's P99 TPOT of 48.3 ms exceeded the deadline. A deadline scan shows the KV cache scheduler remains schedulable for D ≥ 22 ms. In clinical tests, KV cache mode achieved 71.4% accuracy (on extractable labels), a 46.7% label extraction rate, zero misses, and an average TPOT of 19.5 ms.
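The criterion P99_TPOT ≤ D is easy to check offline from measured per-token latencies. A minimal sketch — the function names, the nearest-rank percentile method, and the sample numbers are assumptions for illustration, not taken from the article:

```python
import math

def p99(samples_ms):
    """Empirical 99th percentile (nearest-rank method)."""
    s = sorted(samples_ms)
    k = max(0, math.ceil(0.99 * len(s)) - 1)
    return s[k]

def schedulable(tpot_ms, deadline_ms):
    """The article's hard real-time criterion: P99_TPOT <= D."""
    return p99(tpot_ms) <= deadline_ms

def utilization(tpot_ms, deadline_ms):
    """Average share of the per-token budget actually consumed."""
    return sum(tpot_ms) / len(tpot_ms) / deadline_ms

# 98 tokens at 20 ms plus two 22 ms stragglers, 45 ms deadline
tpot = [20.0] * 98 + [22.0] * 2
print(p99(tpot))                 # 22.0
print(schedulable(tpot, 45.0))   # True
print(schedulable(tpot, 21.0))   # False: P99 exceeds a 21 ms deadline
```

Sweeping `deadline_ms` over a range with `schedulable` reproduces the deadline-scan style of analysis: the smallest D for which the check passes is the scheduler's effective latency bound.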


Section 05

Technical Implementation Details

The model is wrapped in a custom EarlyExitTinyLlama class supporting layer-wise forward control (e.g., exit_layer=16 for early exit). Key invariants: the final RMSNorm is applied at the exit point, rotary positional encodings are shared across paths, and no in-place modifications are made. The KV cache path registers a forward hook that captures the intermediate hidden state at layer 15 (0-indexed) during the single forward pass, avoiding the KV cache desynchronization that a two-stage pipeline would incur.
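A framework-free toy of the hook idea (the actual implementation registers a PyTorch forward hook on the layer-15 decoder block; all class and function names here are illustrative): the hook captures the hidden state leaving layer 15 during the one forward pass, so no second pass over the layers is needed.

```python
class ToyLayer:
    """Stand-in for one Transformer block: here, just a scalar multiply."""
    def __init__(self, scale):
        self.scale = scale
        self.forward_hooks = []

    def __call__(self, x):
        out = x * self.scale         # no in-place modification of x
        for hook in self.forward_hooks:
            hook(self, x, out)       # (module, input, output), PyTorch-style
        return out

class ToyModel:
    def __init__(self, n_layers=22):
        self.layers = [ToyLayer(1.01) for _ in range(n_layers)]
        self.captured = {}

    def hook_layer(self, idx):
        """Capture the hidden state leaving layer `idx` (0-indexed)."""
        def hook(module, inp, out):
            self.captured[idx] = out
        self.layers[idx].forward_hooks.append(hook)

    def forward(self, x, exit_layer=None):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if exit_layer is not None and i + 1 == exit_layer:
                break                # early exit after `exit_layer` blocks
        return x

model = ToyModel()
model.hook_layer(15)                 # layer 15 (0-indexed) = 16th block
full = model.forward(1.0)            # one full pass; hook fires mid-way
mid = model.captured[15]             # intermediate state, no second pass
print(abs(mid - 1.01 ** 16) < 1e-9)  # True: state after 16 of 22 blocks
```

The captured state feeds the early-exit confidence check; because it is taken from the same pass that fills the cache, the KV entries for all executed layers stay consistent with the emitted token.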


Section 06

Practical Implications and Limitations

Application scenarios include clinical decision-making, autonomous driving, industrial control, and voice interaction; the core value is latency predictability. The central trade-off is an adaptive balance between latency guarantees and output quality. Limitations: validation only on TinyLlama-1.1B, heuristically set confidence thresholds, and the limited instruction-following ability of small models (53% of clinical-test responses were verbose).


Section 07

Conclusions and Insights

The Anytime framework applies real-time system methods (WCET analysis, schedulability proof) to LLM inference, proving that latency predictability can be achieved via algorithmic scheduling. It provides a reference for deploying LLMs in edge or real-time scenarios, showing that latency guarantees can be achieved through intelligent scheduling. This approach of balancing efficiency and quality is crucial for real-time AI systems.