Section 01
Introduction
Speculative Decoding Latency Model: A Practical Framework for LLM Inference Acceleration in Production
This paper proposes an interpretable latency model for speculative decoding. The model uses Little's Law to infer the effective batch size from observed arrival rate and latency, and decomposes per-request latency into load-independent and load-dependent components across the prefill, draft-generation, and verification stages. It explains why the speedup from speculative decoding shrinks as server load increases and offers guidance for configuring production deployments. By capturing the dynamic system behavior that prior work largely ignores, the model helps engineers set serving parameters in a principled way and improve LLM inference performance.
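To make the decomposition concrete, the minimal sketch below applies Little's Law (concurrency = arrival rate × mean latency) to infer an effective batch size, then sums a load-independent base cost and a load-dependent term per stage. It is not the paper's implementation: all stage costs, the acceptance-rate round model, and the function names (`StageCosts`, `effective_batch_size`, `request_latency`) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class StageCosts:
    """Hypothetical per-stage costs: a load-independent base time (s)
    and a load-dependent increment (s) per unit of effective batch size."""
    base: float          # load-independent component
    per_request: float   # load-dependent component

def effective_batch_size(arrival_rate: float, mean_latency: float) -> float:
    """Little's Law: effective batch size (concurrency) = arrival rate x mean latency."""
    return arrival_rate * mean_latency

def stage_latency(stage: StageCosts, batch_size: float) -> float:
    """Per-stage latency = load-independent base + load-dependent term."""
    return stage.base + stage.per_request * batch_size

def request_latency(batch_size: float,
                    prefill: StageCosts,
                    draft: StageCosts,
                    verify: StageCosts,
                    output_tokens: int,
                    draft_len: int,
                    accept_rate: float) -> float:
    """Total latency under speculative decoding: one prefill pass, then
    repeated draft + verify rounds. Each round is assumed to accept
    1 + accept_rate * draft_len tokens on average (a common simplification)."""
    tokens_per_round = 1 + accept_rate * draft_len
    rounds = output_tokens / tokens_per_round
    per_round = (draft_len * stage_latency(draft, batch_size)
                 + stage_latency(verify, batch_size))
    return stage_latency(prefill, batch_size) + rounds * per_round

if __name__ == "__main__":
    # Made-up stage costs for illustration only.
    prefill = StageCosts(base=0.050, per_request=0.004)
    draft   = StageCosts(base=0.002, per_request=0.0001)
    verify  = StageCosts(base=0.015, per_request=0.002)
    for rps in (1.0, 5.0, 20.0):   # arrival rates (requests/s)
        b = effective_batch_size(rps, mean_latency=2.0)
        t = request_latency(b, prefill, draft, verify,
                            output_tokens=256, draft_len=4, accept_rate=0.7)
        print(f"load={rps:5.1f} req/s  eff. batch={b:5.1f}  latency={t:6.2f}s")
```

Under these assumed costs, the load-dependent verification term grows with the effective batch size, so per-round cost rises with load; this is the mechanism behind the weakening speedup the model predicts.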