Section 01
Can KV Cache Save Long-Range Speculative Decoding? A New Perspective on the Hidden State Drift Problem
This article examines the long-range degradation problem in speculative decoding for large language models and proposes the KV-Reuse hypothesis: letting the draft model reuse the target model's KV cache, rather than its hidden states, to mitigate accuracy degradation. It also open-sources the KVShot diagnostic framework to validate the hypothesis. Key findings: hidden-state reuse suffers from an information-compression bias, whereas the KV cache retains more complete context; KV reuse can improve the long-range speculative acceptance rate, but it faces two major bottlenecks, namely the difficulty shallow draft models have in estimating queries and the sparse gradients in the KV projection; breakthroughs in directions such as block-level training are still needed. These results offer insights for next-generation inference architectures.
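As background for the acceptance-rate discussion above, the sketch below shows the core draft-then-verify loop of greedy speculative decoding. It is a toy illustration, not the article's KV-Reuse method: `target_next` and `draft_next` stand in for real models as plain next-token functions, and KV-cache handling is omitted. The key property it demonstrates is that the output always matches the target model's own greedy decode; a better draft only changes how many tokens are accepted per round.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding sketch.

    target_next / draft_next: hypothetical stand-ins for the target and
    draft models; each maps a token list to the next greedy token.
    """
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1. Draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies: accept the longest prefix that
        #    matches its own greedy choice at each position.
        accepted, ctx = [], list(seq)
        for t in proposal:
            if target_next(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        # 3. On a rejection, the target still contributes one token,
        #    so every round makes progress.
        if len(accepted) < k:
            accepted.append(target_next(seq + accepted))
        seq += accepted
    return seq[:len(prompt) + n_new]
```

With a perfect draft, every proposal is accepted and the loop runs few verification rounds; with a flawed draft, the output is identical but fewer tokens are accepted per round, which is exactly the acceptance-rate metric the article's KV-Reuse hypothesis aims to improve at long range.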