Section 01
Can KV Cache Save Long-Range Speculative Decoding? A New Perspective on the Hidden State Drift Problem
This article examines the long-range degradation problem in speculative decoding for large language models and proposes the KV-Reuse hypothesis: letting the draft model reuse the target model's KV cache, rather than its hidden states, to mitigate accuracy degradation. It also open-sources the KVShot diagnostic framework to validate the hypothesis. Key findings: hidden-state reuse suffers from an information-compression bias, whereas the KV cache retains more complete context; KV reuse can improve the long-range speculative acceptance rate, but it faces two major bottlenecks, namely the difficulty shallow draft models have in estimating queries and the sparse gradients in the KV projection; breakthroughs in directions such as block-level training are still needed. These results offer insights for next-generation inference architectures.
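As background for the acceptance-rate discussion above, the sketch below shows the core draft-then-verify loop of greedy speculative decoding. It is a toy illustration, not the article's KV-Reuse method: `target_next` and `draft_next` stand in for real models as plain next-token functions, and KV-cache handling is omitted. The key property it demonstrates is that the output always matches the target model's own greedy decode; a better draft only changes how many tokens are accepted per round.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding sketch.

    target_next / draft_next: hypothetical stand-ins for the target and
    draft models; each maps a token list to the next greedy token.
    """
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1. Draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies: accept the longest prefix that
        #    matches its own greedy choice at each position.
        accepted, ctx = [], list(seq)
        for t in proposal:
            if target_next(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        # 3. On a rejection, the target still contributes one token,
        #    so every round makes progress.
        if len(accepted) < k:
            accepted.append(target_next(seq + accepted))
        seq += accepted
    return seq[:len(prompt) + n_new]
```

With a perfect draft, every proposal is accepted and the loop runs few verification rounds; with a flawed draft, the output is identical but fewer tokens are accepted per round, which is exactly the acceptance-rate metric the article's KV-Reuse hypothesis aims to improve at long range.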