# Can KV Cache Save Long-Range Speculative Decoding? A New Perspective on the Hidden State Drift Problem

> This article proposes the KV-Reuse hypothesis: mitigate the accuracy degradation of long-range speculative decoding by having the draft model reuse the target model's KV cache instead of its hidden states. It also open-sources the KVShot diagnostic framework to validate the hypothesis.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-29T08:25:01.000Z
- Last activity: 2026-04-30T02:50:47.997Z
- Heat: 141.6
- Keywords: speculative decoding, KV cache, LLM inference, inference optimization, Qwen3, test-time training, hidden states, draft model
- Page link: https://www.zingnex.cn/en/forum/thread/kv-hidden-state
- Canonical: https://www.zingnex.cn/forum/thread/kv-hidden-state
- Markdown source: floors_fallback

---

## Main Floor: Can KV Cache Save Long-Range Speculative Decoding? A New Perspective on the Hidden State Drift Problem

This article examines the long-range degradation problem in speculative decoding for large language models and proposes the KV-Reuse hypothesis: let the draft model reuse the target model's KV cache instead of its hidden states to mitigate accuracy decay. It also open-sources the KVShot diagnostic framework to test the hypothesis. Key findings:

- Hidden-state reuse suffers from an information compression bias, while the KV cache preserves a more complete view of the context.
- KV reuse improves the long-range speculative acceptance rate, but faces two structural bottlenecks: shallow draft models struggle to estimate the target's query vectors, and gradients through the KV projection are sparse.
- Breakthroughs will likely require directions such as block-level training, which in turn suggests designs for next-generation inference architectures.

## Background: Acceleration of Speculative Decoding and the Dilemma of Long-Range Degradation

Speculative decoding is a technique that accelerates generation by roughly 2-3x without changing the target model's output distribution: a small draft model quickly proposes candidate tokens, and the large target model verifies them in parallel. It has a serious weakness, however: long-range degradation. As the number of speculative steps grows, the draft model's prediction accuracy drops sharply, the acceptance rate falls, and the speedup is largely lost.
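To make the loop concrete, here is a minimal, self-contained sketch of one draft-then-verify round. It uses greedy acceptance for readability (production systems use a rejection-sampling rule that provably preserves the target's output distribution), and `toy_lm` is a stand-in for a real language model.

```python
import torch

def speculative_step(target, draft, prefix: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One draft-then-verify round. `target` and `draft` map a token
    sequence (T,) to per-position next-token logits (T, vocab).
    Greedy acceptance is used here for clarity."""
    # 1. Draft model proposes k tokens autoregressively (the cheap part).
    seq = prefix.clone()
    for _ in range(k):
        seq = torch.cat([seq, draft(seq)[-1].argmax().view(1)])

    # 2. Target model scores the whole block in ONE forward pass.
    preds = target(seq).argmax(dim=-1)  # preds[j] = target's token after seq[:j+1]

    # 3. Accept the longest prefix of drafted tokens the target agrees with.
    n = prefix.numel()
    accepted = 0
    while accepted < k and seq[n + accepted] == preds[n + accepted - 1]:
        accepted += 1

    # 4. Always emit one token from the target itself, so every round
    #    makes progress even when no drafted token is accepted.
    return torch.cat([seq[: n + accepted], preds[n + accepted - 1].view(1)])

# Toy usage: a "model" whose next token is always (last token + 1) % 50.
def toy_lm(seq, vocab=50):
    logits = torch.zeros(seq.numel(), vocab)
    logits[torch.arange(seq.numel()), (seq + 1) % vocab] = 1.0
    return logits

print(speculative_step(toy_lm, toy_lm, torch.tensor([1, 2, 3])))
```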

## Root Cause of the Problem: Information Compression Bias in Hidden States

Traditional speculative decoding reuses the target model's hidden states as context, and this interface loses information by construction. A hidden state is a biased compression of the context: it prioritizes the information most relevant to the current query and suppresses the background information that later speculative steps need. This short-sighted compression accumulates across steps, so prediction quality degrades as the speculation depth grows.
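To see where the bias enters, look at the interface itself. The sketch below (class name and fusion layer are our illustration, loosely in the style of hidden-state draft heads such as EAGLE, not code from the article) makes the point structurally: the entire prefix reaches the draft through a single d-dimensional vector.

```python
import torch
import torch.nn as nn

class HiddenStateDraftHead(nn.Module):
    """Hidden-state-reuse interface: the draft sees only the target's
    LAST hidden state plus the new token embedding. The whole prefix is
    squeezed through one d-dim vector, so background detail needed k
    steps later may already have been discarded. (Illustrative sketch.)"""
    def __init__(self, d_model: int, vocab: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, last_hidden: torch.Tensor, token_id: torch.Tensor):
        # last_hidden: (B, d_model) -- one vector stands in for the context.
        x = torch.cat([last_hidden, self.embed(token_id)], dim=-1)
        return self.lm_head(torch.tanh(self.fuse(x)))
```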

## Method: Proposal of the KV-Reuse Hypothesis

The KV cache, by contrast, stores complete token-level key-value representations: it is an explicit context store, not a single compressed vector. This motivates the KV-Reuse hypothesis: if the draft model directly reuses the target model's KV cache, the long-range speculative acceptance rate should improve significantly. The intuition is that the KV cache keeps attention information for every position, so the draft model can flexibly retrieve whatever it needs from any point in the context.
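A minimal sketch of what that interface could look like (a single attention head, our own illustration rather than KVShot code): the draft contributes only a query and reads directly from the target's cached keys and values, so every past position stays individually addressable.

```python
import torch
import torch.nn as nn

class KVReuseDraftLayer(nn.Module):
    """KV-Reuse sketch: the draft produces a QUERY and attends over the
    target model's cached keys/values instead of a compressed summary."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # draft-side query estimate

    def forward(self, draft_x, target_k, target_v):
        # draft_x: (B, d); target_k / target_v: (B, T, d) -- target's KV cache.
        q = self.q_proj(draft_x).unsqueeze(1)                      # (B, 1, d)
        att = (q @ target_k.transpose(1, 2)) / target_k.shape[-1] ** 0.5
        return (att.softmax(dim=-1) @ target_v).squeeze(1)         # (B, d)

# Shape check with random tensors.
layer = KVReuseDraftLayer(64)
print(layer(torch.randn(2, 64), torch.randn(2, 16, 64), torch.randn(2, 16, 64)).shape)
```

Note how retrieval quality now hinges entirely on `q_proj`: if the draft's query estimate is off, it attends to the wrong cache slots, which foreshadows the first bottleneck below.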

## Evidence: Experimental Validation with the KVShot Framework

The KVShot diagnostic framework was built to compare three reuse paradigms: Hidden-only (the traditional approach), KV-only (KV cache only), and Hybrid (a mix of both). Experiments on Qwen3-8B confirm that KV reuse improves the long-range speculative acceptance rate, yet the end-to-end speedup remains limited, which raises a deeper question: why doesn't better long-range prediction translate into a significant acceleration gain?
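The post does not show KVShot's actual API, so the snippet below is a hypothetical diagnostic in the same spirit: measure token-level acceptance as a function of speculative depth. This is the curve that should flatten if KV reuse truly fixes long-range drift, and comparing it against wall-clock speedup is what exposes the gap.

```python
import torch

def acceptance_by_depth(accept_masks: list, k: int) -> torch.Tensor:
    """accept_masks: one bool tensor of shape (k,) per verify round, True
    where the drafted token at that depth matched the target (token-level
    agreement; realized acceptance truncates at the first mismatch).
    Returns P(accept) per depth -- the long-range degradation curve."""
    return torch.stack(accept_masks).float().mean(dim=0)

# Synthetic illustration only: agreement decaying with depth at two rates.
torch.manual_seed(0)
k, rounds = 8, 2000
curve = lambda b: [torch.rand(k) < b ** torch.arange(1, k + 1).float() for _ in range(rounds)]
print("hidden-only-like:", acceptance_by_depth(curve(0.80), k))
print("kv-only-like:   ", acceptance_by_depth(curve(0.92), k))
```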

## Findings: Two Major Structural Bottlenecks Faced by KV Reuse

Analysis reveals that KV-aware decoding faces two structural bottlenecks: 1. a shallow draft model struggles to estimate the deep, complex query vectors of the target model, which degrades retrieval from the KV cache; 2. gradient signals through the KV projection layers are sparse and lack direct supervision, so the draft model has difficulty producing KV representations compatible with the target model.
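For the first bottleneck, one conceivable countermeasure (our hypothetical objective, not something the post prescribes) is to distill the target's queries directly into the draft, giving the query estimate its own dense supervision:

```python
import torch
import torch.nn.functional as F

def query_distillation_loss(draft_q: torch.Tensor, target_q: torch.Tensor) -> torch.Tensor:
    """Regress the draft's query estimate onto the target's deep query so
    KV retrieval attends to the right cache slots. The cosine term keeps
    the attention *direction* aligned even when norms differ; MSE closes
    the residual magnitude gap. (Hypothetical auxiliary objective.)"""
    mse = F.mse_loss(draft_q, target_q)
    cos = 1.0 - F.cosine_similarity(draft_q, target_q, dim=-1).mean()
    return mse + cos
```

The second bottleneck is a training-signal problem rather than an architectural one, which is what motivates the block-level paradigm in the next section.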

## Suggestion: Shift to Block-Level Training Paradigm

Existing Test-Time Training (TTT) cannot fix long-range degradation, because the root cause is structural. We need to move beyond TTT toward a block-level training paradigm: expose the model to multi-step speculative targets during training, provide richer gradient signals to the KV projection layers, and fundamentally improve the draft model's ability to generate high-quality KV representations.
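A minimal sketch of what multi-step speculative targets could mean in code (the `draft_step` API is a placeholder we invented): roll the draft forward over an entire block and supervise every depth, so gradients reach the KV and query projections from all positions rather than from depth 1 alone.

```python
import torch
import torch.nn.functional as F

def block_level_loss(draft_step, state0: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """draft_step(state) -> (logits, next_state); targets: (B, k) token ids
    for a whole speculative block. Summing per-depth losses gives the
    draft's projections a gradient from every depth, not just the first."""
    state, loss = state0, 0.0
    for t in range(targets.shape[-1]):
        logits, state = draft_step(state)
        loss = loss + F.cross_entropy(logits, targets[..., t])
    return loss / targets.shape[-1]

# Toy usage with a linear recurrent "draft" (illustrative only).
d, vocab = 16, 50
W, head = torch.randn(d, d), torch.randn(d, vocab)
step = lambda s: (s @ head, torch.tanh(s @ W))
print(block_level_loss(step, torch.randn(4, d), torch.randint(0, vocab, (4, 3))))
```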

## Implications: Design Directions for Next-Generation Inference Architectures

The KVShot framework points toward several design directions for next-generation inference architectures: 1. upgrade the draft architecture to lightweight models that better estimate the target's query vectors; 2. develop KV-aware training objectives that provide denser supervision signals; 3. explore optimal fusion strategies between hidden states and the KV cache; 4. co-design hardware and software to meet the memory-bandwidth demands of KV reuse.
