# VaSE: Value-Aware Stochastic KV Cache Eviction Strategy for Reasoning Models

> VaSE increases cache diversity by protecting large-value states and introducing randomness into eviction. Under 4x KV cache compression, a reasoning model using VaSE achieves higher average accuracy across six reasoning tasks than SOTA selection methods, and it outperforms the strongest eviction method by over 4%.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-02T17:16:33.000Z
- Last activity: 2026-06-03T05:23:00.400Z
- Popularity: 125.9
- Keywords: KV cache, reasoning models, cache eviction, memory optimization, Qwen3, sparse attention
- Page link: https://www.zingnex.cn/en/forum/thread/vase-kv
- Canonical: https://www.zingnex.cn/forum/thread/vase-kv

---

## [Introduction] VaSE: Value-Aware Stochastic KV Cache Eviction Strategy Boosts Reasoning Model Performance

VaSE targets the KV cache memory bottleneck created by the long outputs of reasoning models. It is a value-aware stochastic KV cache eviction strategy: protecting large-value states preserves reasoning coherence, and randomized eviction increases cache diversity. Under 4x KV cache compression, the reasoning model's average accuracy across six reasoning tasks surpasses SOTA selection methods and exceeds the strongest eviction method by over 4%. VaSE requires no training to deploy.

## KV Cache Memory Challenges for Reasoning Models

Reasoning models improve accuracy through chain-of-thought, but their long outputs make the KV cache consume a large amount of memory. Existing KV cache eviction methods reduce this cost, but they usually perform worse than sparse attention schemes that retain the full cache. The key challenge is to compress the KV cache while maintaining model performance.
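
To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of KV cache size as a function of output length. The model configuration in the example (32 layers, 8 KV heads, head dimension 128, 16-bit elements) is hypothetical and chosen for round numbers; it is not taken from the paper or from any specific Qwen3 checkpoint.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical configuration, for illustration only.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
compressed = full // 4  # what a 4x compression budget would leave

print(full / 2**30, compressed / 2**30)  # 4.0 1.0 (GiB)
```

The cache grows linearly with sequence length, which is why long chain-of-thought outputs are the dominant cost under these assumptions.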

## Core Design of the VaSE Method

VaSE consists of two core components:
1. **Value-aware component**: Identifies and protects large-value states. A threshold retains the top 5-10% of large-value states so that key reasoning clues are not evicted.
2. **Stochastic component**: Uses Gumbel sampling to select randomly among the evictable candidates, with probability inversely proportional to importance, which increases cache diversity.

The method requires no training. It acts as a wrapper layer around the attention mechanism and dynamically decides which KV pairs to retain.
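
The two components can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that the summary does not confirm: "large-value" is read as a large value-vector norm, `importance` stands for some non-negative per-token score such as accumulated attention weight, and the random selection is interpreted as choosing eviction victims with the Gumbel-top-k trick. The function names `gumbel_noise` and `vase_keep_indices` are hypothetical, not the authors' API.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse-CDF sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def vase_keep_indices(value_norms, importance, budget, protect_ratio=0.1, seed=None):
    """Return the sorted token indices to keep (assumed interpretation of VaSE)."""
    n = len(value_norms)
    if len(importance) != n:
        raise ValueError("value_norms and importance must have the same length")
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    rng = random.Random(seed)

    # Value-aware step: protect the top share of tokens by value norm.
    n_protect = min(budget, math.ceil(protect_ratio * n))
    by_value = sorted(range(n), key=lambda i: value_norms[i], reverse=True)
    protected = set(by_value[:n_protect])

    # Stochastic step: Gumbel-top-k over the remaining candidates.
    # Adding Gumbel noise to log-weights and taking the k largest keys samples
    # k items without replacement with weights proportional to 1 / importance.
    candidates = [i for i in range(n) if i not in protected]
    n_evict = n - budget
    keys = {i: -math.log(importance[i] + 1e-12) + gumbel_noise(rng)
            for i in candidates}
    evicted = set(sorted(candidates, key=keys.get, reverse=True)[:n_evict])
    return [i for i in range(n) if i not in evicted]


value_norms = [0.1, 5.0, 0.2, 0.3, 4.0, 0.1, 0.2, 0.3]
importance = [0.05, 0.01, 0.30, 0.10, 0.02, 0.25, 0.15, 0.12]
keep = vase_keep_indices(value_norms, importance, budget=4, protect_ratio=0.25, seed=0)
print(keep)  # always contains tokens 1 and 4; the other two slots vary with the seed
```

In this example tokens 1 and 4 have low importance scores, so a purely importance-ranked policy would evict them first. The value-aware step keeps them regardless, and the remaining slots are filled by a random draw rather than a strict top-k.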

## Experimental Validation of VaSE's Effectiveness

Experiments show that, under 4x KV cache compression, a Qwen3 model using VaSE achieves higher average accuracy across six reasoning tasks than SOTA selection methods at the same sparsity, and it outperforms the strongest eviction method by over 4%. VaSE is also compatible with FlashAttention2 and can run with static memory usage, which is crucial for production deployment.

## Practical Deployment Value of VaSE

VaSE has significant practical value. It can be applied immediately to any Transformer reasoning model without retraining or architecture changes. Its static memory usage also lets system administrators predict memory requirements accurately and avoid OOM errors caused by changes in input length.
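
One way to picture static memory usage is a cache whose storage is allocated once and never grows. The sketch below is an assumption about how such a guarantee could be realized, not a description of the paper's implementation; `FixedBudgetCache` and the placeholder victim policy are hypothetical.

```python
class FixedBudgetCache:
    """Toy KV cache with a fixed number of slots allocated up front."""

    def __init__(self, budget):
        self.slots = [None] * budget  # allocated once, never resized
        self.size = 0

    def append(self, entry, pick_victim):
        """Store entry, overwriting a victim slot once the cache is full."""
        if self.size < len(self.slots):
            self.slots[self.size] = entry
            self.size += 1
        else:
            # pick_victim stands in for the eviction policy.
            victim = pick_victim(self.slots)
            self.slots[victim] = entry


cache = FixedBudgetCache(budget=4)
for token_id in range(1000):
    # Placeholder policy: always overwrite slot 0.
    cache.append(token_id, pick_victim=lambda slots: 0)

print(len(cache.slots), cache.size)  # 4 4
```

Because the slot count is fixed by the budget, peak memory depends on the configured budget rather than on how long the model keeps generating.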

## Future Research Directions

The paper proposes three directions for future research:
- Dynamic threshold adjustment: automatically determining the protection ratio based on input characteristics;
- Combination with quantization techniques to further compress the cache;
- Adaptive eviction strategies for multi-task scenarios.