
VaSE: Value-Aware Stochastic KV Cache Eviction Strategy for Reasoning Models

VaSE protects large-value states and introduces randomness into eviction to increase cache diversity. Under 4x KV cache compression, the reasoning model's average accuracy across six reasoning tasks surpasses SOTA selection methods and exceeds the strongest eviction method by over 4%.

KV cache, reasoning models, cache eviction, memory optimization, Qwen3, sparse attention

Published 2026-06-03 01:16 | Recent activity 2026-06-03 13:23 | Estimated read 4 min


Section 01

[Introduction] VaSE: Value-Aware Stochastic KV Cache Eviction Strategy Boosts Reasoning Model Performance

VaSE addresses the KV cache memory bottleneck created by the long outputs of reasoning models with a value-aware stochastic eviction strategy. It preserves reasoning coherence by protecting large-value states and increases cache diversity by introducing randomness into eviction. Under 4x KV cache compression, the reasoning model's average accuracy across six reasoning tasks surpasses SOTA selection methods and exceeds the strongest eviction method by over 4%. The strategy can be deployed without training.

Section 02

KV Cache Memory Challenges for Reasoning Models

Reasoning models improve accuracy through chain-of-thought, but their long outputs make the KV cache consume a large amount of memory. Existing KV cache eviction methods reduce this cost, but they usually underperform sparse attention schemes that retain the full cache. The key challenge is compressing the KV cache without sacrificing model performance.
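To make the memory pressure concrete, the sketch below estimates KV cache size with the standard formula (keys plus values, per layer, per KV head, per cached token). The model shape and the 32,768-token output length are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold keys and values for every layer and cached token."""
    # Factor 2: one tensor for keys and one for values.
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)


# Hypothetical model shape (assumption, for illustration only); fp16 elements.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

full = kv_cache_bytes(seq_len=32_768, **config)
# 4x compression modeled as keeping one quarter of the cached tokens.
compressed = kv_cache_bytes(seq_len=32_768 // 4, **config)

print(f"full cache:    {full / 2**30:.2f} GiB")
print(f"4x compressed: {compressed / 2**30:.2f} GiB")
```

Because the size is linear in the number of cached tokens, a long chain-of-thought output grows the cache in direct proportion to its length, and a 4x reduction in retained tokens yields a 4x reduction in cache memory.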

Section 03

Core Design of the VaSE Method

VaSE consists of two core components:

Value-aware component: identifies and protects large-value states. A threshold retains the top 5-10% of large-value states so that key reasoning clues are not evicted.
Stochastic component: uses Gumbel sampling to select at random from the evictable candidates, with probability inversely proportional to importance, which increases cache diversity.

The method requires no training and acts as a wrapper layer around the attention mechanism that dynamically decides which KV pairs to retain.
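The two components can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes "large-value" means a large value-vector norm, that each cached entry has a non-negative importance score (for example, accumulated attention weight), that protection uses a fixed ratio instead of a tuned threshold, and that the random selection picks which candidates to evict. The function and parameter names are hypothetical.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one standard Gumbel(0, 1) sample."""
    u = max(rng.random(), 1e-12)  # clamp away from 0 so log(u) is finite
    return -math.log(-math.log(u))


def select_evictions(value_norms, importance, budget, protect_ratio=0.1, rng=None):
    """Return sorted indices of cache entries to evict so that `budget` remain."""
    rng = rng or random.Random()
    n = len(value_norms)
    num_evict = n - budget
    if num_evict <= 0:
        return []

    # Value-aware step (assumption: rank by value-vector norm).
    # Protected entries can never be evicted.
    num_protected = min(budget, math.ceil(protect_ratio * n))
    by_norm = sorted(range(n), key=lambda i: value_norms[i], reverse=True)
    protected = set(by_norm[:num_protected])
    candidates = [i for i in range(n) if i not in protected]

    # Stochastic step: the Gumbel-top-k trick draws `num_evict` candidates
    # without replacement, with weights proportional to 1 / importance.
    eps = 1e-12
    keys = {i: -math.log(importance[i] + eps) + gumbel_noise(rng)
            for i in candidates}
    evicted = sorted(candidates, key=keys.get, reverse=True)[:num_evict]
    return sorted(evicted)


# Toy data (hypothetical): entries 1 and 4 have large value norms but low importance.
rng = random.Random(0)
value_norms = [0.5, 9.0, 0.7, 0.6, 8.0, 0.4, 0.3, 0.9]
importance = [0.30, 0.01, 0.05, 0.20, 0.02, 0.10, 0.02, 0.30]

evicted = select_evictions(value_norms, importance, budget=4,
                           protect_ratio=0.25, rng=rng)
kept = [i for i in range(len(value_norms)) if i not in evicted]
print("evicted:", evicted)
print("kept:   ", kept)
```

In this toy run, entries 1 and 4 always survive despite their low importance scores, because the value-aware step removes them from the candidate pool before sampling. Among the remaining entries, less important ones are more likely to be evicted, but none is evicted deterministically.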

Section 04

Experimental Validation of VaSE's Effectiveness

Experiments show that, under 4x KV cache compression, the Qwen3 model using VaSE achieves higher average accuracy across six reasoning tasks than SOTA selection methods at the same sparsity, and exceeds the strongest eviction method by over 4%. VaSE also supports FlashAttention2 and keeps memory usage static, which is crucial for production deployment.

Section 05

Practical Deployment Value of VaSE

VaSE can be applied directly to any Transformer reasoning model without retraining or architecture changes. Its static memory usage lets system administrators predict memory requirements accurately and avoid OOM errors caused by varying input lengths.
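The idea of static memory usage can be illustrated with a toy fixed-budget cache whose slots are allocated once and then overwritten in place. The class name, the `choose_victim` callback, and the uniform random placeholder policy are assumptions made for this sketch; they do not reproduce VaSE's eviction rule or its actual data layout.

```python
import random


class FixedBudgetCache:
    """Toy cache with a fixed number of preallocated slots (hypothetical)."""

    def __init__(self, budget):
        self.slots = [None] * budget  # allocated once, never resized
        self.size = 0

    def append(self, entry, choose_victim):
        """Store `entry`; when full, overwrite the slot picked by `choose_victim`."""
        if self.size < len(self.slots):
            self.slots[self.size] = entry
            self.size += 1
        else:
            victim = choose_victim(self.slots)
            self.slots[victim] = entry


rng = random.Random(0)
cache = FixedBudgetCache(budget=64)
for token_id in range(1000):
    # Placeholder policy: evict a uniformly random slot once the cache is full.
    cache.append(token_id, choose_victim=lambda slots: rng.randrange(len(slots)))

print(len(cache.slots), cache.size)  # prints: 64 64
```

However many tokens are generated, the number of slots stays at the configured budget, so peak cache memory can be computed ahead of time from the budget alone.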

Section 06

Future Research Directions

The paper proposes three future research directions:

Dynamic threshold adjustment: automatically determining the protection ratio based on input characteristics;
Combination with quantization: further compressing cache size;
Adaptive eviction: tailoring the eviction strategy to multi-task scenarios.
