Zing Forum

TriAxialKV: A New Ultra-Low Precision KV Cache Quantization Scheme for Agent Reasoning Tasks

TriAxialKV is a tri-axial mixed-precision KV cache quantization method. It assigns INT2 or INT4 precision to each token according to three axes (temporal proximity, modality type, and semantic role), achieving 4.5x cache compression and a 30% throughput increase while maintaining accuracy.

Tags: KV cache quantization, agent reasoning, mixed precision, large language models, GPU memory optimization, multimodal, OSWorld
Published 2026-05-17 05:58 | Recent activity 2026-05-19 11:47 | Estimated read 4 min

Section 01

[Introduction] TriAxialKV: A New KV Cache Quantization Scheme for Agent Reasoning, 4.5x Compression + 30% Throughput Increase

TriAxialKV is a tri-axial mixed-precision KV cache quantization method designed for agent reasoning tasks. Each token is assigned INT2 or INT4 precision according to three axes: temporal proximity, modality type, and semantic role. The reported result is 4.5x KV cache compression and a 30% throughput increase with no loss of reasoning accuracy, which directly targets the memory bottleneck in agent reasoning.

Section 02

Background: KV Cache Memory Bottleneck in Agent Reasoning

As large language models evolve into agents, reasoning tasks must handle long contexts, multimodal inputs, and multi-round tool calls, and KV cache memory demand grows sharply as a result. A KV cache stored in BF16 can quickly exhaust GPU memory. Existing compression methods mostly apply one precision uniformly to all tokens, or exploit heterogeneity along a single dimension only, so they do not capture how differently tokens behave in agent workloads.

Section 03

Core Insight: Tri-Axial Heterogeneity and Mixed-Precision Quantization Scheme

The TriAxialKV team found that token importance can be characterized along three axes: temporal proximity (recent tokens matter more), modality type (text and image tokens have different characteristics), and semantic role (tokens from user queries, tool calls, and other roles contribute to the output to different degrees). Based on this, they propose a mixed-precision quantization scheme that attaches a tri-axial label to each token and, after a calibration step, assigns each label an INT2 or INT4 bit width to balance memory usage against reasoning quality.
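A minimal sketch of how a tri-axial label could be mapped to a bit width. Everything in it is illustrative: the `TokenLabel` fields, the `recent_window` threshold, the role names, and the toy policy are assumptions made for this example, not the calibrated mapping the authors report.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"


class Role(Enum):  # hypothetical role set, chosen for illustration
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    TOOL = "tool"


@dataclass(frozen=True)
class TokenLabel:
    is_recent: bool     # temporal proximity axis
    modality: Modality  # modality axis
    role: Role          # semantic role axis


def label_token(position, seq_len, modality, role, recent_window=128):
    """Attach a tri-axial label; recent_window is an assumed threshold."""
    return TokenLabel(seq_len - position <= recent_window, modality, role)


def bit_width(label, policy, default_bits=2):
    """Look up the bit width for a label in a label-to-precision table."""
    return policy.get(label, default_bits)


# Toy policy standing in for the calibrated table: keep recent tokens and
# user-query tokens at INT4, and everything else at INT2.
toy_policy = {
    TokenLabel(recent, modality, role): 4 if (recent or role is Role.USER) else 2
    for recent in (True, False)
    for modality in Modality
    for role in Role
}

old_image_token = label_token(10, 1000, Modality.IMAGE, Role.TOOL)
recent_text_token = label_token(990, 1000, Modality.TEXT, Role.TOOL)
print(bit_width(old_image_token, toy_policy), bit_width(recent_text_token, toy_policy))
```

In a real system the table would come out of calibration rather than a hand-written rule; the lookup itself stays a cheap dictionary access per token.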

Section 04

End-to-End System Implementation: Three Core Components

TriAxialKV consists of three core components:

1. Calibration module: analyzes the distribution of token sensitivity and establishes a mapping from labels to precision.
2. Mixed-precision quantization and memory management: dynamically allocates precision and manages the cache efficiently.
3. Custom fused Triton decoding kernel: optimizes GPU memory access patterns so that the compression translates into higher throughput.
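To make the INT2 versus INT4 trade-off concrete, here is a small sketch of quantizing one group of cache values at a chosen bit width. The summary does not say which quantizer TriAxialKV uses, so this assumes plain asymmetric min/max uniform quantization, a common choice for KV caches; the function names are hypothetical.

```python
def quantize_group(values, bits):
    """Asymmetric uniform quantization of one group of floats to `bits` bits."""
    levels = (1 << bits) - 1          # 3 for INT2, 15 for INT4
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each value to the nearest integer code and clamp to the valid range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + offset for c in codes]


def max_abs_error(values, bits):
    """Largest reconstruction error for this group at the given bit width."""
    codes, scale, offset = quantize_group(values, bits)
    restored = dequantize_group(codes, scale, offset)
    return max(abs(a - b) for a, b in zip(values, restored))


key_values = [-1.0, -0.5, 0.0, 0.25, 1.0]
print("INT4 error:", max_abs_error(key_values, 4))
print("INT2 error:", max_abs_error(key_values, 2))
```

The reconstruction error is bounded by half the step size, so dropping from 4 bits to 2 bits widens the step fivefold (15 levels down to 3) for the same value range. That is why the lower precision is only attractive for tokens that calibration marks as insensitive.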

Section 05

Experimental Validation: Win-Win Results for Accuracy and Efficiency

Evaluated with the Qwen3-VL-32B-Thinking model on OSWorld agent tasks, TriAxialKV matches the accuracy of SGLang's BF16 KV cache while achieving a 4.5x cache compression ratio and a 30% increase in end-to-end throughput. In practice, this lets a deployment serve more concurrent requests on the same hardware, or serve the same load with fewer GPUs.
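A back-of-the-envelope check of what a 4.5x ratio implies about the INT2/INT4 mix. This is arithmetic on the reported number, not a figure from the source: the actual token mix is not given here, and `overhead_bits` (per-element cost of scales and offsets) is an assumed parameter.

```python
def compression_ratio(frac_int2, baseline_bits=16.0, overhead_bits=0.0):
    """Compression vs. a 16-bit cache when a fraction of elements is INT2
    and the rest is INT4, plus optional per-element metadata overhead."""
    avg_bits = 2.0 * frac_int2 + 4.0 * (1.0 - frac_int2) + overhead_bits
    return baseline_bits / avg_bits


def int2_fraction_for(ratio, baseline_bits=16.0, overhead_bits=0.0):
    """Fraction of INT2 elements needed to reach a target compression ratio."""
    avg_bits = baseline_bits / ratio
    return (4.0 + overhead_bits - avg_bits) / 2.0


print(compression_ratio(0.0))    # all INT4
print(compression_ratio(1.0))    # all INT2
print(int2_fraction_for(4.5))    # mix needed with no metadata overhead
print(int2_fraction_for(4.5, overhead_bits=0.25))  # assumed overhead
```

With no metadata overhead, all-INT4 gives 4x and all-INT2 gives 8x, so 4.5x corresponds to an average of about 3.56 bits per element, or roughly 22% of elements at INT2. Any real overhead for scales and offsets pushes the required INT2 share higher.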

Section 06

Technical Insights and Future Outlook

TriAxialKV offers three takeaways:

1. A deep understanding of workload characteristics is a prerequisite for optimization.
2. Jointly modeling heterogeneity across multiple dimensions unlocks more headroom than treating any single dimension alone.
3. Tight integration of the algorithm with the system implementation is key to deployment.

Fine-grained optimizations of this kind could lay the groundwork for larger-scale agent applications.