# OSCAR: A Spectral Covariance-Aware Rotation Method for 2-bit KV Cache Quantization

> OSCAR derives rotation matrices and clipping thresholds by estimating attention-aware covariance structures offline, enabling accurate 2-bit KV cache quantization. It maintains BF16-level accuracy while delivering 8x memory compression and up to 7x higher throughput.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-18T02:24:29.000Z
- Last activity: 2026-05-19T02:57:13.639Z
- Popularity: 115.5
- Keywords: KV cache quantization, 2-bit quantization, attention mechanism, covariance-aware, long context, LLM inference optimization, memory compression
- Page link: https://www.zingnex.cn/en/forum/thread/oscar-2-bit-kv
- Canonical: https://www.zingnex.cn/forum/thread/oscar-2-bit-kv
- Markdown source: floors_fallback

---

## OSCAR: 2-bit KV Cache Quantization with Spectral Covariance-Aware Rotation (Introduction)

OSCAR (Offline Spectral Covariance-Aware Rotation) targets the KV cache memory bottleneck in long-context LLM serving by quantizing the cache to 2 bits. It estimates attention-aware covariance structures offline and uses them to derive rotation matrices and clipping thresholds. The reported results are 8x memory compression and up to 7x higher throughput at BF16-level accuracy, which matters for making long-context LLM serving economically feasible.

## Background: Long Context LLM's KV Cache Bottleneck & 2-bit Quantization Challenges

As LLM context windows grow to 128K tokens and beyond, KV cache memory becomes a key deployment bottleneck because it limits batch size and therefore throughput. Quantization reduces that memory, but 2-bit (INT2) quantization faces two core problems: 1) simple methods cause a sharp drop in accuracy; 2) accurate methods often require complex custom kernels that are hard to integrate into existing serving frameworks.
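To make the scale of the bottleneck concrete, the following back-of-the-envelope sketch computes the KV cache size for a single 128K-token sequence at 16 bits and at 2 bits per value. The model shape (`num_layers`, `num_kv_heads`, `head_dim`) is a hypothetical configuration chosen for illustration and is not taken from the OSCAR evaluation. The calculation also ignores quantization metadata such as scales, so a real 2-bit cache would land somewhat below the ideal 8x.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Bytes needed to store keys and values for every layer and token."""
    # Factor 2 accounts for storing both K and V.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


# Hypothetical model shape, used only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)

bf16_bytes = kv_cache_bytes(**cfg, bits_per_value=16)
int2_bytes = kv_cache_bytes(**cfg, bits_per_value=2)

print(f"BF16: {bf16_bytes / 2**30:.1f} GiB")
print(f"INT2: {int2_bytes / 2**30:.1f} GiB")
print(f"ratio: {bf16_bytes / int2_bytes:.0f}x")
```

Under these assumptions one sequence needs 16 GiB in BF16 and 2 GiB in INT2, which is where the 8x figure comes from: 16 bits divided by 2 bits.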

## OSCAR's Core Idea: Attention-Aware Covariance & Offline Optimization

OSCAR's core innovation is to align KV quantization with the covariance structure seen by the attention mechanism. All of the expensive work happens offline: 1) collect Query-Key interaction samples from representative datasets; 2) estimate covariance patterns from these interactions; 3) derive rotation matrices that minimize the impact of quantization error on attention. Because the rotation is chosen with attention in mind, the 2-bit quantized KV cache retains the information that attention computation depends on most.
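The offline steps above can be sketched roughly as follows. This is a simplified illustration and not OSCAR's actual algorithm: it assumes the rotation is taken from the eigenvectors of the query second-moment matrix and that clipping thresholds are per-channel percentiles of rotated calibration keys. The function names, the `clip_pct` parameter, and the random calibration data are all hypothetical. The one property it does demonstrate with certainty is that an orthogonal rotation applied to both queries and keys leaves the attention logits unchanged, so the rotation itself costs no accuracy before quantization.

```python
import numpy as np


def fit_rotation_and_clip(q_samples, k_samples, clip_pct=99.0):
    """Offline step (assumed form): rotation from query statistics, clip from keys."""
    # Second-moment matrix of queries. For a key error e, the logit error
    # q.e has second moment e^T C e under this query distribution.
    cov = q_samples.T @ q_samples / len(q_samples)
    _, eigvecs = np.linalg.eigh(cov)  # columns form an orthonormal basis
    rotation = eigvecs
    # Per-channel clipping threshold measured on rotated calibration keys.
    clip = np.percentile(np.abs(k_samples @ rotation), clip_pct, axis=0)
    return rotation, np.maximum(clip, 1e-8)


def quantize_2bit(x, clip):
    """Map each channel onto 4 uniform levels spanning [-clip, +clip]."""
    step = 2.0 * clip / 3.0
    return np.clip(np.round((x + clip) / step), 0, 3).astype(np.uint8)


def dequantize_2bit(codes, clip):
    step = 2.0 * clip / 3.0
    return codes.astype(np.float32) * step - clip


rng = np.random.default_rng(0)
d = 64
# Stand-ins for calibration data; real Q/K samples would come from a model.
q_cal = rng.standard_normal((1024, d)).astype(np.float32)
k_cal = rng.standard_normal((1024, d)).astype(np.float32)
R, clip = fit_rotation_and_clip(q_cal, k_cal)

q = rng.standard_normal((4, d)).astype(np.float32)
k = rng.standard_normal((16, d)).astype(np.float32)

codes = quantize_2bit(k @ R, clip)          # what would be stored in the cache
k_hat = dequantize_2bit(codes, clip)        # reconstruction in the rotated basis

logits_exact = q @ k.T
logits_rotated = (q @ R) @ (k @ R).T        # same logits, rotated basis
logits_quant = (q @ R) @ k_hat.T            # logits from 2-bit keys
```

With random Gaussian data the quantized logits are only a rough approximation. The point of a covariance-aware choice of rotation and thresholds is to shrink that gap on real attention statistics.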

## OSCAR Deployment: Custom Kernels & Framework Integration

OSCAR ships as a deployable system rather than an algorithm alone: 1) custom INT2 attention kernels that work with paged KV caches (as used in vLLM) and rely on fused pipelines to keep latency low; 2) integration with mainstream serving frameworks such as vLLM and SGLang, so users benefit without modifying application code.
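As a rough picture of what an INT2 page holds, the sketch below packs four 2-bit codes into each byte and unpacks them again. It is a NumPy illustration of the storage layout only: the page shape (`block_size`, `head_dim`) and the bit order within a byte are assumptions, and OSCAR's real kernels presumably do the equivalent work fused inside GPU attention code.

```python
import numpy as np


def pack_int2(codes):
    """Pack 2-bit codes (values 0..3) four per byte along the last axis."""
    assert codes.shape[-1] % 4 == 0, "last axis must be a multiple of 4"
    c = codes.astype(np.uint8).reshape(*codes.shape[:-1], -1, 4)
    # Lowest two bits hold the first code, highest two bits the fourth.
    return c[..., 0] | (c[..., 1] << 2) | (c[..., 2] << 4) | (c[..., 3] << 6)


def unpack_int2(packed):
    """Inverse of pack_int2: recover the 2-bit codes as uint8 values."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    c = (packed[..., None] >> shifts) & 0x3
    return c.reshape(*packed.shape[:-1], -1)


# Hypothetical page shape: 16 tokens per page, 128 channels per head.
block_size, head_dim = 16, 128
rng = np.random.default_rng(0)
page_codes = rng.integers(0, 4, size=(block_size, head_dim), dtype=np.uint8)

page = pack_int2(page_codes)
restored = unpack_int2(page)

print(page.nbytes, "bytes packed vs", block_size * head_dim * 2, "bytes in BF16")
```

For this page shape the packed codes take 512 bytes compared with 4096 bytes in BF16, before counting any per-channel scale or threshold metadata.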

## Experimental Evidence: Precision & Scalability

OSCAR is validated across model sizes and context lengths: 1) small and medium models (Qwen3-4B/8B): the accuracy gap to BF16 is only 3.78 and 1.42 percentage points respectively, while naive INT2 rotation fails; 2) large models (32B, 358B): OSCAR maintains BF16-level accuracy; 3) long context (128K RULER-NIAH): OSCAR remains stable, while naive INT2 fails.

## System Benefits & Conclusion

At the system level, OSCAR delivers an 8x reduction in KV cache memory, up to 7x higher throughput at large batch sizes, and up to 3x faster decoding thanks to reduced memory bandwidth pressure. In short, OSCAR addresses the accuracy problem of 2-bit KV quantization, which makes long-context LLM serving more economical and supports its broader adoption.
