OSCAR: A Spectral Covariance-Aware Rotation Method for 2-bit KV Cache Quantization

OSCAR derives rotation matrices and clipping thresholds from attention-aware covariance structures estimated offline, enabling accurate 2-bit KV cache quantization. It maintains BF16-level accuracy while delivering 8x KV cache memory compression and up to 7x higher throughput.

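KV cache quantization · 2-bit quantization · attention mechanism · covariance-aware · long context · LLM inference optimization · memory compression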

Published 2026-05-18 10:24 · Recent activity 2026-05-19 10:57 · Estimated read 4 min

Section 01

OSCAR: 2-bit KV Cache Quantization with Spectral Covariance-Aware Rotation (Introduction)

OSCAR (Offline Spectral Covariance-Aware Rotation) targets the KV cache memory bottleneck in long-context LLM serving through 2-bit quantization. It estimates attention-aware covariance structures offline and uses them to derive rotation matrices and clipping thresholds, achieving 8x memory compression and up to 7x higher throughput while maintaining BF16-level accuracy. This matters because it makes long-context LLM serving more economically feasible.

Section 02

Background: Long Context LLM's KV Cache Bottleneck & 2-bit Quantization Challenges

As LLM context windows grow to 128K tokens and beyond, KV cache memory becomes a key deployment bottleneck that limits batch size and throughput. Quantization reduces this footprint, but 2-bit (INT2) quantization faces two core problems: 1) simple methods cause sharp accuracy drops; 2) high-accuracy methods often require complex custom kernels that are hard to integrate into existing frameworks.
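To make the scale of the bottleneck concrete, the following is a back-of-the-envelope sketch of KV cache size at a 128K context. The model shape (36 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a configuration taken from the OSCAR work, and the estimate ignores any per-group scale or zero-point metadata a real quantizer would store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Bytes needed to hold keys and values for every layer and token."""
    # The leading 2 accounts for storing both K and V.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128,
             seq_len=128 * 1024, batch_size=1)

bf16 = kv_cache_bytes(**shape, bits_per_value=16)
int2 = kv_cache_bytes(**shape, bits_per_value=2)

print(f"BF16: {bf16 / 2**30:.2f} GiB")   # BF16: 18.00 GiB
print(f"INT2: {int2 / 2**30:.2f} GiB")   # INT2: 2.25 GiB
print(f"ratio: {bf16 / int2:.0f}x")      # ratio: 8x
```

Under these assumptions a single 128K-token sequence occupies 18 GiB in BF16, which is why batch size is so constrained; the 16-bit to 2-bit change alone accounts for the 8x figure.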

Section 03

OSCAR's Core Idea: Attention-Aware Covariance & Offline Optimization

OSCAR's core innovation is to align KV quantization with the covariance structure of the attention mechanism. The offline procedure has three steps: 1) collect Query-Key interaction samples from representative datasets; 2) estimate covariance patterns from these interactions; 3) derive rotation matrices that minimize the impact of quantization error on attention. This alignment ensures that the 2-bit quantized KV cache retains the information that matters most for attention computation.
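The three offline steps can be illustrated with a minimal NumPy sketch. It is not OSCAR's actual algorithm: the summary does not specify how the rotation is constructed, so this example assumes the rotation is taken from the eigenvectors of the query second-moment matrix, uses a 99th-percentile rule for the clipping threshold, and substitutes synthetic data for real calibration samples. All names (`C_q`, `quantize_2bit`, `clip`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical head dimension

# Step 1: stand-in for calibration samples (synthetic, deliberately anisotropic).
scales = np.linspace(0.2, 3.0, d)
Q = rng.normal(size=(512, d)) * scales
K = rng.normal(size=(512, d)) * scales[::-1]

# Step 2: second-moment estimate. A key error e perturbs the logit by q.e,
# so E[(q.e)^2] = e^T C_q e: the query statistics weight key errors.
C_q = Q.T @ Q / len(Q)

# Step 3: an orthogonal rotation from the eigenvectors of C_q (assumed choice).
_, R = np.linalg.eigh(C_q)

def quantize_2bit(x, clip):
    """Uniform 4-level quantizer on [-clip, clip], returned in dequantized form."""
    step = 2 * clip / 3
    codes = np.clip(np.round((x + clip) / step), 0, 3)
    return codes * step - clip

K_rot = K @ R
clip = np.quantile(np.abs(K_rot), 0.99, axis=0)  # per-channel threshold (assumed rule)
K_hat = quantize_2bit(K_rot, clip)

logits_ref = Q @ K.T
logits_rot = (Q @ R) @ K_rot.T   # equals logits_ref because R is orthogonal
logits_q = (Q @ R) @ K_hat.T     # logits computed from 2-bit keys
print("mean abs logit error:", np.abs(logits_q - logits_ref).mean())
```

The property worth noting is that rotating queries and keys by the same orthogonal matrix leaves the attention logits unchanged, so the rotation only changes how quantization error is distributed across channels, not what attention computes before quantization.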

Section 04

OSCAR Deployment: Custom Kernels & Framework Integration

OSCAR ships as a deployable system: 1) custom INT2 attention kernels that are compatible with paged KV caches (as used in vLLM) and use fused pipelines to keep latency low; 2) integration with mainstream frameworks such as vLLM and SGLang, so users benefit without modifying application code.
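As a rough illustration of what INT2 storage means at the byte level, the sketch below packs four 2-bit codes into each byte and unpacks them again. The bit layout (lowest two bits first) and the function names are assumptions made for this example; the summary does not describe the layout OSCAR's kernels use, and a production kernel would do this on the GPU inside the fused attention pipeline.

```python
import numpy as np

def pack_2bit(codes):
    """Pack values in {0, 1, 2, 3} into bytes, four codes per byte."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 4 == 0, "pad to a multiple of 4 first"
    c = codes.reshape(-1, 4)
    # Code 0 occupies bits 0-1, code 1 bits 2-3, and so on.
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_2bit(packed):
    """Inverse of pack_2bit: recover the flat array of 2-bit codes."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 3).reshape(-1)

codes = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)
packed = pack_2bit(codes)
print(packed.tolist())               # [228, 27]
print(unpack_2bit(packed).tolist())  # [0, 1, 2, 3, 3, 2, 1, 0]
```

Eight codes fit in two bytes here, compared with sixteen bytes for the same eight values in BF16.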

Section 05

Experimental Evidence: Precision & Scalability

OSCAR is validated across model scales: 1) small and medium models (Qwen3-4B/8B): the accuracy gap to BF16 is only 3.78 and 1.42 percentage points respectively, while naive INT2 rotation fails; 2) large models (32B, 358B): BF16-level accuracy is maintained; 3) long context (128K RULER-NIAH): OSCAR remains stable while naive INT2 fails.

Section 06

System Benefits & Conclusion

System gains: 8x KV cache memory reduction, up to 7x throughput at large batch sizes, and up to 3x faster decoding from reduced memory bandwidth use. Conclusion: OSCAR addresses the accuracy problem of 2-bit KV quantization, making long-context LLM serving more economical and supporting its broader adoption.
