Section 01
OSCAR: 2-bit KV Cache Quantization with Spectral Covariance-Aware Rotation (Introduction)
OSCAR (Offline Spectral Covariance-Aware Rotation) targets the KV cache memory bottleneck in long-context LLM serving by quantizing the cache to 2 bits. It estimates attention-aware covariance structure offline and uses that estimate to derive the rotation and the clipping thresholds. The reported results are 8x memory compression, up to 7x higher throughput, and accuracy on par with BF16. Compression at this level is critical for making long-context LLM serving economically feasible.
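
The 8x figure follows from the bit widths: a BF16 value takes 16 bits and a 2-bit code takes 2. The sketch below illustrates only that arithmetic, together with a generic clip-then-quantize step. It is not OSCAR's method: the rotation and the covariance-based threshold derivation are omitted, the uniform 4-level grid is an assumption, the function names are hypothetical, and the clipping threshold and sample values are chosen by hand for illustration.

```python
def quantize_2bit(values, clip):
    """Clip each value to [-clip, clip] and map it to one of 4 uniform levels (codes 0..3)."""
    step = 2 * clip / 3  # 4 levels span 3 equal intervals
    codes = []
    for v in values:
        v = max(-clip, min(clip, v))  # outliers saturate at the threshold
        codes.append(round((v + clip) / step))
    return codes


def dequantize_2bit(codes, clip):
    """Map codes 0..3 back to the 4 uniform levels in [-clip, clip]."""
    step = 2 * clip / 3
    return [c * step - clip for c in codes]


def pack_codes(codes):
    """Pack four 2-bit codes into each byte (the last byte is zero-padded)."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= (c & 0b11) << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_codes(packed, n):
    """Recover the first n 2-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        for j in range(4):
            codes.append((byte >> (2 * j)) & 0b11)
    return codes[:n]


# Hand-picked illustrative values and threshold (assumptions, not from the paper).
values = [0.9, -0.2, 0.05, -1.7, 0.4, 2.5, -0.6, 0.1]
clip = 1.0

codes = quantize_2bit(values, clip)
packed = pack_codes(codes)

bf16_bytes = 2 * len(values)           # 16 bits per value
ratio = bf16_bytes / len(packed)       # 2 bits per value after packing
print(ratio)  # 8.0
```

This accounting covers only the quantized payload; any per-group metadata a real scheme stores, such as scales or thresholds, would reduce the effective ratio slightly.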