Section 01
TurboQuant: 80% Memory Reduction for Local LLM Inference via KV Cache Compression
TurboQuant, an algorithm from Google Research's ICLR 2026 paper (implemented by the open-source tqai project), combines a random orthogonal rotation with polar quantization to compress the KV cache to roughly 3 bits per channel. This cuts KV-cache memory by about 80% with almost no loss in model quality, making local LLM deployment substantially more practical. The tqai implementation supports PyTorch (CPU/CUDA) and MLX (Apple Silicon) backends.
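The rotate-then-quantize idea can be sketched with a toy example. This is a minimal illustration using plain uniform quantization, not TurboQuant's actual polar quantizer, and every function name here is hypothetical; the point is only to show why rotating with a random orthogonal matrix before low-bit quantization helps (it spreads outlier magnitudes evenly across channels):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(d, rng):
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform distribution

def quantize_uniform(x, bits=3):
    # Per-vector uniform quantization to `bits` bits (illustrative, not polar)
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / levels
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

d = 64
keys = rng.standard_normal((128, d)).astype(np.float32)  # toy slice of a KV cache

Q = random_orthogonal(d, rng)
rotated = keys @ Q                       # rotate so no single channel dominates
q, lo, scale = quantize_uniform(rotated, bits=3)
recon = dequantize(q, lo, scale) @ Q.T   # dequantize, then undo the rotation

err = np.linalg.norm(keys - recon) / np.linalg.norm(keys)
print(f"relative reconstruction error at 3 bits: {err:.3f}")
```

Storing `q` as 3-bit codes plus a small `lo`/`scale` per vector is what yields the ~80% saving relative to 16-bit floats; the orthogonal rotation is cheap to apply and exactly invertible, so it adds no information loss of its own.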