Section 01
Introduction / Main Post: Nexusquant: KV Cache Compression to Run Large Models with Longer Contexts on Consumer GPUs
This post introduces the Nexusquant project, a KV cache compression scheme that combines E8 lattice quantization with attention-aware token eviction. It reduces KV cache memory usage by 10-33x, enabling training-free local deployment of large language models with longer contexts on consumer GPUs.
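As a rough illustration of the attention-aware eviction half of the scheme, the sketch below (a hypothetical helper, not Nexusquant's actual API) keeps only the cached tokens that have accumulated the most attention weight and drops the rest of the KV cache:

```python
import numpy as np

def evict_tokens(keys, values, attn_scores, keep_ratio=0.5):
    """Attention-aware KV cache eviction sketch.

    keys, values: (seq_len, head_dim) cached key/value tensors
    attn_scores:  (seq_len,) accumulated attention weight each cached
                  token has received from later queries
    keep_ratio:   fraction of tokens to retain
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    # Indices of the most-attended tokens, restored to original order
    keep = np.sort(np.argsort(attn_scores)[-n_keep:])
    return keys[keep], values[keep]

# Toy usage: 8 cached tokens, keep the most-attended half
rng = np.random.default_rng(0)
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.05, 0.6, 0.3])
k2, v2 = evict_tokens(k, v, scores, keep_ratio=0.5)
print(k2.shape)  # → (4, 4)
```

Combined with lattice quantization of the surviving entries, this is how a scheme of this kind can reach order-of-magnitude compression without any retraining.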