Section 01
Nexusquant: KV Cache Compression Technology for Longer-Context Large Models on Consumer GPUs
Nexusquant is a large-model inference optimization project focused on KV cache compression. It combines two techniques, E8 lattice quantization and attention-aware token elimination, to reduce KV cache memory usage by a factor of 10 to 33. This lets consumer GPUs with 8-16 GB of memory run large language models locally with longer contexts, without any additional training.
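To make the memory claim concrete, the following is a rough back-of-the-envelope sketch of how KV cache size scales with context length, and what a 10x or 33x reduction would mean. The model shape used here (32 layers, 8 KV heads, head dimension 128, fp16 storage) is an assumed illustrative configuration, not something stated by the project, and the function names are hypothetical helpers rather than part of Nexusquant.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Uncompressed KV cache size in bytes for a single sequence.

    The default model shape is an assumption chosen for illustration.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


def compressed_gib(seq_len, ratio, **model_cfg):
    """KV cache size in GiB after an assumed overall compression ratio."""
    return kv_cache_bytes(seq_len, **model_cfg) / ratio / 2**30


if __name__ == "__main__":
    seq_len = 128 * 1024  # 131,072 tokens of context
    print(f"uncompressed: {kv_cache_bytes(seq_len) / 2**30:.2f} GiB")
    # The 10x and 33x ratios are the range quoted in the text.
    for ratio in (10, 33):
        print(f"{ratio}x compressed: {compressed_gib(seq_len, ratio):.2f} GiB")
```

Under these assumed numbers, a 131,072-token context needs 16 GiB for the KV cache alone, which already exceeds an 8-16 GB card before model weights are counted. At the quoted ratios the same cache would take about 1.6 GiB (10x) or under 0.5 GiB (33x).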