Section 01
nd-kv-quant: A New KV Cache Quantization Method to Optimize Large Model Inference
This article introduces nd-kv-quant, an open-source project focused on KV cache quantization and compression for Transformer models. The project proposes a quantization strategy based on norm direction and ships cross-model evaluation tools, aiming to improve large-model inference efficiency and to give researchers and engineers a standardized evaluation framework.
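The article does not yet spell out how the norm-direction strategy works, but one common reading of the idea is to split each key/value vector into its magnitude (the L2 norm, kept in full precision) and its unit direction (quantized to low-bit integers). The sketch below illustrates that decomposition on a toy KV slice; the function names, shapes, and the decomposition itself are assumptions for illustration, not the actual nd-kv-quant implementation.

```python
import numpy as np

def quantize_norm_direction(kv: np.ndarray, bits: int = 8):
    """Hypothetical norm/direction split: fp32 norms + int8 unit directions."""
    norms = np.linalg.norm(kv, axis=-1, keepdims=True)   # magnitudes kept in fp32
    direction = kv / np.maximum(norms, 1e-8)             # unit vectors, entries in [-1, 1]
    scale = 2 ** (bits - 1) - 1                          # 127 for int8
    q = np.round(direction * scale).astype(np.int8)      # quantize only the direction
    return q, norms, scale

def dequantize(q: np.ndarray, norms: np.ndarray, scale: int) -> np.ndarray:
    """Reconstruct: rescale directions back and multiply by stored norms."""
    return (q.astype(np.float32) / scale) * norms

np.random.seed(0)
kv = np.random.randn(4, 64).astype(np.float32)   # toy slice: 4 tokens, head_dim = 64
q, norms, scale = quantize_norm_direction(kv)
recon = dequantize(q, norms, scale)
err = np.abs(kv - recon).max()                   # bounded by norm * 0.5 / scale
```

Storing the norm separately means quantization error scales with the vector's direction only, so large-magnitude tokens are not penalized by a shared scale; whether nd-kv-quant does exactly this would need to be confirmed against the project's code.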