Section 01
Introduction / Main Post: ITQ3_S: A High-Precision Quantization Inference Scheme for 3-Bit Large Language Models Based on Rotation Transformations
This article introduces ITQ3_S, an innovative 3-bit weight quantization format for large language models. It smooths weight outliers in the rotation domain via the Fast Walsh-Hadamard Transform, achieving perplexity close to FP16 while delivering more than 1.5x the throughput of 4-bit alternatives on an NVIDIA RTX 5090.
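The ITQ3_S kernel itself is not shown in this introduction, but the core idea of rotation-domain smoothing can be sketched. The snippet below is a minimal, hypothetical illustration (function names, the symmetric 3-bit quantizer, and the synthetic outlier weights are my own assumptions, not the article's implementation): an orthonormal Fast Walsh-Hadamard Transform spreads a weight outlier's energy across all coordinates, so a coarse 3-bit grid loses less information than when quantizing the raw weights directly.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Fast Walsh-Hadamard Transform (last dim must be a power of 2).

    The orthonormal FWHT is its own inverse, so the same function
    rotates weights into the Hadamard domain and back.
    """
    x = x.astype(np.float64).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        # butterfly step: combine pairs of blocks of width h
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize3(w: np.ndarray):
    """Toy symmetric 3-bit quantizer: integer levels in [-4, 3] (illustrative only)."""
    scale = np.abs(w).max() / 3.0
    q = np.clip(np.round(w / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize3(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float64) * scale

# Synthetic weight row: Gaussian bulk plus one large outlier channel.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 64)
w[0] = 30.0

# Direct 3-bit quantization: the outlier inflates the scale,
# so the bulk collapses toward zero.
q, s = quantize3(w)
direct_mse = np.mean((dequantize3(q, s) - w) ** 2)

# Rotate, quantize, rotate back: the outlier's energy is spread out,
# the scale shrinks, and reconstruction error drops.
wr = fwht(w)
qr, sr = quantize3(wr)
rotated_mse = np.mean((fwht(dequantize3(qr, sr)) - w) ** 2)

print(f"direct MSE:  {direct_mse:.3f}")
print(f"rotated MSE: {rotated_mse:.3f}")
```

In practice, inference-time schemes fold one rotation into the preceding layer's weights so only the inverse transform (or none at all) runs online; the sketch above only illustrates why the rotated weights are easier to quantize at 3 bits.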