Zing Forum


ITQ3_S: A High-Precision Quantization Inference Scheme for 3-Bit Large Language Models Based on Rotation Transformations

This article introduces ITQ3_S, a novel 3-bit weight quantization format for large language models. It smooths weight distributions in the rotation domain via the Fast Walsh-Hadamard Transform, attaining perplexity comparable to FP16 while delivering more than 1.5× the throughput of 4-bit alternatives on an NVIDIA RTX 5090.

LLM Quantization · 3-bit Inference · TurboQuant · FWHT · CUDA Optimization
Published 2026-03-30 08:03 · Recent activity 2026-04-01 12:47 · Estimated read 1 min

Section 01

Introduction / Main Floor: ITQ3_S: A High-Precision Quantization Inference Scheme for 3-Bit Large Language Models Based on Rotation Transformations

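The core idea behind rotation-domain smoothing can be sketched with a small NumPy experiment. The snippet below is not the ITQ3_S implementation (which is not shown in this post); it is a minimal illustration, with function names of my own choosing, of why quantizing after an orthonormal Hadamard rotation helps: a single outlier weight inflates the per-tensor scale and crushes every other weight to zero, whereas the rotation spreads the outlier's energy across all coordinates, so the same 3-bit grid fits the rotated values much better. Because the orthonormal FWHT preserves Euclidean distances and is its own inverse, the quantization error measured after rotating back equals the error in the rotated domain.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Fast Walsh-Hadamard Transform (self-inverse); length must be a power of two."""
    x = x.astype(np.float64).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b          # butterfly: sum
            x[..., i + h:i + 2 * h] = a - b  # butterfly: difference
        h *= 2
    return x / np.sqrt(n)  # 1/sqrt(n) scaling makes the transform orthonormal

def quant3(w: np.ndarray):
    """Symmetric 3-bit quantization onto the 7 levels {-3, ..., 3} with one per-tensor scale."""
    s = np.abs(w).max() / 3.0
    q = np.clip(np.round(w / s), -3, 3).astype(np.int8)
    return q, s

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.5, 64)
w[0] = 10.0  # one outlier weight dominates the quantization scale

# Direct 3-bit quantization: the outlier sets a large scale, flattening small weights.
q, s = quant3(w)
err_direct = np.mean((q * s - w) ** 2)

# Rotate, quantize in the Hadamard domain, rotate back (FWHT is its own inverse).
y = fwht(w)
q, s = quant3(y)
err_rot = np.mean((fwht(q * s) - w) ** 2)

print(f"direct MSE: {err_direct:.4f}, rotated MSE: {err_rot:.4f}")
```

With the fixed seed above, the rotated-domain mean-squared error comes out several times smaller than the direct one, which is the effect the article attributes to its FWHT-based smoothing step.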