Section 01
[Introduction] ITQ3_S: A High-Precision Breakthrough Solution for 3-Bit Large Model Inference
ITQ3_S is an interleaved ternary quantization technique built on rotation-domain smoothing. At its core, it applies the Fast Walsh-Hadamard Transform (FWHT) to pre-rotate the weight space, spreading the energy of outlier weights across the entire vector so that 3-bit quantization achieves perplexity close to FP16. On the NVIDIA RTX 5090, its throughput also reaches more than 1.5x that of 4-bit alternatives, giving low-bit inference of large models a balance of high precision and high performance.
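To make the outlier-dispersal idea concrete, here is a minimal pure-Python sketch of an orthonormal FWHT applied to a toy weight vector. It is not the ITQ3_S kernel: the function name `fwht`, the 8-element vector, and its values are assumptions chosen for illustration, and the interleaving and 3-bit packing steps are not shown.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    y = [float(v) for v in values]
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine elements that sit h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    # Scaling by 1/sqrt(n) makes the transform orthonormal and self-inverse.
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


# Toy weight vector (made-up values): small entries plus one large outlier.
weights = [0.1, -0.2, 0.05, 8.0, -0.1, 0.15, 0.0, -0.05]
rotated = fwht(weights)
restored = fwht(rotated)  # applying the same transform again undoes the rotation

peak_before = max(abs(v) for v in weights)
peak_after = max(abs(v) for v in rotated)
energy_before = sum(v * v for v in weights)
energy_after = sum(v * v for v in rotated)

print("peak magnitude:", peak_before, "->", peak_after)
print("energy:", energy_before, "->", energy_after)
```

The rotation keeps the total energy of the vector unchanged while lowering its peak magnitude, so a quantizer with only a few levels no longer has to stretch its range to cover a single outlier. Because the orthonormal transform is its own inverse, the original weights can be recovered by applying it a second time.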