Zing Forum


DuQuant++: A New Fine-Grained Rotational Quantization Method for MXFP4 Micro-Scaling

Researchers propose the DuQuant++ method to address the activation outlier problem in the MXFP4 format. By using single-round outlier-aware rotation, it achieves more efficient W4A4 quantization and reaches SOTA performance on the LLaMA-3 model.

Tags: model quantization, MXFP4, DuQuant, low-precision inference, activation outliers, LLaMA-3, NVIDIA Blackwell
Published 2026-04-20 12:27 | Recent activity 2026-04-22 12:37 | Estimated read 3 min

Section 01

DuQuant++: A New Fine-Grained Rotational Quantization Method to Solve MXFP4 Activation Outliers (Introduction)

Researchers propose the DuQuant++ method to address the activation outlier problem in the MXFP4 format. Using single-round outlier-aware rotation, it achieves more efficient W4A4 quantization, reaches SOTA performance on the LLaMA-3 model, halves online computation cost, and is compatible with the NVIDIA Blackwell architecture.


Section 02

Background: Quantization Inference and Challenges of MXFP4

Deploying large models is constrained by memory and compute, making quantization a key enabling technology. However, the MXFP4 format (in which 32-element blocks share one scaling factor; natively supported by Blackwell) suffers from an activation outlier problem: a single outlier forces the block's shared scaling factor up, squeezing the dynamic range left for the other 31 elements.
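To make the outlier problem concrete, here is a minimal NumPy sketch of MXFP4-style block quantization: a 32-element block shares one scale over the FP4 (E2M1) magnitude grid. The function name and the plain float scale are my simplifications; real MXFP4 uses a power-of-two (E8M0) shared scale, but the squeezing effect is the same.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; the shared per-block scale maps the
# block's largest element onto the top grid value (6.0).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(block):
    """Quantize one block with a single shared scale (illustrative sketch)."""
    scale = np.abs(block).max() / FP4_GRID[-1]
    mags = np.abs(block) / scale
    # round each magnitude to the nearest FP4 grid point, keep the sign
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.1, 32)   # typical small activations
spiked = block.copy()
spiked[0] = 10.0                   # one outlier inflates the shared scale
# With the outlier, the finest nonzero grid step becomes 0.5 * (10/6) ~ 0.83,
# so the small elements are mostly rounded to zero.
```

Running `mxfp4_quantize` on both blocks shows the mean error on the 31 small elements is far larger once the outlier is present, even though the outlier itself is represented well.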


Section 03

Limitations of Existing Rotation Schemes

Existing rotation methods have flaws: random Hadamard rotation is data-agnostic, which limits its effectiveness; learnable rotation requires additional training and its generalization is questionable. Neither exploits information about the outlier distribution.
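For reference, a random Hadamard rotation can be sketched in a few lines. It is exactly orthogonal (norm-preserving) and spreads an outlier evenly across the block, but the same matrix is applied no matter where outliers actually occur, which is the data-agnosticism noted above. The Sylvester construction is standard; the variable names are mine.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 32
rng = np.random.default_rng(0)
D = np.diag(rng.choice([-1.0, 1.0], size=n))  # random sign flips
R = hadamard(n) @ D / np.sqrt(n)              # orthogonal rotation matrix

x = np.zeros(n)
x[0] = 8.0        # a single activation outlier in channel 0
y = R @ x         # after rotation, the outlier's mass is spread evenly
```

Because `R` is orthogonal, the vector norm is unchanged while the peak magnitude drops from 8.0 to 8/sqrt(32); the limitation is that this flattening is blind to which channels are actually outlier-prone.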


Section 04

Core Innovations of DuQuant++

  1. Block size aligned with MXFP4's 32-element groups.
  2. A single-round outlier-aware rotation replacing the previous two-round process.
  3. Rotation matrices constructed from activation statistics, dispersing outliers precisely while preserving orthogonality.
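The paper's actual rotation construction is not reproduced here. As a hedged illustration of point 3 — an orthogonal rotation built from activation statistics — the sketch below uses a Householder reflection that sends the axis of a detected outlier channel onto the uniform direction, spreading that channel's mass evenly across the 32-element block while remaining exactly orthogonal. The function name and channel-selection step are hypothetical, not DuQuant++'s algorithm.

```python
import numpy as np

def outlier_aware_rotation(n, k):
    """Orthogonal matrix mapping axis e_k onto the uniform direction 1/sqrt(n)
    via a Householder reflection (illustrative, not the paper's construction)."""
    e = np.zeros(n); e[k] = 1.0
    u = np.ones(n) / np.sqrt(n)
    v = e - u
    v /= np.linalg.norm(v)
    return np.eye(n) - 2.0 * np.outer(v, v)  # reflection swapping e_k and u

n = 32
acts = np.zeros(n)
acts[5] = 8.0                    # suppose channel 5 carries the outlier
k = int(np.abs(acts).argmax())   # "statistics": pick the outlier channel
R = outlier_aware_rotation(n, k)
rotated = R @ acts               # entries now all have magnitude 8/sqrt(32)
```

Unlike a random Hadamard matrix, this rotation is chosen from the data: the channel that actually holds the outlier is targeted, which is the intuition behind building rotation matrices from activation statistics.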

Section 05

Efficiency Advantages and Experimental Validation

Single-round rotation halves the online computation cost; under W4A4 quantization of LLaMA-3, DuQuant++ outperforms baselines on multiple tasks, including commonsense reasoning and code generation, reaching SOTA levels.


Section 06

Hardware Coordination and Practical Insights

The method is compatible with the NVIDIA Blackwell architecture, which natively supports MXFP4. Practical takeaways: MXFP4 is a good fit for Blackwell hardware, outlier handling is the key to low-bit quantization, and algorithms should align with the format's grouping structure.


Section 07

Future Directions

Extend to other low-precision formats, combine with techniques such as smoothing and clipping, explore more aggressive configurations such as W2A2/W3A3, and develop hardware-friendly rotation implementations.