Zing Forum


DuQuant++: A New Fine-Grained Rotational Quantization Method for MXFP4 Micro-Scaling Format

DuQuant++ achieves fine-grained rotational optimization for activation outliers by aligning the rotation block size with the MXFP4 micro-scaling group size, reducing online rotation computation cost by half while maintaining SOTA performance.

Tags: Quantization · MXFP4 · LLM Inference Optimization · NVIDIA Blackwell · LLaMA-3 · Outlier Handling · Rotation Transforms
Published 2026-04-20 12:27 · Recent activity 2026-04-21 14:20 · Estimated read 5 min

Section 01

Introduction: DuQuant++ — A New Fine-Grained Rotational Quantization Scheme for MXFP4 Format

DuQuant++ is a new fine-grained rotational quantization method for the MXFP4 micro-scaling format. By aligning the rotation block size with the MXFP4 group size, it achieves precise optimization of activation outliers. While maintaining SOTA performance, this method reduces online rotation computation cost by half, providing a new path for efficient deployment of large models at 4-bit precision.


Section 02

Background: Quantization Dilemmas in Large Model Inference and Opportunities with MXFP4

As LLMs scale up, memory bandwidth and compute cost make inference a bottleneck, and traditional quantization techniques struggle to preserve model quality at ultra-low precision (e.g., 4-bit). The MXFP4 format introduced with NVIDIA's Blackwell architecture divides tensors into 32-element groups, each sharing a single scaling factor, with native Tensor Core acceleration. In principle, this enables aggressive W4A4 compression without sacrificing throughput.
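As a rough illustration of the group-scaling scheme described above, here is a minimal NumPy sketch of MXFP4-style quantization. It assumes the OCP Microscaling layout (32-element groups, one shared power-of-two scale per group, 4-bit E2M1 element values with maximum magnitude 6.0); the helper names are ours, not from the paper:

```python
import numpy as np

# E2M1 representable magnitudes (4-bit: 1 sign + 2 exponent + 1 mantissa bits)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_group(x):
    """Quantize one 32-element group; returns the dequantized values."""
    assert x.size == 32
    amax = np.abs(x).max()
    # Shared power-of-two scale chosen so the group max fits the E2M1 range
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    mag = np.clip(np.abs(x) / scale, 0.0, 6.0)
    # Snap each magnitude to the nearest representable E2M1 value, keep sign
    idx = np.abs(FP4_GRID[None, :] - mag[:, None]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

x = np.random.default_rng(0).normal(size=32)
xq = quantize_mxfp4_group(x)
```

Because the scale is a power of two shared by all 32 elements, the per-group metadata is a single byte, which is what makes the format hardware-friendly.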


Section 03

Core Challenge of MXFP4: The Domino Effect of Outliers

Under MXFP4's group-shared scaling mechanism, a single activation outlier inflates the scaling factor of its entire 32-element group, compressing the dynamic range left for the normal elements and amplifying their quantization error. LLM activation distributions are long-tailed with sparse outliers, which puts them in structural conflict with MXFP4's fixed grouping strategy.
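The domino effect is easy to demonstrate numerically. The sketch below (illustrative only, not the paper's code) measures the RMS quantization error of a 32-element group with and without one injected outlier; the outlier inflates the shared scale, so the other 31 elements land on a much coarser grid:

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def group_error(x):
    """RMS quantization error of one group under a shared power-of-two scale."""
    scale = 2.0 ** (np.floor(np.log2(np.abs(x).max())) - 2)
    mag = np.clip(np.abs(x) / scale, 0.0, 6.0)
    idx = np.abs(FP4_GRID[None, :] - mag[:, None]).argmin(axis=1)
    xq = np.sign(x) * FP4_GRID[idx] * scale
    return np.sqrt(np.mean((xq - x) ** 2))

rng = np.random.default_rng(1)
normal = rng.normal(scale=0.1, size=32)   # a well-behaved group
outlier = normal.copy()
outlier[0] = 8.0                          # one long-tail activation outlier

# With the outlier present, the shared scale grows to cover 8.0, and most
# small elements round all the way to zero.
```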


Section 04

Limitations of Existing Rotation Schemes: Data-Independent Blindness

Existing rotation schemes (the random Hadamard transform, learnable rotations) share a data-independent flaw: the random Hadamard transform disperses outliers blindly, while learnable rotations optimize for global error rather than for the outlier channels themselves. The result is wasted effort: the entire tensor undergoes a complex transformation just to handle a few outlier channels.
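For intuition about the data-independent baseline, here is a sketch of a randomized Hadamard transform (Sylvester construction, our own helper): it spreads a single outlier channel's energy perfectly evenly across all channels, no matter where the outlier sits, which is exactly the "blind dispersal" criticized above:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester recursion; n a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

x = np.zeros(32)
x[7] = 10.0                                              # one outlier channel
H = hadamard(32)
D = np.sign(np.random.default_rng(2).normal(size=32))    # random sign flips
y = H @ (D * x)                                          # randomized rotation

# Total energy is preserved, but the peak is flattened: every output
# channel ends up with magnitude 10/sqrt(32), ~1.77.
```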


Section 05

DuQuant++ Innovation: Fine-Grained Outlier-Aware Rotation

The core innovation of DuQuant++ is aligning the rotation block with MXFP4's 32-element group size, which simplifies the preprocessing pipeline (no double rotation or zigzag permutation is needed). By identifying the channels where outliers concentrate and constructing rotation matrices that disperse their energy, it optimizes precisely where it matters and cuts the online rotation cost in half. It also smooths the weight distribution, further suppressing quantization error.
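A minimal sketch of the alignment idea, under our own naming (the paper's actual rotation is built from calibration data to target outlier channels; here any orthogonal matrix stands in): rotating each 32-element chunk independently makes the rotation blocks and the MXFP4 scaling groups coincide, while the transform stays norm-preserving and exactly invertible:

```python
import numpy as np

GROUP = 32  # MXFP4 micro-scaling group size, also the rotation block size

def block_rotate(x, R):
    """Rotate a 1-D activation vector group by group with one GROUPxGROUP orthogonal R."""
    assert x.size % GROUP == 0 and R.shape == (GROUP, GROUP)
    return (x.reshape(-1, GROUP) @ R.T).reshape(-1)

rng = np.random.default_rng(3)
# Random orthogonal stand-in; DuQuant++ would construct R from calibration data
R, _ = np.linalg.qr(rng.normal(size=(GROUP, GROUP)))

x = rng.normal(size=128)       # four groups
y = block_rotate(x, R)         # rotation never mixes values across group borders
```

Because the rotation is block-diagonal at exactly the group granularity, its online cost is one small 32×32 multiply per group rather than a transform over the full hidden dimension, which is where the claimed cost reduction comes from.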


Section 06

Experimental Validation: SOTA Performance on LLaMA-3

Under the W4A4 quantization configuration on the LLaMA-3 model family, DuQuant++ achieves SOTA performance. Compared with the original DuQuant, rotation overhead drops by 50% while perplexity and downstream task accuracy improve further, validating the "alignment equals simplification" approach.


Section 07

Engineering Significance and Outlook: A Practical Path for LLM Quantization

DuQuant++ advances LLM quantization toward practicality, adapting to the MXFP4 format of NVIDIA Blackwell and subsequent architectures, making the deployment of high-quality large models at 4-bit precision an engineering reality. The code has been open-sourced, providing a ready-to-use optimization path for LLM deployment in resource-constrained environments without modifying the architecture or retraining.