# DuQuant++: A New Fine-Grained Rotational Quantization Method for MXFP4 Micro-Scaling

> Researchers propose the DuQuant++ method to address the activation outlier problem in the MXFP4 format. By using single-round outlier-aware rotation, it achieves more efficient W4A4 quantization and reaches SOTA performance on the LLaMA-3 model.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-20T04:27:28.000Z
- Last activity: 2026-04-22T04:37:23.731Z
- Popularity: 100.8
- Keywords: model quantization, MXFP4, DuQuant, low-precision inference, activation outliers, LLaMA-3, NVIDIA Blackwell
- Page link: https://www.zingnex.cn/en/forum/thread/duquant-mxfp4-21191da2
- Canonical: https://www.zingnex.cn/forum/thread/duquant-mxfp4-21191da2

---

## Introduction: DuQuant++ and MXFP4 Activation Outliers

DuQuant++ targets the activation outlier problem in the MXFP4 micro-scaling format. It replaces a two-round rotation process with a single round of outlier-aware rotation, which makes W4A4 quantization (4-bit weights, 4-bit activations) more efficient. According to the authors, the method reaches state-of-the-art accuracy on LLaMA-3, halves the online computation cost of rotation, and is compatible with the NVIDIA Blackwell architecture.

## Background: Quantization Inference and Challenges of MXFP4

Deploying large models is constrained by storage and compute, which makes quantization a key technique. MXFP4, which Blackwell supports natively, stores values in blocks of 32 elements that share a single scaling factor. This layout is sensitive to activation outliers: one outlier forces the block's scaling factor up, so the other 31 elements are left with only a few coarse levels and lose most of their precision.
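
To make the outlier effect concrete, here is a minimal fake-quantization sketch in plain Python. It assumes the usual MXFP4 layout (FP4 E2M1 elements, a power-of-two scale shared by 32 elements) and picks the scale from the block's largest magnitude; the function names, the toy data, and the tie-breaking in rounding are illustrative assumptions, not taken from the DuQuant++ work.

```python
import math

# Non-negative magnitudes representable by an FP4 E2M1 element.
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # MXFP4 block size: 32 elements share one scale


def shared_scale(block):
    """Power-of-two scale derived from the block's largest magnitude."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    # The largest E2M1 exponent is 2 (4.0 == 2**2), hence the "- 2".
    return 2.0 ** (math.floor(math.log2(amax)) - 2)


def quantize_block(block):
    """Fake-quantize one block: scale, snap to an FP4 level, rescale."""
    scale = shared_scale(block)
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_LEVELS[-1])  # saturate at 6.0
        # Simplification: ties go to the smaller level, not round-to-even.
        q = min(FP4_LEVELS, key=lambda level: abs(level - mag))
        out.append(math.copysign(q, v) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# 31 small activations; the last slot is either ordinary or an outlier.
small = [0.1 * ((i % 7) - 3) for i in range(BLOCK - 1)]
clean = small + [0.2]
spiky = small + [40.0]

# Error measured only on the 31 small values that both blocks share.
err_clean = mse(small, quantize_block(clean)[:-1])
err_spiky = mse(small, quantize_block(spiky)[:-1])
print(f"scale: clean={shared_scale(clean)}, spiky={shared_scale(spiky)}")
print(f"MSE on small values: clean={err_clean:.5f}, spiky={err_spiky:.5f}")
```

With these toy values the shared scale rises from 0.0625 to 8, so the smallest non-zero magnitude the block can represent becomes 4 and all 31 small activations round to zero.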

## Limitations of Existing Rotation Schemes

Existing rotation methods fall short in two ways. Random Hadamard rotation is data-agnostic, so its effectiveness is limited; learnable rotation requires additional training, and its generalization is questionable. Neither makes use of how the outliers are actually distributed.
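
For reference, a randomized Hadamard rotation on one 32-element block can be sketched as below. This is a generic illustration (Sylvester construction plus random sign flips, with a hypothetical seed and toy activation vector), not code from any of the compared methods. Note that nothing in `random_hadamard` looks at the activations: the same matrix is used wherever the outliers happen to sit.

```python
import math
import random


def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]


def random_hadamard(n, seed=0):
    """Hadamard matrix with random row sign flips; still orthogonal."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    h = hadamard(n)
    return [[signs[i] * h[i][j] for j in range(n)] for i in range(n)]


def rotate(x, r):
    """Row vector times matrix: y = x @ r."""
    n = len(x)
    return [sum(x[i] * r[i][j] for i in range(n)) for j in range(n)]


BLOCK = 32
R = random_hadamard(BLOCK)

# Toy activation block: small values plus one outlier in channel 5.
x = [0.1] * BLOCK
x[5] = 40.0
y = rotate(x, R)

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in y)
print(f"peak |x| = {peak_before:.2f}, peak |x @ R| = {peak_after:.2f}")
```

The rotation preserves the vector's norm and spreads the single spike across all 32 channels, but it does so blindly, which is the limitation the outlier-aware approach is meant to address.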

## Core Innovations of DuQuant++

1. The rotation block size aligns with the 32-element groups of MXFP4.
2. A single round of outlier-aware rotation replaces the two-round process.
3. Rotation matrices are constructed from activation data statistics, so they disperse outliers precisely while remaining orthogonal.
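
The summary does not spell out how the rotation matrices are built, so the following is only a simplified stand-in for the idea in point 3: use calibration statistics to find the worst channel in a 32-element block, then build an orthogonal matrix that spreads that channel evenly. The sketch uses a Householder reflection, and every name and value in it (`channel_peaks`, `outlier_aware_reflection`, the toy calibration data) is hypothetical.

```python
import math

BLOCK = 32  # one rotation block per MXFP4 group


def channel_peaks(calib):
    """Per-channel max |activation| over calibration rows."""
    return [max(abs(row[j]) for row in calib) for j in range(len(calib[0]))]


def outlier_aware_reflection(peaks):
    """Householder matrix mapping the worst channel onto the uniform direction."""
    n = len(peaks)
    k = max(range(n), key=lambda j: peaks[j])  # channel with the largest peak
    t = 1.0 / math.sqrt(n)
    u = [-t] * n
    u[k] += 1.0  # u = e_k - (1/sqrt(n)) * ones
    uu = sum(v * v for v in u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / uu
             for j in range(n)] for i in range(n)]


def vecmat(x, m):
    """Row vector times matrix."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Toy calibration set: small activations, persistent outlier in channel 11.
calib = []
for step in range(8):
    row = [0.1 * math.sin(step + j) for j in range(BLOCK)]
    row[11] = 25.0 + step
    calib.append(row)

Q = outlier_aware_reflection(channel_peaks(calib))

x = calib[0]
y = vecmat(x, Q)
peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in y)

# Orthogonality keeps the layer output unchanged: (x Q)(Q^T w) == x w.
w = [0.01 * (j - 16) for j in range(BLOCK)]
out_rotated = dot(y, vecmat(w, Q))
out_plain = dot(x, w)
print(f"peak {peak_before:.2f} -> {peak_after:.2f}; "
      f"output {out_plain:.6f} vs {out_rotated:.6f}")
```

Because the matrix is orthogonal, rotating the activations and counter-rotating the weights leaves the full-precision output unchanged; only the quantization error changes. A real method would have to handle several outlier channels per block, which this single-reflection sketch does not attempt.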

## Efficiency Advantages and Experimental Validation

Because only one rotation round runs at inference time, the online computation cost of rotation is halved. Under W4A4 quantization of LLaMA-3, DuQuant++ outperforms the baselines on multiple tasks, including commonsense reasoning and code generation, and reaches state-of-the-art results.
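
The halving claim can be checked with a back-of-the-envelope count. The sketch below assumes a block-diagonal rotation made of dense 32×32 blocks, the same block size in both schemes, and a hidden size of 4096; these are illustrative assumptions, and the count ignores any permutation or memory-movement cost.

```python
def rotation_macs(hidden_dim, block, rounds):
    """Multiply-accumulates per token for a block-diagonal rotation."""
    if hidden_dim % block != 0:
        raise ValueError("hidden_dim must be a multiple of block")
    num_blocks = hidden_dim // block
    # Each block is a dense (block x block) matrix-vector product.
    return rounds * num_blocks * block * block


HIDDEN = 4096  # assumed hidden size, for illustration only
single_round = rotation_macs(HIDDEN, 32, rounds=1)
two_rounds = rotation_macs(HIDDEN, 32, rounds=2)
print(single_round, two_rounds, single_round / two_rounds)
```

Under these assumptions the per-token cost is proportional to the number of rounds, so dropping from two rounds to one cuts the online rotation work in half.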

## Hardware Coordination and Practical Insights

DuQuant++ is compatible with the NVIDIA Blackwell architecture, which supports MXFP4 natively. Three practical takeaways follow: MXFP4 is a good fit for Blackwell hardware, outlier handling is the key to low-bit quantization, and a quantization algorithm should align with the grouping structure of its target format.

## Future Directions

Possible extensions include applying the method to other low-precision formats, combining it with techniques such as smoothing and clipping, exploring more aggressive configurations such as W2A2 and W3A3, and developing hardware-friendly rotation implementations.
