# DuQuant++: A New Fine-Grained Rotational Quantization Method for MXFP4 Micro-Scaling Format

> DuQuant++ achieves fine-grained rotational optimization for activation outliers by aligning the rotation block size with the MXFP4 micro-scaling group size, reducing online rotation computation cost by half while maintaining SOTA performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-20T04:27:28.000Z
- Last activity: 2026-04-21T06:20:54.394Z
- Popularity: 125.1
- Keywords: quantization, MXFP4, large language models, inference optimization, NVIDIA Blackwell, LLaMA-3, outlier handling, rotational transforms
- Page link: https://www.zingnex.cn/en/forum/thread/duquant-mxfp4
- Canonical: https://www.zingnex.cn/forum/thread/duquant-mxfp4
- Markdown source: floors_fallback

---

## Introduction: DuQuant++, a Fine-Grained Rotational Quantization Scheme for the MXFP4 Format

DuQuant++ is a new fine-grained rotational quantization method for the MXFP4 micro-scaling format. By aligning the rotation block size with the MXFP4 group size, it achieves precise optimization of activation outliers. While maintaining SOTA performance, this method reduces online rotation computation cost by half, providing a new path for efficient deployment of large models at 4-bit precision.

## Background: Quantization Dilemmas in Large Model Inference and Opportunities with MXFP4

As LLMs scale up, inference becomes increasingly bound by memory bandwidth and compute cost. Traditional quantization techniques struggle to preserve model quality at ultra-low precision (e.g., 4-bit). The MXFP4 format introduced with NVIDIA's Blackwell architecture divides tensors into 32-element groups, each sharing a single scaling factor, and is accelerated natively by Tensor Cores. In principle, it enables extreme W4A4 compression without sacrificing throughput.
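The group mechanics can be sketched in a few lines of NumPy. This is a simplified model, not the exact MX specification: real MXFP4 stores FP4 (E2M1) element codes plus an 8-bit power-of-two (E8M0) shared scale per 32-element group, and hardware rounding details may differ from the nearest-code rounding used here.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; the largest is 6.0
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GROUP = 32  # MXFP4 micro-scaling group size

def mxfp4_quantize(x):
    """Quantize a 1-D tensor whose length is a multiple of 32.
    Each 32-element group shares one power-of-two scale (simplified sketch)."""
    g = x.reshape(-1, GROUP)
    amax = np.abs(g).max(axis=1, keepdims=True)
    # choose the power-of-two scale that maps the group max onto 6.0
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-12) / 6.0))
    grid = np.concatenate([-FP4_GRID[:0:-1], FP4_GRID])  # signed code book
    # round every scaled element to its nearest FP4 code
    idx = np.abs((g / scale)[..., None] - grid).argmin(axis=-1)
    return (grid[idx] * scale).reshape(x.shape)
```

With the scale chosen this way, the worst-case error for any element in a group is half of the local FP4 step size times the shared scale, which is exactly why one large element in a group hurts its 31 neighbors.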

## Core Challenge of MXFP4: The Domino Effect of Outliers

Under MXFP4's group-shared scaling, a single activation outlier inflates the scaling factor of its entire 32-element group, compressing the dynamic range left for the normal elements and amplifying their quantization error. LLM activation distributions are long-tailed, with sparse but extreme outliers, which conflicts structurally with MXFP4's fixed grouping strategy.
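A tiny numeric example makes the domino effect concrete. The values below are invented for illustration; the power-of-two scale rule follows the MX convention of mapping the group max onto FP4's largest magnitude, 6.0.

```python
import numpy as np

# One 32-element activation group: 31 small values plus a single outlier.
normal = np.full(31, 0.1)
group = np.concatenate([normal, [12.8]])

def po2_scale(g):
    """Power-of-two scale mapping the group max onto FP4's max code, 6.0."""
    return 2.0 ** np.ceil(np.log2(np.abs(g).max() / 6.0))

s_with = po2_scale(group)    # the outlier dictates the shared scale
s_alone = po2_scale(normal)  # scale the normal values would get by themselves

# FP4's smallest nonzero step is 0.5, so resolution = 0.5 * scale.
# Here the outlier makes the normals' grid 128x coarser, and every 0.1
# rounds to 0 (the nearest codes at scale 4 are 0.0 and 0.5 * 4 = 2.0).
print(s_with / s_alone)  # → 128.0
```

One element out of 32 is enough to erase the other 31, which is the structural conflict described above.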

## Limitations of Existing Rotation Schemes: Data-Independent Blindness

Existing rotation schemes (random Hadamard transforms, learnable rotations) share a data-independent flaw: a random Hadamard disperses outliers blindly, while a learnable rotation optimizes a global error objective rather than targeting the outlier channels. The result is wasted effort: the entire tensor undergoes a complex transformation just to handle a handful of outlier channels.
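To see what "blind dispersal" means, here is a minimal randomized-Hadamard sketch (Sylvester construction with a random sign-flip diagonal, the standard randomization; not specific to any one paper). The outlier's energy is spread perfectly evenly, but every channel is transformed whether or not it held an outlier.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n = power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
x = np.zeros(32)
x[7] = 10.0                                    # a single outlier channel
D = np.diag(rng.choice([-1.0, 1.0], size=32))  # random sign flips
R = hadamard(32) @ D                           # randomized Hadamard rotation
y = R @ x

# The rotation is orthogonal (norm-preserving), and the outlier's energy
# is now uniform: every |y_i| equals 10 / sqrt(32), about 1.77.
```

Note that the transform has no idea channel 7 was special; it would apply the same mixing to a tensor with no outliers at all, which is exactly the inefficiency DuQuant++ targets.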

## DuQuant++ Innovation: Fine-Grained Outlier-Aware Rotation

The core innovation of DuQuant++ is to align the rotation block with MXFP4's 32-element group, which simplifies the preprocessing pipeline (no double rotation or zigzag permutation is needed). It identifies the channels where outliers concentrate and constructs rotation matrices that disperse their energy within each group, cutting online rotation cost in half. The same rotations also smooth the weight distribution, further suppressing quantization error.
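A rough sketch of the idea, using a Householder reflector as a hypothetical stand-in for the actual DuQuant++ rotation construction (the paper's matrices are built differently; this only illustrates "spread the dominant channel's energy within one 32-element block"):

```python
import numpy as np

BLOCK = 32  # rotation block aligned with the MXFP4 group size

def outlier_aware_rotation(block):
    """Orthogonal matrix spreading the dominant channel's energy.
    Householder reflector mapping the outlier axis onto the uniform
    direction; a stand-in, not the DuQuant++ construction itself."""
    k = np.abs(block).argmax()                # channel holding the outlier
    e = np.zeros(BLOCK); e[k] = 1.0
    u = np.full(BLOCK, 1.0 / np.sqrt(BLOCK))  # uniform unit vector
    v = e - u
    v /= np.linalg.norm(v)
    return np.eye(BLOCK) - 2.0 * np.outer(v, v)  # reflects e onto u

x = np.zeros(BLOCK)
x[:5] = 0.05
x[5] = 8.0                        # one dominant outlier channel
R = outlier_aware_rotation(x)
y = R @ x                         # outlier energy spread across the block
```

Because the rotation never crosses a 32-element boundary, it is block-diagonal at exactly the granularity the MXFP4 scale is shared, so each group's scale is set by the flattened values rather than by one spike, and the online transform touches only one group's worth of elements per block.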

## Experimental Validation: SOTA Performance on LLaMA-3

Under a W4A4 quantization configuration on the LLaMA-3 model family, DuQuant++ achieves SOTA results. Compared with the original DuQuant, online rotation overhead drops by 50% while perplexity and downstream-task accuracy both improve, validating the 'alignment equals simplification' approach.

## Engineering Significance and Outlook: A Practical Path for LLM Quantization

DuQuant++ advances LLM quantization toward practicality, adapting to the MXFP4 format of NVIDIA Blackwell and subsequent architectures, making the deployment of high-quality large models at 4-bit precision an engineering reality. The code has been open-sourced, providing a ready-to-use optimization path for LLM deployment in resource-constrained environments without modifying the architecture or retraining.
