Zing Forum


DuQuant++: A New Fine-Grained Rotational Quantization Method for MXFP4 Micro-Scaling Format

DuQuant++ achieves fine-grained rotational optimization for activation outliers by aligning the rotation block size with the MXFP4 micro-scaling group size, reducing online rotation computation cost by half while maintaining SOTA performance.

Tags: Quantization · MXFP4 · LLM Inference Optimization · NVIDIA Blackwell · LLaMA-3 · Outlier Handling · Rotation Transforms
Published 2026-04-20 12:27 · Recent activity 2026-04-21 14:20 · Estimated read 5 min

Section 01

Introduction: DuQuant++ — A New Fine-Grained Rotational Quantization Scheme for MXFP4 Format

DuQuant++ is a new fine-grained rotational quantization method for the MXFP4 micro-scaling format. By aligning the rotation block size with the MXFP4 group size, it achieves precise optimization of activation outliers. While maintaining SOTA performance, this method reduces online rotation computation cost by half, providing a new path for efficient deployment of large models at 4-bit precision.


Section 02

Background: Quantization Dilemmas in Large Model Inference and Opportunities with MXFP4

As LLMs scale up, memory bandwidth and compute cost make inference a bottleneck, and traditional quantization techniques struggle to preserve model quality at ultra-low precision (e.g., 4-bit). The MXFP4 format introduced with NVIDIA's Blackwell architecture divides tensors into 32-element groups, each sharing a single scaling factor, with native Tensor Core acceleration. In principle, this enables aggressive W4A4 compression without sacrificing throughput.
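As a rough illustration of the group-scaling scheme described above, here is a minimal NumPy sketch of MXFP4-style quantization. It assumes the OCP Microscaling layout (32-element groups, one shared power-of-two scale per group, 4-bit E2M1 element values with maximum magnitude 6.0); the helper names are ours, not from the paper:

```python
import numpy as np

# E2M1 representable magnitudes (4-bit: 1 sign + 2 exponent + 1 mantissa bits)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_group(x):
    """Quantize one 32-element group; returns the dequantized values."""
    assert x.size == 32
    amax = np.abs(x).max()
    # Shared power-of-two scale chosen so the group max fits the E2M1 range
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    mag = np.clip(np.abs(x) / scale, 0.0, 6.0)
    # Snap each magnitude to the nearest representable E2M1 value, keep sign
    idx = np.abs(FP4_GRID[None, :] - mag[:, None]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

x = np.random.default_rng(0).normal(size=32)
xq = quantize_mxfp4_group(x)
```

Because the scale is a power of two shared by all 32 elements, the per-group metadata is a single byte, which is what makes the format hardware-friendly.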


Section 03

Core Challenge of MXFP4: The Domino Effect of Outliers

Under MXFP4's group-shared scaling mechanism, a single activation outlier inflates the scaling factor of its entire 32-element group, compressing the dynamic range left for the normal elements and amplifying their quantization error. LLM activation distributions are long-tailed with sparse outliers, which puts them in structural conflict with MXFP4's fixed grouping strategy.
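The domino effect is easy to demonstrate numerically. The sketch below (illustrative only, not the paper's code) measures the RMS quantization error of a 32-element group with and without one injected outlier; the outlier inflates the shared scale, so the other 31 elements land on a much coarser grid:

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def group_error(x):
    """RMS quantization error of one group under a shared power-of-two scale."""
    scale = 2.0 ** (np.floor(np.log2(np.abs(x).max())) - 2)
    mag = np.clip(np.abs(x) / scale, 0.0, 6.0)
    idx = np.abs(FP4_GRID[None, :] - mag[:, None]).argmin(axis=1)
    xq = np.sign(x) * FP4_GRID[idx] * scale
    return np.sqrt(np.mean((xq - x) ** 2))

rng = np.random.default_rng(1)
normal = rng.normal(scale=0.1, size=32)   # a well-behaved group
outlier = normal.copy()
outlier[0] = 8.0                          # one long-tail activation outlier

# With the outlier present, the shared scale grows to cover 8.0, and most
# small elements round all the way to zero.
```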


Section 04

Limitations of Existing Rotation Schemes: Data-Independent Blindness

Existing rotation schemes (the random Hadamard transform, learnable rotations) share a data-independent flaw: the random Hadamard transform disperses outliers blindly, while learnable rotations optimize for global error rather than for the outlier channels themselves. The result is wasted effort: the entire tensor undergoes a complex transformation just to handle a few outlier channels.
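For intuition about the data-independent baseline, here is a sketch of a randomized Hadamard transform (Sylvester construction, our own helper): it spreads a single outlier channel's energy perfectly evenly across all channels, no matter where the outlier sits, which is exactly the "blind dispersal" criticized above:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester recursion; n a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

x = np.zeros(32)
x[7] = 10.0                                              # one outlier channel
H = hadamard(32)
D = np.sign(np.random.default_rng(2).normal(size=32))    # random sign flips
y = H @ (D * x)                                          # randomized rotation

# Total energy is preserved, but the peak is flattened: every output
# channel ends up with magnitude 10/sqrt(32), ~1.77.
```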


Section 05

DuQuant++ Innovation: Fine-Grained Outlier-Aware Rotation

The core innovation of DuQuant++ is aligning the rotation block with MXFP4's 32-element group size, which simplifies the preprocessing pipeline (no double rotation or zigzag permutation is needed). By identifying the channels where outliers concentrate and constructing rotation matrices that disperse their energy, it optimizes precisely where it matters and cuts the online rotation cost in half. It also smooths the weight distribution, further suppressing quantization error.
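A minimal sketch of the alignment idea, under our own naming (the paper's actual rotation is built from calibration data to target outlier channels; here any orthogonal matrix stands in): rotating each 32-element chunk independently makes the rotation blocks and the MXFP4 scaling groups coincide, while the transform stays norm-preserving and exactly invertible:

```python
import numpy as np

GROUP = 32  # MXFP4 micro-scaling group size, also the rotation block size

def block_rotate(x, R):
    """Rotate a 1-D activation vector group by group with one GROUPxGROUP orthogonal R."""
    assert x.size % GROUP == 0 and R.shape == (GROUP, GROUP)
    return (x.reshape(-1, GROUP) @ R.T).reshape(-1)

rng = np.random.default_rng(3)
# Random orthogonal stand-in; DuQuant++ would construct R from calibration data
R, _ = np.linalg.qr(rng.normal(size=(GROUP, GROUP)))

x = rng.normal(size=128)       # four groups
y = block_rotate(x, R)         # rotation never mixes values across group borders
```

Because the rotation is block-diagonal at exactly the group granularity, its online cost is one small 32×32 multiply per group rather than a transform over the full hidden dimension, which is where the claimed cost reduction comes from.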


Section 06

Experimental Validation: SOTA Performance on LLaMA-3

Under the W4A4 quantization configuration on the LLaMA-3 model family, DuQuant++ achieves SOTA performance. Compared with the original DuQuant, rotation overhead drops by 50% while perplexity and downstream task accuracy improve further, validating the "alignment equals simplification" approach.


Section 07

Engineering Significance and Outlook: A Practical Path for LLM Quantization

DuQuant++ advances LLM quantization toward practicality, adapting to the MXFP4 format of NVIDIA Blackwell and subsequent architectures, making the deployment of high-quality large models at 4-bit precision an engineering reality. The code has been open-sourced, providing a ready-to-use optimization path for LLM deployment in resource-constrained environments without modifying the architecture or retraining.