Zing Forum


DuQuant++: A New Fine-Grained Rotational Quantization Method for MXFP4 Micro-Scaling

Researchers propose the DuQuant++ method to address the activation outlier problem in the MXFP4 format. By using single-round outlier-aware rotation, it achieves more efficient W4A4 quantization and reaches SOTA performance on the LLaMA-3 model.

Tags: model quantization, MXFP4, DuQuant, low-precision inference, activation outliers, LLaMA-3, NVIDIA Blackwell
Published 2026-04-20 12:27 | Recent activity 2026-04-22 12:37 | Estimated read 3 min

Section 01

DuQuant++: A New Fine-Grained Rotational Quantization Method to Solve MXFP4 Activation Outliers (Introduction)

Researchers propose the DuQuant++ method to address the activation outlier problem in the MXFP4 format. Using single-round outlier-aware rotation, it achieves more efficient W4A4 quantization, reaches SOTA performance on the LLaMA-3 model, halves online computation cost, and is compatible with the NVIDIA Blackwell architecture.


Section 02

Background: Quantization Inference and Challenges of MXFP4

Deploying large models is constrained by memory and compute, making quantization a key enabling technology. However, the MXFP4 format (in which 32-element blocks share one scaling factor; natively supported by Blackwell) suffers from an activation outlier problem: a single outlier forces the block's shared scaling factor up, squeezing the dynamic range left for the other 31 elements.
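To make the outlier problem concrete, here is a minimal NumPy sketch of MXFP4-style block quantization: a 32-element block shares one scale over the FP4 (E2M1) magnitude grid. The function name and the plain float scale are my simplifications; real MXFP4 uses a power-of-two (E8M0) shared scale, but the squeezing effect is the same.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; the shared per-block scale maps the
# block's largest element onto the top grid value (6.0).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(block):
    """Quantize one block with a single shared scale (illustrative sketch)."""
    scale = np.abs(block).max() / FP4_GRID[-1]
    mags = np.abs(block) / scale
    # round each magnitude to the nearest FP4 grid point, keep the sign
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.1, 32)   # typical small activations
spiked = block.copy()
spiked[0] = 10.0                   # one outlier inflates the shared scale
# With the outlier, the finest nonzero grid step becomes 0.5 * (10/6) ~ 0.83,
# so the small elements are mostly rounded to zero.
```

Running `mxfp4_quantize` on both blocks shows the mean error on the 31 small elements is far larger once the outlier is present, even though the outlier itself is represented well.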


Section 03

Limitations of Existing Rotation Schemes

Existing rotation methods have flaws: random Hadamard rotation is data-agnostic, which limits its effectiveness; learnable rotation requires additional training and its generalization is questionable. Neither exploits information about the outlier distribution.
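For reference, a random Hadamard rotation can be sketched in a few lines. It is exactly orthogonal (norm-preserving) and spreads an outlier evenly across the block, but the same matrix is applied no matter where outliers actually occur, which is the data-agnosticism noted above. The Sylvester construction is standard; the variable names are mine.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 32
rng = np.random.default_rng(0)
D = np.diag(rng.choice([-1.0, 1.0], size=n))  # random sign flips
R = hadamard(n) @ D / np.sqrt(n)              # orthogonal rotation matrix

x = np.zeros(n)
x[0] = 8.0        # a single activation outlier in channel 0
y = R @ x         # after rotation, the outlier's mass is spread evenly
```

Because `R` is orthogonal, the vector norm is unchanged while the peak magnitude drops from 8.0 to 8/sqrt(32); the limitation is that this flattening is blind to which channels are actually outlier-prone.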


Section 04

Core Innovations of DuQuant++

  1. Block size aligned with MXFP4's 32-element groups.
  2. A single-round outlier-aware rotation replacing the previous two-round process.
  3. Rotation matrices constructed from activation statistics, dispersing outliers precisely while preserving orthogonality.
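The paper's actual rotation construction is not reproduced here. As a hedged illustration of point 3 — an orthogonal rotation built from activation statistics — the sketch below uses a Householder reflection that sends the axis of a detected outlier channel onto the uniform direction, spreading that channel's mass evenly across the 32-element block while remaining exactly orthogonal. The function name and channel-selection step are hypothetical, not DuQuant++'s algorithm.

```python
import numpy as np

def outlier_aware_rotation(n, k):
    """Orthogonal matrix mapping axis e_k onto the uniform direction 1/sqrt(n)
    via a Householder reflection (illustrative, not the paper's construction)."""
    e = np.zeros(n); e[k] = 1.0
    u = np.ones(n) / np.sqrt(n)
    v = e - u
    v /= np.linalg.norm(v)
    return np.eye(n) - 2.0 * np.outer(v, v)  # reflection swapping e_k and u

n = 32
acts = np.zeros(n)
acts[5] = 8.0                    # suppose channel 5 carries the outlier
k = int(np.abs(acts).argmax())   # "statistics": pick the outlier channel
R = outlier_aware_rotation(n, k)
rotated = R @ acts               # entries now all have magnitude 8/sqrt(32)
```

Unlike a random Hadamard matrix, this rotation is chosen from the data: the channel that actually holds the outlier is targeted, which is the intuition behind building rotation matrices from activation statistics.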

Section 05

Efficiency Advantages and Experimental Validation

Single-round rotation halves the online computation cost; under W4A4 quantization of LLaMA-3, DuQuant++ outperforms baselines on multiple tasks, including commonsense reasoning and code generation, reaching SOTA levels.


Section 06

Hardware Coordination and Practical Insights

The method is compatible with the NVIDIA Blackwell architecture, which natively supports MXFP4. Practical takeaways: MXFP4 is a good fit for Blackwell hardware, outlier handling is the key to low-bit quantization, and algorithms should align with the format's grouping structure.


Section 07

Future Directions

Extend to other low-precision formats, combine with techniques such as smoothing and clipping, explore more aggressive configurations such as W2A2/W3A3, and develop hardware-friendly rotation implementations.