Zing Forum

DynaQuant: Dynamic Precision Quantization for Large Language Models via Bit-Level Water-Filling Algorithm

DynaQuant proposes an innovative dynamic precision quantization method that uses a bit-level water-filling algorithm to allocate an optimal bit width to each weight matrix. On the Qwen3.5-27B model, it achieves an average of 5.7 bits, a 64% memory reduction, a 2.8× inference speedup, and under 1% quality loss.

Tags: Quantization · Large Language Models · Water-Filling Algorithm · Dynamic Precision · Inference Optimization · Memory Compression · HAWQ · Pareto Optimality
Published 2026-04-12 23:46 · Recent activity 2026-04-12 23:49 · Estimated read: 7 min

Section 01

DynaQuant: Dynamic Precision Quantization Empowers Efficient Deployment of Large Models

DynaQuant proposes an innovative dynamic precision quantization method that uses a bit-level water-filling algorithm to allocate an optimal bit width to each weight matrix. On the Qwen3.5-27B model, it achieves an average of 5.7 bits, a 64% memory reduction, a 2.8× inference speedup, and under 1% quality loss, reaching a Pareto-optimal balance between model quality and deployment efficiency.
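The headline numbers are internally consistent: when weights dominate the memory footprint, memory scales roughly with average bit width. A quick sanity check (illustrative arithmetic only, ignoring quantization metadata such as scales):

```python
bf16_bits = 16.0
avg_bits = 5.7  # DynaQuant's average bit width on Qwen3.5-27B

# Memory scales roughly linearly with average bit width when weights
# dominate the footprint, so the fractional reduction is 1 - b_avg / 16.
reduction = 1.0 - avg_bits / bf16_bits
print(f"memory reduction ≈ {reduction:.0%}")  # ≈ 64%
```

This matches the reported 64% memory reduction.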

Section 02

Background: Memory Bottlenecks in Large Model Inference and Limitations of Traditional Quantization

As large language models scale up, memory consumption during inference becomes a deployment barrier. For example, the Qwen3.5-27B model requires approximately 48.7GB of VRAM in BF16 format, beyond the reach of consumer-grade hardware. Traditional uniform-precision quantization strategies (e.g., all-FP4 or all-FP8) ignore differences in precision sensitivity across layers, leading either to excessive quality loss or to insufficient memory savings.

Section 03

Core Method: Water-Filling Algorithm and Three-Step Technical Implementation

Core Insight

Extra bits spent on different weight matrices yield different marginal improvements in model quality; bits should therefore be allocated where they help most, achieving Pareto optimality. This mirrors the water-filling algorithm from communication theory, which pours transmit power into the channels with the best marginal return.

Three-Step Technical Implementation

  1. Sensitivity Measurement: A HAWQ-V3-style Fisher diagonal approximation, with the indicator sensitivity = h_trace × mean(w²), which correlates at 0.93 with KL divergence. Only one forward and one backward pass are required, so the overhead is low.
  2. Bit Allocation: The water-filling algorithm greedily upgrades bit widths via a max-heap, prioritizing the upgrades with the largest marginal quality improvement per byte of cost. It supports a hardware-native mode (4/8/16 bits) and a full mode (4-16 bits).
  3. Recipe Application: Quantize each weight matrix to its allocated bit width. The current implementation uses software simulation; production deployment requires custom dequantization kernels.
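The allocation step can be sketched as a greedy max-heap loop. This is a minimal illustration, not DynaQuant's implementation: the sensitivity values, the 2^(-2b) quantization-error model, and the helper names are all assumptions.

```python
import heapq

# Simplified per-matrix quantization error: the MSE of a b-bit quantizer
# shrinks roughly as 2^(-2b). This stands in for DynaQuant's real cost
# model, which the writeup does not spell out.
def quant_error(bits):
    return 2.0 ** (-2 * bits)

def allocate_bits(sensitivities, sizes, byte_budget, levels=(4, 8, 16)):
    """Greedy water-filling over bit levels (hardware-native 4/8/16 mode).

    Every matrix starts at the lowest level; a max-heap repeatedly applies
    the upgrade with the best marginal quality gain per byte of cost.
    """
    bits = [levels[0]] * len(sizes)
    used = sum(n * levels[0] / 8 for n in sizes)  # bytes at the floor level
    heap = []

    def push(i, target):
        # Queue the upgrade of matrix i from levels[target-1] to levels[target].
        gain = sensitivities[i] * (
            quant_error(levels[target - 1]) - quant_error(levels[target]))
        cost = sizes[i] * (levels[target] - levels[target - 1]) / 8
        heapq.heappush(heap, (-gain / cost, i, target))  # negate: max-heap

    for i in range(len(sizes)):
        push(i, 1)

    while heap:
        _, i, target = heapq.heappop(heap)
        cost = sizes[i] * (levels[target] - levels[target - 1]) / 8
        if used + cost > byte_budget:
            continue  # skip: a smaller matrix's upgrade may still fit
        bits[i] = levels[target]
        used += cost
        if target + 1 < len(levels):
            push(i, target + 1)
    return bits
```

With two equal-sized matrices and a tight budget, the more sensitive matrix is upgraded first: `allocate_bits([10.0, 1.0], [100, 100], 150.0)` returns `[8, 4]`.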

Section 04

Experimental Evidence: Pareto Front of Quality and Efficiency

Results on Qwen3.5-27B

| Scheme | Average Bits | Memory Usage | Decoding Speedup | Quality Loss (PPL) |
| --- | --- | --- | --- | --- |
| BF16 baseline | 16.0 | 48.7GB | 1.0× | baseline |
| DynaQuant inflection point | 5.7 | 17.4GB | 2.8× | +0.59% |
| Uniform FP4 | 4.0 | 12.2GB | 3.7× | +6.8% |
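The memory column tracks average bit width almost linearly. A quick check against the BF16 baseline (illustrative; the small deviation at 5.7 bits plausibly comes from quantization metadata such as scales):

```python
bf16_gb = 48.7  # BF16 baseline footprint from the table

# Estimate each scheme's memory as the baseline scaled by bit-width ratio.
for name, bits in [("DynaQuant", 5.7), ("Uniform FP4", 4.0)]:
    est_gb = bf16_gb * bits / 16.0
    print(f"{name}: ~{est_gb:.1f} GB")
```

Both estimates land within 0.1 GB of the table's 17.4GB and 12.2GB figures.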

Bit Value Spectrum

The 5-7 bit range is the quantization sweet spot: quality close to FP8 at only 62% of FP8's cost. Below 4 bits, quality degrades sharply; above 12 bits, the extra precision is wasted.

Cross-Scale Validation

On downstream tasks (arc_easy + piqa), quantized models at the 4B, 27B, and 35B-MoE scales show no significant difference from the BF16 baseline, maintaining equivalent quality.

Section 05

Key Research Findings and Conclusions

  1. h_trace × mean(w²) is the best sensitivity indicator tested, outperforming HAWQ-V3's Σ(H_i · w_i²).
  2. Rotation does not help NVFP4 per-group quantization: its non-uniform binning already adapts to Gaussian weight distributions.
  3. Refinement iterations are unnecessary: the correlation between the initial and refined HAWQ rankings reaches 0.998.
  4. The Pareto inflection point for Qwen3.5-27B lies at an average of 5.7 bits, with most matrices allocated 5-6 bits.
  5. MoE models quantize as well as or better than dense models; the 35B-A3B MoE runs at only 37% of the BF16 cost.
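Finding 1 contrasts two indicators computable from the same one-pass statistics. A toy comparison (the random weights and gradients are placeholders, and approximating the Fisher diagonal by squared gradients is a standard shortcut, not necessarily DynaQuant's exact estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4096)  # flattened weights of one matrix (toy data)
g = rng.normal(size=4096)  # per-weight gradients from one backward pass (toy data)

h_diag = g ** 2            # Fisher diagonal approximation of the Hessian diagonal
h_trace = h_diag.sum()

# DynaQuant's decoupled indicator: Hessian trace times mean squared weight.
dynaquant = h_trace * np.mean(w ** 2)

# HAWQ-V3's coupled indicator: elementwise Hessian-weighted squared weights.
hawq_v3 = np.sum(h_diag * w ** 2)

print(dynaquant, hawq_v3)
```

The decoupled form drops the per-element pairing between Hessian entries and weights, which is what makes it cheap to rank whole matrices with.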

Section 06

Application Prospects and Project Roadmap

Practical Significance

  • Consumer-grade hardware: High-end consumer GPUs can run models previously requiring professional accelerators.
  • Edge devices: Smaller memory footprint promotes the migration of large models to edge devices.
  • Cost optimization: Cloud service providers reduce hardware costs.
  • Energy efficiency: Reduced memory bandwidth requirements lower power consumption.

Project Roadmap

  • Completed: HAWQ measurement pipeline, Pareto allocator, recipe materialization, GPU dequantization prototype, bit packing tool.
  • In Progress: fused dequantization + matrix-multiplication kernels; an on-disk packed weight format.
  • Planned: vLLM QuantizationMethod plugin.