Zing Forum

DynaQuant: Dynamic Precision Quantization for Large Language Models via Bit-Level Water-Filling Algorithm

DynaQuant proposes an innovative dynamic precision quantization method that uses a bit-level water-filling algorithm to allocate an optimal bit width to each weight matrix. On the Qwen3.5-27B model, it achieves an average of 5.7 bits, a 64% memory reduction, a 2.8× inference speedup, and under 1% quality loss.

Tags: Quantization · Large Language Models · Water-Filling Algorithm · Dynamic Precision · Inference Optimization · Memory Compression · HAWQ · Pareto Optimality
Published 2026-04-12 23:46 · Recent activity 2026-04-12 23:49 · Estimated read: 7 min

Section 01

DynaQuant: Dynamic Precision Quantization Empowers Efficient Deployment of Large Models

DynaQuant proposes an innovative dynamic precision quantization method that uses a bit-level water-filling algorithm to allocate an optimal bit width to each weight matrix. On the Qwen3.5-27B model, it achieves an average of 5.7 bits, a 64% memory reduction, a 2.8× inference speedup, and under 1% quality loss, reaching a Pareto-optimal balance between model quality and deployment efficiency.
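The headline numbers are internally consistent: when weights dominate the memory footprint, memory scales roughly with average bit width. A quick sanity check (illustrative arithmetic only, ignoring quantization metadata such as scales):

```python
bf16_bits = 16.0
avg_bits = 5.7  # DynaQuant's average bit width on Qwen3.5-27B

# Memory scales roughly linearly with average bit width when weights
# dominate the footprint, so the fractional reduction is 1 - b_avg / 16.
reduction = 1.0 - avg_bits / bf16_bits
print(f"memory reduction ≈ {reduction:.0%}")  # ≈ 64%
```

This matches the reported 64% memory reduction.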

Section 02

Background: Memory Bottlenecks in Large Model Inference and Limitations of Traditional Quantization

As large language models scale up, memory consumption during inference becomes a deployment barrier. For example, the Qwen3.5-27B model requires approximately 48.7GB of VRAM in BF16 format, beyond the reach of consumer-grade hardware. Traditional uniform-precision quantization strategies (e.g., all-FP4 or all-FP8) ignore differences in precision sensitivity across layers, leading either to excessive quality loss or to insufficient memory savings.

Section 03

Core Method: Water-Filling Algorithm and Three-Step Technical Implementation

Core Insight

Extra bits spent on different weight matrices yield different marginal improvements in model quality; bits should therefore be allocated where they help most, achieving Pareto optimality. This mirrors the water-filling algorithm from communication theory, which pours transmit power into the channels with the best marginal return.

Three-Step Technical Implementation

  1. Sensitivity Measurement: A HAWQ-V3-style Fisher diagonal approximation, with the indicator sensitivity = h_trace × mean(w²), which correlates at 0.93 with KL divergence. Only one forward and one backward pass are required, so the overhead is low.
  2. Bit Allocation: The water-filling algorithm greedily upgrades bit widths via a max-heap, prioritizing the upgrades with the largest marginal quality improvement per byte of cost. It supports a hardware-native mode (4/8/16 bits) and a full mode (4-16 bits).
  3. Recipe Application: Quantize each weight matrix to its allocated bit width. The current implementation uses software simulation; production deployment requires custom dequantization kernels.
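The allocation step can be sketched as a greedy max-heap loop. This is a minimal illustration, not DynaQuant's implementation: the sensitivity values, the 2^(-2b) quantization-error model, and the helper names are all assumptions.

```python
import heapq

# Simplified per-matrix quantization error: the MSE of a b-bit quantizer
# shrinks roughly as 2^(-2b). This stands in for DynaQuant's real cost
# model, which the writeup does not spell out.
def quant_error(bits):
    return 2.0 ** (-2 * bits)

def allocate_bits(sensitivities, sizes, byte_budget, levels=(4, 8, 16)):
    """Greedy water-filling over bit levels (hardware-native 4/8/16 mode).

    Every matrix starts at the lowest level; a max-heap repeatedly applies
    the upgrade with the best marginal quality gain per byte of cost.
    """
    bits = [levels[0]] * len(sizes)
    used = sum(n * levels[0] / 8 for n in sizes)  # bytes at the floor level
    heap = []

    def push(i, target):
        # Queue the upgrade of matrix i from levels[target-1] to levels[target].
        gain = sensitivities[i] * (
            quant_error(levels[target - 1]) - quant_error(levels[target]))
        cost = sizes[i] * (levels[target] - levels[target - 1]) / 8
        heapq.heappush(heap, (-gain / cost, i, target))  # negate: max-heap

    for i in range(len(sizes)):
        push(i, 1)

    while heap:
        _, i, target = heapq.heappop(heap)
        cost = sizes[i] * (levels[target] - levels[target - 1]) / 8
        if used + cost > byte_budget:
            continue  # skip: a smaller matrix's upgrade may still fit
        bits[i] = levels[target]
        used += cost
        if target + 1 < len(levels):
            push(i, target + 1)
    return bits
```

With two equal-sized matrices and a tight budget, the more sensitive matrix is upgraded first: `allocate_bits([10.0, 1.0], [100, 100], 150.0)` returns `[8, 4]`.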

Section 04

Experimental Evidence: Pareto Front of Quality and Efficiency

Results on Qwen3.5-27B

| Scheme | Average Bits | Memory Usage | Decoding Speedup | Quality Loss (PPL) |
| --- | --- | --- | --- | --- |
| BF16 baseline | 16.0 | 48.7GB | 1.0× | baseline |
| DynaQuant inflection point | 5.7 | 17.4GB | 2.8× | +0.59% |
| Uniform FP4 | 4.0 | 12.2GB | 3.7× | +6.8% |
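The memory column tracks average bit width almost linearly. A quick check against the BF16 baseline (illustrative; the small deviation at 5.7 bits plausibly comes from quantization metadata such as scales):

```python
bf16_gb = 48.7  # BF16 baseline footprint from the table

# Estimate each scheme's memory as the baseline scaled by bit-width ratio.
for name, bits in [("DynaQuant", 5.7), ("Uniform FP4", 4.0)]:
    est_gb = bf16_gb * bits / 16.0
    print(f"{name}: ~{est_gb:.1f} GB")
```

Both estimates land within 0.1 GB of the table's 17.4GB and 12.2GB figures.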

Bit Value Spectrum

The 5-7 bit range is the quantization sweet spot: quality close to FP8 at only 62% of FP8's cost. Below 4 bits, quality degrades sharply; above 12 bits, the extra precision is wasted.

Cross-Scale Validation

On downstream tasks (arc_easy + piqa), quantized models at the 4B, 27B, and 35B-MoE scales show no significant difference from the BF16 baseline, maintaining equivalent quality.

Section 05

Key Research Findings and Conclusions

  1. h_trace × mean(w²) is the best sensitivity indicator tested, outperforming HAWQ-V3's Σ(H_i · w_i²).
  2. Rotation does not help NVFP4 per-group quantization: its non-uniform binning already adapts to Gaussian weight distributions.
  3. Refinement iterations are unnecessary: the correlation between the initial and refined HAWQ rankings reaches 0.998.
  4. The Pareto inflection point for Qwen3.5-27B lies at an average of 5.7 bits, with most matrices allocated 5-6 bits.
  5. MoE models quantize as well as or better than dense models; the 35B-A3B MoE runs at only 37% of the BF16 cost.
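Finding 1 contrasts two indicators computable from the same one-pass statistics. A toy comparison (the random weights and gradients are placeholders, and approximating the Fisher diagonal by squared gradients is a standard shortcut, not necessarily DynaQuant's exact estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4096)  # flattened weights of one matrix (toy data)
g = rng.normal(size=4096)  # per-weight gradients from one backward pass (toy data)

h_diag = g ** 2            # Fisher diagonal approximation of the Hessian diagonal
h_trace = h_diag.sum()

# DynaQuant's decoupled indicator: Hessian trace times mean squared weight.
dynaquant = h_trace * np.mean(w ** 2)

# HAWQ-V3's coupled indicator: elementwise Hessian-weighted squared weights.
hawq_v3 = np.sum(h_diag * w ** 2)

print(dynaquant, hawq_v3)
```

The decoupled form drops the per-element pairing between Hessian entries and weights, which is what makes it cheap to rank whole matrices with.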

Section 06

Application Prospects and Project Roadmap

Practical Significance

  • Consumer-grade hardware: High-end consumer GPUs can run models previously requiring professional accelerators.
  • Edge devices: Smaller memory footprint promotes the migration of large models to edge devices.
  • Cost optimization: Cloud service providers reduce hardware costs.
  • Energy efficiency: Reduced memory bandwidth requirements lower power consumption.

Project Roadmap

  • Completed: HAWQ measurement pipeline, Pareto allocator, recipe materialization, GPU dequantization prototype, bit packing tool.
  • In Progress: fused dequantization + matrix-multiplication kernels; an on-disk packed weight format.
  • Planned: vLLM QuantizationMethod plugin.