Zing Forum

FRM-PTQ: A New Low-Bit Large Model Quantization Method Enhanced by Feature Relationship Matching

The FRM-PTQ framework proposed by the research team from Harbin Institute of Technology (Shenzhen) achieves near-full-precision inference performance in W4A4 low-bit scenarios through feature relationship matching and multi-granularity group quantization techniques. It also brings a 2x throughput improvement and 3.17x memory compression, and is particularly suitable for new-generation models such as LLaMA-3 and Qwen2.5.

Tags: Large Model Quantization · Post-Training Quantization (PTQ) · Feature Relationship Matching · Low-Bit Inference · LLaMA · Qwen · Model Compression
Published 2026-04-04 01:13 · Recent activity 2026-04-04 01:20 · Estimated read: 6 min

Section 01

Introduction

The research team from Harbin Institute of Technology (Shenzhen) proposed the FRM-PTQ framework. Through feature relationship matching and multi-granularity group quantization techniques, it achieves near-full-precision inference performance in W4A4 low-bit scenarios, while bringing a 2x throughput improvement and 3.17x memory compression. It is particularly suitable for new-generation models such as LLaMA-3 and Qwen2.5.


Section 02

Research Background and Challenges

The inference cost of large language models is a key bottleneck for large-scale deployment. Post-training quantization (PTQ) is an effective way to reduce memory usage and compute requirements, but existing PTQ methods suffer severe performance degradation at ultra-low bit widths (4 bits and below), especially on new-generation models such as LLaMA-3. Traditional PTQ relies on a mean squared error (MSE) loss, which measures only point-to-point numerical differences and ignores the structural relationships in the high-dimensional feature space, degrading the representational capacity of quantized models.
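The gap between point-wise and relational error can be illustrated with a minimal NumPy sketch (toy data and an illustrative Gram-matrix error, my own construction rather than the paper's actual loss): two perturbations with identical element-wise magnitude, and hence identical MSE, can distort the token-level relationship matrix by very different amounts.

```python
import numpy as np

# Toy full-precision token features: 4 tokens, 8-dim hidden states.
X = np.arange(32, dtype=np.float64).reshape(4, 8) / 10.0

def gram(F):
    """Token-level relationship matrix: pairwise token inner products."""
    return F @ F.T

# Two perturbations with identical per-element magnitude (0.05), so both
# produce exactly the same MSE against X.
shift = 0.05 * np.ones_like(X)                       # uniform shift
signs = np.array([(-1) ** (i + j) for i in range(4) for j in range(8)])
alt = 0.05 * signs.reshape(4, 8).astype(np.float64)  # sign-alternating shift

for name, Xq in [("uniform shift", X + shift), ("alternating shift", X + alt)]:
    mse = np.mean((X - Xq) ** 2)
    rel = np.linalg.norm(gram(X) - gram(Xq)) / np.linalg.norm(gram(X))
    print(f"{name}: MSE={mse:.4f}  relational error={rel:.5f}")
```

Both perturbations report MSE = 0.0025, but their relational errors differ by more than an order of magnitude, which is exactly the signal an MSE-only objective cannot see.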


Section 03

Core Innovations of FRM-PTQ

FRM-PTQ has two core innovations:

1. Feature Relationship Matching Mechanism: token-level relationship modeling, which captures the mutual relationships between sequence tokens, and structure-level distribution alignment, which uses intra-block self-distillation to align the feature distributions of quantized blocks with their full-precision counterparts.
2. Multi-Granularity Group Quantization: identifies sensitive groups and robust groups through kurtosis analysis, applies differentiated quantization strategies to each, and improves efficiency with custom CUDA kernels.
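The kurtosis-based grouping idea can be sketched as follows (function names and the threshold are my own illustration, not the paper's code): groups whose values are heavy-tailed, i.e. have high excess kurtosis and thus likely outliers, are flagged as sensitive; light-tailed groups are robust.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    x = np.asarray(x, dtype=np.float64)
    mu, sigma = x.mean(), x.std()
    return float(np.mean(((x - mu) / sigma) ** 4) - 3.0)

def classify_groups(weight, group_size=128, threshold=1.0):
    """Split a flattened weight into groups; flag heavy-tailed groups
    (excess kurtosis above threshold) as sensitive, the rest as robust."""
    flat = weight.reshape(-1)
    n_groups = flat.size // group_size
    groups = flat[: n_groups * group_size].reshape(n_groups, group_size)
    k = np.array([excess_kurtosis(g) for g in groups])
    sensitive = np.where(k > threshold)[0]
    robust = np.where(k <= threshold)[0]
    return sensitive, robust

# Demo: a near-uniform group (light-tailed, excess kurtosis ~ -1.2) vs. a
# group that is mostly small values with a few large outliers (heavy-tailed).
smooth = np.linspace(-1.0, 1.0, 128)
spiky = np.concatenate([np.tile([0.1, -0.1], 62), [5.0, -5.0, 5.0, -5.0]])
W = np.stack([smooth, spiky])
sens, rob = classify_groups(W, group_size=128, threshold=1.0)
print("sensitive groups:", sens)  # → sensitive groups: [1]
```

The differentiated strategy then spends extra precision (or finer grouping) only on the sensitive groups, keeping the average bit width low.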


Section 04

Experimental Results and Performance Analysis

In the W4A4 setting, perplexity (PPL) stays close to the full-precision baseline, throughput improves by 2x, and memory is compressed by 3.17x; the method performs well on new-generation models such as LLaMA-3 and Qwen2.5, and remains usable even in the extreme W3A3 setting. The team provides pre-quantized models for LLaMA-2-13B (W2A16) and LLaMA-3-8B (W3A3) to ensure research reproducibility.


Section 05

Technical Implementation Details

The usage process has three steps: environment preparation (install dependencies in a Python 3.11 conda environment), sensitivity analysis (run the kurtosis-calculation script to identify sensitive and robust groups), and model quantization (execute the main script, which supports configurations such as W4A16 and W4A4 and lets users specify grouping strategies, calibration datasets, and training parameters). Configuration options include weight/activation bit widths, group size, sensitive-group designation, calibration dataset selection, and hyperparameters.
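The differentiated per-group strategy behind the quantization step can be sketched as a simplified quantize-dequantize simulation (function names, bit allocations, and group handling are my own assumptions, not FRM-PTQ's actual scripts or kernels): robust groups get the low default bit width, while sensitive groups identified during the analysis step keep more bits.

```python
import numpy as np

def quantize_group(g, bits):
    """Asymmetric uniform quantization of one group (quantize + dequantize)."""
    qmax = 2 ** bits - 1
    lo, hi = g.min(), g.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((g - lo) / scale), 0, qmax)
    return q * scale + lo

def quantize_weight(w, group_size, sensitive, bits=4, sensitive_bits=8):
    """Per-group fake quantization: sensitive groups keep more bits."""
    flat = w.reshape(-1).copy()
    n_groups = flat.size // group_size
    for i in range(n_groups):
        sl = slice(i * group_size, (i + 1) * group_size)
        b = sensitive_bits if i in sensitive else bits
        flat[sl] = quantize_group(flat[sl], b)
    return flat.reshape(w.shape)

# Demo: group 1 is treated as sensitive and quantized at 8 bits instead of 4.
w = np.linspace(-1.0, 1.0, 256).reshape(2, 128)
wq = quantize_weight(w, group_size=128, sensitive={1})
err = np.abs(w - wq).mean(axis=1)
print("mean abs error per group:", err)
```

A real W4A4 pipeline would additionally quantize activations on the fly and run the matmuls in custom CUDA kernels; this sketch only simulates the weight rounding error that the bit allocation controls.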


Section 06

Academic Contributions and Impact

The work was published in Neural Networks (2026) and is built on open-source projects such as EfficientQAT and GPTQ. Its theoretical contribution is the idea of feature relationship matching, which offers a new perspective for PTQ design: quantization is a re-encoding of feature representations, so it should preserve feature relationships in high-dimensional space rather than merely minimize point-wise errors.


Section 07

Practical Application Value and Open Source Community

Application value: edge-device deployment (a model that originally required 24 GB can run in 8 GB of VRAM), inference cost optimization (the 2x throughput gain lowers serving costs), and adaptation to new models (LLaMA-3, Qwen2.5). Open source: released under the Apache License 2.0, with code and pre-trained models publicly available to promote technology dissemination and follow-up research.