Zing Forum

CSAQ Quantization Framework: Protecting Large Model Reasoning Ability with Causal Salience Scoring

CSAQ is a post-training quantization method that identifies critical weights using causal importance scores (gradient × weight). It preserves model reasoning ability under 4-bit quantization and addresses the finding that roughly 80% of critical weights are incorrectly quantized by activation-magnitude methods like AWQ.

Tags: Quantization · LLM · Model Compression · Causal Salience · AWQ · 4-bit Quantization · Inference Optimization · Edge Deployment
Published 2026-04-05 21:44 · Recent activity 2026-04-05 21:47 · Estimated read 6 min

Section 01

Introduction / Main Post

CSAQ is a post-training quantization method that identifies critical weights using causal importance scores (gradient × weight). It preserves model reasoning ability under 4-bit quantization and addresses the finding that roughly 80% of critical weights are incorrectly quantized by activation-magnitude methods like AWQ.

Section 02

Background: The Dilemma of Quantization Technology

The deployment cost of large language models (LLMs) is a core challenge in AI engineering. As parameter counts grow from billions toward trillions, the memory and compute required for inference grow in step. Quantization—compressing model weights from high-precision floating point (FP32/FP16) to low-precision integers (INT8/INT4)—has become an essential path to reducing deployment cost.

However, traditional quantization methods face a fundamental trade-off: the higher the compression rate, the greater the performance loss. Existing methods like AWQ use activation magnitude as a proxy for weight importance, but studies show this proxy agrees with true causal salience only about 20% of the time. In other words, under 4-bit quantization roughly 80% of the truly critical weights are mistakenly given the aggressive quantization treatment.

Section 03

Core Innovations of CSAQ

CSAQ (Causal Salience Quantization) proposes a new quantization paradigm. Instead of relying on the coarse proxy of activation magnitude, it uses causal salience scores (gradient × weight) to accurately identify which weights truly matter for model reasoning.

Section 04

Mathematical Foundation of Causal Salience Scores

CSAQ's core insight comes from first-order Taylor approximation. For each weight, it calculates |grad × weight|—the change in the loss function when the weight is set to zero. This is a true causal measure, not an indirect proxy. Specifically, during N forward + backward propagation steps, CSAQ accumulates the product of each weight's gradient and the weight itself to obtain the true impact of the weight on the model output.
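In symbols (notation ours, reconstructed from the description above): zeroing a weight w_i is the perturbation Δw_i = −w_i, so a first-order Taylor expansion of the loss gives the per-weight salience, accumulated over the N calibration steps:

```latex
\Delta\mathcal{L}\big|_{w_i \to 0} \;\approx\; \left|\frac{\partial\mathcal{L}}{\partial w_i}\, w_i\right|,
\qquad
s_i \;=\; \frac{1}{N}\sum_{n=1}^{N}\left|\, g_i^{(n)}\, w_i \right|
```

where g_i^(n) is the gradient of the loss with respect to w_i at calibration step n.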

The theoretical advantage of this method is that it directly measures the weight's contribution to the loss function, rather than assuming that larger-magnitude weights are necessarily more important. In practice, many small-magnitude weights that are critical to specific reasoning paths can be identified and protected.

Section 05

Three-Stage Quantization Process

CSAQ's quantization process is divided into three distinct stages, all completed offline (they need to run only once, before deployment):

Section 06

Stage 1: Causal Salience Analysis

Run N forward + backward propagation steps on the calibration dataset to calculate the |grad × weight| value for each weight. Although this process is computationally intensive, it only needs to be executed once, and a small calibration set (64 samples recommended) can be used to obtain stable salience estimates.
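As a minimal sketch of this accumulation, here is a toy single-layer stand-in (the layer shape, squared-error loss, and random calibration data are all hypothetical — CSAQ runs full forward/backward passes over a real LLM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one layer: y = W @ x with squared-error loss.
W = rng.normal(size=(8, 16))
calib_x = rng.normal(size=(64, 16))   # 64 calibration samples, as recommended
calib_t = rng.normal(size=(64, 8))    # toy regression targets

salience = np.zeros_like(W)
for x, t in zip(calib_x, calib_t):
    y = W @ x                          # forward pass
    grad = np.outer(2.0 * (y - t), x)  # backward pass: dL/dW for L = ||y - t||^2
    salience += np.abs(grad * W)       # accumulate |grad x weight|
salience /= len(calib_x)               # average into a stable salience estimate
```

Note the key property: a small-magnitude weight can still rank high if its gradient is consistently large, which is exactly what an activation-magnitude proxy misses.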

Section 07

Stage 2: Bit Budget Solver

CSAQ uses binary search to iterate over salience thresholds to find an FP16/INT8/INT4 allocation scheme that achieves the target bit width (e.g., exactly 4.000 bits). This step ensures that CSAQ's results can be fairly compared with methods like AWQ and GPTQ under the same memory footprint.
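A simplified two-tier variant of this search can be sketched as follows (hypothetical: only FP16 vs INT4, and the target of 4.25 bits is illustrative — CSAQ's actual solver splits three ways across FP16/INT8/INT4 with the same threshold-search idea):

```python
import numpy as np

def solve_threshold(salience, target_bits, hi_bits=16, lo_bits=4, iters=60):
    """Binary-search a salience threshold so that keeping every weight above
    it at hi_bits and quantizing the rest at lo_bits averages target_bits."""
    s = salience.ravel()
    lo_t, hi_t = float(s.min()), float(s.max())
    for _ in range(iters):
        t = 0.5 * (lo_t + hi_t)
        frac_hi = float((s > t).mean())  # fraction kept at high precision
        avg = frac_hi * hi_bits + (1.0 - frac_hi) * lo_bits
        if avg > target_bits:
            lo_t = t   # too many protected weights: raise the bar
        else:
            hi_t = t   # budget underused: lower the bar
    return t

salience = np.random.default_rng(0).random(100_000)
t = solve_threshold(salience, target_bits=4.25)
frac = (salience > t).mean()
avg = frac * 16 + (1 - frac) * 4   # converges to the 4.25-bit target
```

The search works because the average bit width is monotone in the threshold: raising it protects fewer weights and lowers the average, so bisection converges to the exact budget.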

Section 08

Stage 3: Hierarchical Quantization Application

Based on the solver's results, CSAQ applies a differentiated quantization strategy to each weight element:

  • Top ~5% (sorted by causal salience) → Keep FP16 precision, zero quantization loss
  • Next ~20% → Use INT8, minimal loss
  • Bottom ~75% → Use INT4 for aggressive compression, but these weights have little impact on model performance

The ingenuity of this hierarchical strategy is that it concentrates the limited precision budget on the weights that truly matter, while aggressively compressing the large majority that do not.
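The tier assignment above can be sketched in a few lines (the 5%/20% fractions mirror the post's rough split; in CSAQ the exact cutoffs come from the bit-budget solver):

```python
import numpy as np

def assign_tiers(salience, fp16_frac=0.05, int8_frac=0.20):
    """Assign a bit width to every weight by causal-salience rank."""
    s = salience.ravel()
    order = np.argsort(-s)                   # most salient weights first
    n = s.size
    bits = np.full(n, 4, dtype=np.int8)      # bottom tier: INT4
    n_fp16 = int(n * fp16_frac)
    n_int8 = int(n * int8_frac)
    bits[order[:n_fp16]] = 16                # top tier: keep FP16
    bits[order[n_fp16:n_fp16 + n_int8]] = 8  # middle tier: INT8
    return bits.reshape(salience.shape)

salience = np.random.default_rng(1).random((100, 100))
bits = assign_tiers(salience)
avg_bits = bits.mean()  # 0.05*16 + 0.20*8 + 0.75*4 = 5.4 with these fractions
```

As the comment shows, these illustrative fractions average 5.4 bits per weight; hitting a tighter budget is exactly what the solver's threshold search in Stage 2 adjusts the fractions for.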