XFP: Quality Target-Oriented Adaptive Codebook Quantization and Sparse Outlier Separation Technology

XFP is a dynamic weight quantizer that reverses the traditional workflow—allowing operators to specify a lower bound for reconstruction quality, while the system automatically determines codebook size, outlier budget, and layer packaging strategy, without the need for Hessian matrices, calibration data, or manual bit-width selection.

Tags: LLM Quantization · Weight Quantization · Codebook Quantization · Sparse Outliers · Adaptive Quantization · Inference Acceleration · MoE Models · Quality Targets
Published 2026-05-14 21:52 · Recent activity 2026-05-15 10:52 · Estimated read: 7 min

Section 01

[Introduction] XFP: Quality-Driven Adaptive LLM Quantization Technology

XFP is a quality target-oriented adaptive codebook quantization and sparse outlier separation technology. By reversing the traditional quantization workflow, operators specify a lower bound for reconstruction quality, and the system automatically determines codebook size, outlier budget, and layer packaging strategy—without Hessian matrices, calibration data, or manual bit-width selection—providing a more intuitive and reliable quantization solution for LLM deployment.


Section 02

Background: Traditional Dilemmas in LLM Quantization

Large Language Models (LLMs) face memory and computational challenges in inference deployment. Quantization is a key optimization technique, but traditional methods have the following limitations:

  • Require Hessian matrices: Dependent on second-order information, leading to high computational costs
  • Dependent on calibration data: Need representative datasets to search for quantization parameters
  • Manual bit-width selection: Operators must manually select bit-widths for different layers
  • Fixed configurations: Cannot adaptively adjust based on model characteristics

Section 03

Core Innovations of XFP: Quality-Driven Workflow Reversal and Layered Objectives

Reversing the Traditional Workflow

Traditional method: operator selects a bit-width → system performs quantization → result is accepted as-is.
XFP method: operator specifies a quality lower bound → system automatically determines the configuration → quality is guaranteed.
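To make the reversal concrete, here is a minimal sketch of what "the system determines the configuration" could look like: candidate configurations are tried from cheapest to most expensive, and the first one that clears the operator's quality floor is kept. Round-to-nearest uniform quantization and a whole-matrix cosine score stand in for XFP's codebook quantizer and per-channel metric; the function and parameter names are illustrative, not from the source.

```python
import numpy as np

def quality(W: np.ndarray, W_hat: np.ndarray) -> float:
    """Whole-matrix cosine similarity; a stand-in for XFP's per-channel metric."""
    num = float(np.sum(W * W_hat))
    den = float(np.linalg.norm(W) * np.linalg.norm(W_hat)) + 1e-12
    return num / den

def cheapest_config_meeting_floor(W: np.ndarray, floor: float, bit_options=(2, 3, 4)):
    """Try configurations from cheapest to most expensive and keep the first
    one that clears the quality floor. Round-to-nearest uniform quantization
    stands in for XFP's codebook quantizer."""
    for bits in bit_options:
        qmax = 2 ** (bits - 1) - 1
        scale = float(np.max(np.abs(W))) / qmax
        W_hat = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
        if quality(W, W_hat) >= floor:
            return bits, W_hat
    return None, W  # nothing passes: the caller keeps this layer in FP16

# Illustrative usage: the smallest bit-width that keeps cosine >= 0.98 is chosen.
W = np.random.randn(64, 64).astype(np.float32)
bits, W_hat = cheapest_config_meeting_floor(W, floor=0.98)
print("chosen bit-width:", bits)
```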

Layered Quality Objective Definition

XFP uses per-channel cosine similarity as the quality metric and sets two types of lower bounds:

  1. Strict lower bound: For attention layers and shared experts
  2. Loose lower bound: For MoE routed experts

This split reflects how sensitive different components are to overall model performance, as sketched below.
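A minimal sketch of the metric and the layered floors, assuming the per-channel cosine is taken over the output channels (rows) of a weight matrix and that a layer is accepted only when every channel clears its floor; the threshold values and the layer-name rule are illustrative assumptions, not taken from the source.

```python
import numpy as np

STRICT_FLOOR = 0.999   # attention layers and shared experts (illustrative value)
LOOSE_FLOOR = 0.995    # MoE routed experts (illustrative value)

def per_channel_cosine(W: np.ndarray, W_hat: np.ndarray) -> np.ndarray:
    """Cosine similarity between each output channel (row) of the original
    and the reconstructed weight matrix."""
    num = np.sum(W * W_hat, axis=1)
    den = np.linalg.norm(W, axis=1) * np.linalg.norm(W_hat, axis=1) + 1e-12
    return num / den

def floor_for(layer_name: str) -> float:
    """Hypothetical assignment: routed experts get the loose floor,
    everything else the strict one."""
    return LOOSE_FLOOR if "routed_expert" in layer_name else STRICT_FLOOR

def layer_passes(layer_name: str, W: np.ndarray, W_hat: np.ndarray) -> bool:
    # Every channel must clear the layer's floor for the layer to be accepted.
    return bool(np.min(per_channel_cosine(W, W_hat)) >= floor_for(layer_name))
```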

Section 04

Technical Implementation of XFP: Weight Decomposition and Storage Modes

Weight Decomposition

Each weight matrix is split into two parts (a minimal sketch follows the list):

  • Sparse FP16 outlier residuals: Capture key outlier weights, stored in full precision; sparse representation reduces overhead
  • Dense sub-byte index tensor: Points to learned codebooks, achieving high compression ratios
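A minimal sketch of the split, assuming outliers are selected by absolute magnitude and that a fixed fraction of entries is kept in FP16; the fraction and the selection rule are illustrative assumptions, and codebook fitting for the dense part is sketched in the next subsection.

```python
import numpy as np

def split_outliers(W: np.ndarray, outlier_frac: float = 0.005):
    """Split W into a sparse FP16 outlier residual and a dense remainder
    destined for codebook indices. Outliers are picked by magnitude."""
    k = max(1, int(outlier_frac * W.size))
    # Magnitude of the k-th largest entry serves as the outlier cutoff.
    cutoff = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= cutoff
    outliers = np.where(mask, W, 0.0).astype(np.float16)   # stored sparsely in practice
    dense = np.where(mask, 0.0, W)                          # quantized against a codebook
    return outliers, dense, mask
```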

Storage Modes

  • V2 mode: Per-channel Lloyd quantization, with each layer independently optimizing its codebooks (see the sketch after this list)
  • V2a mode: Each layer shares a library of 32 codebooks, further reducing storage
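A minimal 1-D Lloyd-Max sketch for the V2 case, fitting one small codebook per output channel and emitting sub-byte indices; the codebook size, iteration count, and initialization are illustrative assumptions, and the V2a shared 32-codebook library is not shown.

```python
import numpy as np

def lloyd_1d(values: np.ndarray, levels: int = 16, iters: int = 20) -> np.ndarray:
    """Fit `levels` scalar codebook entries to `values` with Lloyd iterations."""
    centers = np.quantile(values, np.linspace(0.0, 1.0, levels))
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for c in range(levels):
            members = values[idx == c]
            if members.size:
                centers[c] = members.mean()
    return centers

def quantize_channel(row: np.ndarray, levels: int = 16):
    """V2-style per-channel quantization: learn a codebook for this channel and
    return sub-byte indices (4-bit for 16 levels) plus the codebook."""
    centers = lloyd_1d(row, levels)
    idx = np.argmin(np.abs(row[:, None] - centers[None, :]), axis=1).astype(np.uint8)
    return idx, centers

# Illustrative usage: one 4096-wide channel, 16 codebook entries, reconstruction by lookup.
row = np.random.randn(4096).astype(np.float32)
idx, centers = quantize_channel(row)
reconstruction = centers[idx]
```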

H-Process Memory Adaptation

For models that cannot fit into the target memory, the cosine threshold is adjusted iteratively until the model just fits while still producing reasonable output. The search is bounded by the operator-specified threshold, the out-of-memory (OOM) boundary, and the garbage-generation boundary.
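A minimal sketch of the H-Process loop under these constraints, assuming a callable that reports the packed model size at a given cosine floor; the function names, step size, and boundary values are hypothetical.

```python
def h_process(packed_size_at, mem_budget_bytes,
              operator_floor=0.999, garbage_floor=0.98, step=0.001):
    """Relax the cosine floor until the packed model fits the memory budget
    (the OOM boundary), without dropping below the garbage-generation boundary."""
    floor = operator_floor
    while floor >= garbage_floor:
        if packed_size_at(floor) <= mem_budget_bytes:
            return floor            # highest floor that still fits in memory
        floor -= step               # relax quality to shrink the packed model
    raise RuntimeError("cannot fit the memory budget without crossing the quality boundary")
```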


Section 05

Experimental Results: Dual Verification of Performance and Efficiency

Qwen3.5-122B-A10B Performance

  • Inference speed: 138 tok/s for single-stream decoding, 49% faster than Marlin INT4 (TP=1)
  • Accuracy: 94.49% exact match on GSM8K (3 seeds, 3957 samples)

Qwen3.5-397B-A17B Performance

  • Memory efficiency: Full expert group fits into 2x96GB, effective bit-width ~3.4 bits
  • Inference performance: 100.9 tok/s for long-output decoding, 66.72% exact match on GSM8K (1319-question set)

In both cases, XFP outperforms INT4 solutions in memory, throughput, and accuracy.

Section 06

Technical Advantages and Application Scenarios of XFP

Technical Advantages

  • No calibration data needed: Simplifies deployment process, suitable for data-sensitive scenarios
  • Adaptive configuration: Automatically determines codebook size, outlier budget, and layer packaging strategy
  • Quality assurance: Provides quantifiable quality guarantees via cosine similarity thresholds

Application Scenarios

  • Workstation deployment: Acceleration and memory savings for running large models on consumer hardware
  • Cloud service optimization: Precisely control quality-efficiency trade-offs, optimize resource utilization
  • Edge devices: H-Process automatically adapts to the optimal configuration for memory-constrained devices

Section 07

Limitations and Future Directions

Current Limitations

  • Only targets weight quantization; activation quantization remains to be explored
  • Codebook learning increases model loading time
  • Overhead may exceed benefits for extremely small models

Future Directions

  1. Activation quantization expansion: Apply adaptive methods to activation values
  2. Hardware co-design: Deeply optimize with specific hardware architectures
  3. Dynamic adjustment: Dynamically adjust quantization configurations based on load during runtime