# XFP: Quality Target-Oriented Adaptive Codebook Quantization and Sparse Outlier Separation Technology

> XFP is a dynamic weight quantizer that reverses the traditional workflow—allowing operators to specify a lower bound for reconstruction quality, while the system automatically determines codebook size, outlier budget, and layer packaging strategy, without the need for Hessian matrices, calibration data, or manual bit-width selection.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-14T13:52:31.000Z
- Last activity: 2026-05-15T02:52:32.577Z
- Popularity: 138.0
- Keywords: LLM quantization, weight quantization, codebook quantization, sparse outliers, adaptive quantization, inference acceleration, MoE models, quality targets
- Page URL: https://www.zingnex.cn/en/forum/thread/xfp-llm
- Canonical: https://www.zingnex.cn/forum/thread/xfp-llm
- Markdown source: floors_fallback

---

## [Introduction] XFP: Quality-Driven Adaptive LLM Quantization Technology

XFP is a quality target-oriented adaptive codebook quantization and sparse outlier separation technology. By reversing the traditional quantization workflow, operators specify a lower bound for reconstruction quality, and the system automatically determines codebook size, outlier budget, and layer packaging strategy—without Hessian matrices, calibration data, or manual bit-width selection—providing a more intuitive and reliable quantization solution for LLM deployment.

## Background: Traditional Dilemmas in LLM Quantization

Large Language Models (LLMs) face memory and computational challenges in inference deployment. Quantization is a key optimization technique, but traditional methods have the following limitations:
- Hessian matrices required: dependence on second-order information makes preprocessing expensive at LLM scale
- Calibration data required: representative datasets are needed to search for quantization parameters
- Manual bit-width selection: operators must choose a bit-width for each layer by hand
- Fixed configurations: quantization settings cannot adapt to the characteristics of individual models or layers

## Core Innovations of XFP: Quality-Driven Workflow Reversal and Layered Objectives

### Reverse Traditional Workflow
**Traditional Method**: Operator selects bit-width → System performs quantization → Accept result
**XFP Method**: Operator specifies quality lower bound → System automatically determines configuration → Ensure quality

### Layered Quality Objective Definition
XFP uses per-channel cosine similarity between original and reconstructed weights as its quality metric, and sets two kinds of lower bounds:
1. Strict lower bound: applied to attention layers and shared experts
2. Loose lower bound: applied to MoE routed experts
This split reflects the differing sensitivity of these components to quantization error.
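The per-channel metric can be sketched in a few lines of NumPy. The function names and the row-per-channel convention here are assumptions for illustration, not XFP's actual code:

```python
import numpy as np

def per_channel_cosine(w: np.ndarray, w_hat: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching channels (rows) of the
    original weight matrix and its quantized reconstruction."""
    num = (w * w_hat).sum(axis=1)
    den = np.linalg.norm(w, axis=1) * np.linalg.norm(w_hat, axis=1) + 1e-12
    return num / den

def meets_floor(w: np.ndarray, w_hat: np.ndarray, floor: float) -> bool:
    # A layer passes only if its *worst* channel clears the lower bound.
    return bool(per_channel_cosine(w, w_hat).min() >= floor)
```

A strict floor (for attention layers and shared experts) and a looser one (for routed experts) would simply call `meets_floor` with different thresholds.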

## Technical Implementation of XFP: Weight Decomposition and Storage Modes

### Weight Decomposition
Each weight matrix is split into two parts:
- **Sparse FP16 outlier residuals**: Capture key outlier weights, stored in full precision; sparse representation reduces overhead
- **Dense sub-byte index tensor**: Points to learned codebooks, achieving high compression ratios
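The two-part decomposition can be sketched as follows; the codebook, the outlier fraction, and the dense outlier array (which would be stored sparsely in practice) are illustrative assumptions, not XFP's exact scheme:

```python
import numpy as np

def decompose(w: np.ndarray, codebook: np.ndarray, outlier_frac: float = 0.005):
    """Split a weight matrix into dense codebook indices plus sparse
    FP16 outlier residuals (illustrative sketch)."""
    # Nearest codebook entry for every weight -> sub-byte index tensor.
    idx = np.abs(w[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    residual = w - codebook[idx]
    # Keep only the largest residuals as full-precision outliers.
    k = max(1, int(outlier_frac * w.size))
    thresh = np.partition(np.abs(residual).ravel(), -k)[-k]
    outliers = np.where(np.abs(residual) >= thresh, residual, 0.0).astype(np.float16)
    return idx, outliers  # outliers would be stored sparsely (e.g. COO)

def reconstruct(idx: np.ndarray, outliers: np.ndarray, codebook: np.ndarray):
    return codebook[idx] + outliers.astype(np.float64)
```

The design point is that the dense index tensor carries almost all weights at sub-byte cost, while the few residuals that the codebook cannot represent well are patched back in at full precision.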

### Storage Modes
- **V2 mode**: Per-channel Lloyd quantization, with each layer independently optimizing codebooks
- **V2a mode**: Each layer shares a library of 32 codebooks, further reducing storage
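V2's per-channel step is essentially 1-D k-means (Lloyd's algorithm) over one channel's weights. A minimal sketch, where codebook size, iteration count, and seeding are assumptions:

```python
import numpy as np

def lloyd_codebook(channel: np.ndarray, k: int = 16, iters: int = 25, seed: int = 0):
    """Learn a k-entry codebook for one weight channel via Lloyd's
    algorithm (1-D k-means); returns (codebook, indices)."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(channel, size=k, replace=False).astype(np.float64)
    for _ in range(iters):
        # Assign each weight to its nearest codebook entry...
        idx = np.abs(channel[:, None] - centers[None, :]).argmin(axis=1)
        # ...then move each entry to the mean of its assigned weights.
        for j in range(k):
            members = channel[idx == j]
            if members.size:            # leave empty cells unchanged
                centers[j] = members.mean()
    idx = np.abs(channel[:, None] - centers[None, :]).argmin(axis=1)
    return centers, idx
```

Under V2, each layer would run this independently per channel; under V2a, each channel would instead be matched to the best of the 32 codebooks shared within the layer, trading a little reconstruction quality for much smaller codebook storage.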

### H-Process Memory Adaptation
For models that do not fit into the target memory, XFP iteratively relaxes the cosine threshold until the model just fits while still producing reasonable output. The search is bounded by three constraints: the operator-specified threshold, the out-of-memory (OOM) boundary, and the garbage-generation boundary.
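The H-Process loop can be sketched as a simple downward search over the cosine floor. The callback names, step size, and default bounds below are assumptions, not XFP's published interface:

```python
def h_process(quantize_size, fits_in_memory,
              floor_init: float = 0.995,
              floor_min: float = 0.950,   # stand-in for the garbage-generation boundary
              step: float = 0.005) -> float:
    """Relax the cosine floor until the quantized model fits in memory.

    `quantize_size(floor)` returns the packed model size for a given
    floor; `fits_in_memory(size)` checks the device budget (the OOM
    boundary). Raises if no floor above `floor_min` is feasible.
    """
    floor = floor_init
    while floor >= floor_min - 1e-9:
        if fits_in_memory(quantize_size(floor)):
            return floor                  # smallest relaxation that fits
        floor -= step
    raise MemoryError("no feasible floor above the quality boundary")
```

Starting from the operator's threshold and stopping at the quality boundary is what keeps the result inside all three constraints at once.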

## Experimental Results: Dual Verification of Performance and Efficiency

### Qwen3.5-122B-A10B Performance
- Inference speed: 138 tok/s for single-stream decoding, 49% faster than Marlin INT4 (TP=1)
- Accuracy: 94.49% exact match on GSM8K (3 seeds, 3957 samples)

### Qwen3.5-397B-A17B Performance
- Memory efficiency: Full expert group fits into 2x96GB, effective bit-width ~3.4 bits
- Inference performance: 100.9 tok/s for long-output decoding, 66.72% exact match on GSM8K (1319-question set)
Both configurations outperform INT4 baselines in memory footprint, throughput, and accuracy.
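The ~3.4-bit effective width is consistent with simple accounting. The specific split below (3-bit indices plus ~2.5% FP16 outliers) is purely illustrative; the post does not state the actual index width or outlier fraction:

```python
# Illustrative accounting only -- not the model's actual configuration.
index_bits = 3            # an 8-entry codebook needs 3 bits per weight
outlier_frac = 0.025      # fraction of weights kept as FP16 outliers
fp16_bits = 16
effective_bits = index_bits + outlier_frac * fp16_bits
print(effective_bits)     # ~3.4 bits per weight (codebook storage excluded)
```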

## Technical Advantages and Application Scenarios of XFP

### Technical Advantages
- No calibration data needed: Simplifies deployment process, suitable for data-sensitive scenarios
- Adaptive configuration: Automatically determines codebook size, outlier budget, and layer packaging strategy
- Quality assurance: Provides quantifiable quality guarantees via cosine similarity thresholds

### Application Scenarios
- Workstation deployment: Acceleration and memory savings for running large models on consumer hardware
- Cloud service optimization: Precisely control quality-efficiency trade-offs, optimize resource utilization
- Edge devices: H-Process automatically adapts to the optimal configuration for memory-constrained devices

## Limitations and Future Directions

### Current Limitations
- Only targets weight quantization; activation quantization remains to be explored
- Codebook learning increases model loading time
- Overhead may exceed benefits for extremely small models

### Future Directions
1. Activation quantization expansion: Apply adaptive methods to activation values
2. Hardware co-design: Deeply optimize with specific hardware architectures
3. Dynamic adjustment: Dynamically adjust quantization configurations based on load during runtime
