XFP: Quality Target-Oriented Adaptive Codebook Quantization and Sparse Outlier Separation Technology

XFP is a dynamic weight quantizer that reverses the traditional workflow—allowing operators to specify a lower bound for reconstruction quality, while the system automatically determines codebook size, outlier budget, and layer packaging strategy, without the need for Hessian matrices, calibration data, or manual bit-width selection.

Tags: LLM Quantization · Weight Quantization · Codebook Quantization · Sparse Outliers · Adaptive Quantization · Inference Acceleration · MoE Models · Quality Targets
Published 2026-05-14 21:52 · Recent activity 2026-05-15 10:52 · Estimated read: 7 min

Section 01

[Introduction] XFP: Quality-Driven Adaptive LLM Quantization Technology

XFP is a quality target-oriented adaptive codebook quantization and sparse outlier separation technology. By reversing the traditional quantization workflow, operators specify a lower bound for reconstruction quality, and the system automatically determines codebook size, outlier budget, and layer packaging strategy—without Hessian matrices, calibration data, or manual bit-width selection—providing a more intuitive and reliable quantization solution for LLM deployment.


Section 02

Background: Traditional Dilemmas in LLM Quantization

Large Language Models (LLMs) face memory and computational challenges in inference deployment. Quantization is a key optimization technique, but traditional methods have the following limitations:

  • Require Hessian matrices: Dependent on second-order information, leading to high computational costs
  • Dependent on calibration data: Need representative datasets to search for quantization parameters
  • Manual bit-width selection: Operators must manually select bit-widths for different layers
  • Fixed configurations: Cannot adaptively adjust based on model characteristics

Section 03

Core Innovations of XFP: Quality-Driven Workflow Reversal and Layered Objectives

Reversing the Traditional Workflow

Traditional method: operator selects a bit-width → system performs quantization → result is accepted as-is.
XFP method: operator specifies a quality lower bound → system automatically determines the configuration → quality is guaranteed.
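To make the reversal concrete, here is a minimal sketch of what "the system determines the configuration" could look like: candidate configurations are tried from cheapest to most expensive, and the first one that clears the operator's quality floor is kept. Round-to-nearest uniform quantization and a whole-matrix cosine score stand in for XFP's codebook quantizer and per-channel metric; the function and parameter names are illustrative, not from the source.

```python
import numpy as np

def quality(W: np.ndarray, W_hat: np.ndarray) -> float:
    """Whole-matrix cosine similarity; a stand-in for XFP's per-channel metric."""
    num = float(np.sum(W * W_hat))
    den = float(np.linalg.norm(W) * np.linalg.norm(W_hat)) + 1e-12
    return num / den

def cheapest_config_meeting_floor(W: np.ndarray, floor: float, bit_options=(2, 3, 4)):
    """Try configurations from cheapest to most expensive and keep the first
    one that clears the quality floor. Round-to-nearest uniform quantization
    stands in for XFP's codebook quantizer."""
    for bits in bit_options:
        qmax = 2 ** (bits - 1) - 1
        scale = float(np.max(np.abs(W))) / qmax
        W_hat = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
        if quality(W, W_hat) >= floor:
            return bits, W_hat
    return None, W  # nothing passes: the caller keeps this layer in FP16

# Illustrative usage: the smallest bit-width that keeps cosine >= 0.98 is chosen.
W = np.random.randn(64, 64).astype(np.float32)
bits, W_hat = cheapest_config_meeting_floor(W, floor=0.98)
print("chosen bit-width:", bits)
```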

Layered Quality Objective Definition

XFP uses per-channel cosine similarity as the quality metric and sets two types of lower bounds:

  1. Strict lower bound: For attention layers and shared experts
  2. Loose lower bound: For MoE routed experts

This split reflects how sensitive different components are to overall model performance, as sketched below.
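A minimal sketch of the metric and the layered floors, assuming the per-channel cosine is taken over the output channels (rows) of a weight matrix and that a layer is accepted only when every channel clears its floor; the threshold values and the layer-name rule are illustrative assumptions, not taken from the source.

```python
import numpy as np

STRICT_FLOOR = 0.999   # attention layers and shared experts (illustrative value)
LOOSE_FLOOR = 0.995    # MoE routed experts (illustrative value)

def per_channel_cosine(W: np.ndarray, W_hat: np.ndarray) -> np.ndarray:
    """Cosine similarity between each output channel (row) of the original
    and the reconstructed weight matrix."""
    num = np.sum(W * W_hat, axis=1)
    den = np.linalg.norm(W, axis=1) * np.linalg.norm(W_hat, axis=1) + 1e-12
    return num / den

def floor_for(layer_name: str) -> float:
    """Hypothetical assignment: routed experts get the loose floor,
    everything else the strict one."""
    return LOOSE_FLOOR if "routed_expert" in layer_name else STRICT_FLOOR

def layer_passes(layer_name: str, W: np.ndarray, W_hat: np.ndarray) -> bool:
    # Every channel must clear the layer's floor for the layer to be accepted.
    return bool(np.min(per_channel_cosine(W, W_hat)) >= floor_for(layer_name))
```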

Section 04

Technical Implementation of XFP: Weight Decomposition and Storage Modes

Weight Decomposition

Each weight matrix is split into two parts (a minimal sketch follows the list):

  • Sparse FP16 outlier residuals: Capture key outlier weights, stored in full precision; sparse representation reduces overhead
  • Dense sub-byte index tensor: Points to learned codebooks, achieving high compression ratios
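A minimal sketch of the split, assuming outliers are selected by absolute magnitude and that a fixed fraction of entries is kept in FP16; the fraction and the selection rule are illustrative assumptions, and codebook fitting for the dense part is sketched in the next subsection.

```python
import numpy as np

def split_outliers(W: np.ndarray, outlier_frac: float = 0.005):
    """Split W into a sparse FP16 outlier residual and a dense remainder
    destined for codebook indices. Outliers are picked by magnitude."""
    k = max(1, int(outlier_frac * W.size))
    # Magnitude of the k-th largest entry serves as the outlier cutoff.
    cutoff = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= cutoff
    outliers = np.where(mask, W, 0.0).astype(np.float16)   # stored sparsely in practice
    dense = np.where(mask, 0.0, W)                          # quantized against a codebook
    return outliers, dense, mask
```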

Storage Modes

  • V2 mode: Per-channel Lloyd quantization, with each layer independently optimizing its codebooks (see the sketch after this list)
  • V2a mode: Each layer shares a library of 32 codebooks, further reducing storage
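A minimal 1-D Lloyd-Max sketch for the V2 case, fitting one small codebook per output channel and emitting sub-byte indices; the codebook size, iteration count, and initialization are illustrative assumptions, and the V2a shared 32-codebook library is not shown.

```python
import numpy as np

def lloyd_1d(values: np.ndarray, levels: int = 16, iters: int = 20) -> np.ndarray:
    """Fit `levels` scalar codebook entries to `values` with Lloyd iterations."""
    centers = np.quantile(values, np.linspace(0.0, 1.0, levels))
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for c in range(levels):
            members = values[idx == c]
            if members.size:
                centers[c] = members.mean()
    return centers

def quantize_channel(row: np.ndarray, levels: int = 16):
    """V2-style per-channel quantization: learn a codebook for this channel and
    return sub-byte indices (4-bit for 16 levels) plus the codebook."""
    centers = lloyd_1d(row, levels)
    idx = np.argmin(np.abs(row[:, None] - centers[None, :]), axis=1).astype(np.uint8)
    return idx, centers

# Illustrative usage: one 4096-wide channel, 16 codebook entries, reconstruction by lookup.
row = np.random.randn(4096).astype(np.float32)
idx, centers = quantize_channel(row)
reconstruction = centers[idx]
```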

H-Process Memory Adaptation

For models that cannot fit into the target memory, the cosine threshold is adjusted iteratively until the model just fits while still producing reasonable output. The search is bounded by the operator-specified threshold, the out-of-memory (OOM) boundary, and the garbage-generation boundary.
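A minimal sketch of the H-Process loop under these constraints, assuming a callable that reports the packed model size at a given cosine floor; the function names, step size, and boundary values are hypothetical.

```python
def h_process(packed_size_at, mem_budget_bytes,
              operator_floor=0.999, garbage_floor=0.98, step=0.001):
    """Relax the cosine floor until the packed model fits the memory budget
    (the OOM boundary), without dropping below the garbage-generation boundary."""
    floor = operator_floor
    while floor >= garbage_floor:
        if packed_size_at(floor) <= mem_budget_bytes:
            return floor            # highest floor that still fits in memory
        floor -= step               # relax quality to shrink the packed model
    raise RuntimeError("cannot fit the memory budget without crossing the quality boundary")
```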


Section 05

Experimental Results: Dual Verification of Performance and Efficiency

Qwen3.5-122B-A10B Performance

  • Inference speed: 138 tok/s for single-stream decoding, 49% faster than Marlin INT4 (TP=1)
  • Accuracy: 94.49% exact match on GSM8K (3 seeds, 3957 samples)

Qwen3.5-397B-A17B Performance

  • Memory efficiency: Full expert group fits into 2x96GB, effective bit-width ~3.4 bits
  • Inference performance: 100.9 tok/s for long-output decoding, 66.72% exact match on GSM8K (1319-question set)

In both cases, XFP outperforms INT4 solutions in memory, throughput, and accuracy.

Section 06

Technical Advantages and Application Scenarios of XFP

Technical Advantages

  • No calibration data needed: Simplifies deployment process, suitable for data-sensitive scenarios
  • Adaptive configuration: Automatically determines codebook size, outlier budget, and layer packaging strategy
  • Quality assurance: Provides quantifiable quality guarantees via cosine similarity thresholds

Application Scenarios

  • Workstation deployment: Acceleration and memory savings for running large models on consumer hardware
  • Cloud service optimization: Precisely control quality-efficiency trade-offs, optimize resource utilization
  • Edge devices: H-Process automatically adapts to the optimal configuration for memory-constrained devices

Section 07

Limitations and Future Directions

Current Limitations

  • Only targets weight quantization; activation quantization remains to be explored
  • Codebook learning increases model loading time
  • Overhead may exceed benefits for extremely small models

Future Directions

  1. Activation quantization expansion: Apply adaptive methods to activation values
  2. Hardware co-design: Deeply optimize with specific hardware architectures
  3. Dynamic adjustment: Dynamically adjust quantization configurations based on load during runtime