Section 01
SFMP Framework: Introduction to an Efficient Mixed-Precision Quantization Solution for Large Language Models
SFMP is a novel mixed-precision quantization framework designed for large language model deployment. It introduces four key innovations: fractional bitwidth, block-level mixed precision, row-column weight rearrangement, and a unified GEMM kernel. Together, these address the high search cost and poor hardware efficiency of traditional mixed-precision methods, achieving a strong balance between compression ratio and inference efficiency.
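As a rough illustration of how fractional bitwidth can arise from block-level mixed precision, consider assigning different integer bitwidths to different weight blocks: the tensor-wide average then lands between integers. The 2-bit/4-bit split and equal block sizes below are assumptions for illustration only, not SFMP's actual assignment policy:

```python
# Minimal sketch: block-level mixed precision yields a fractional
# average bitwidth. The specific bitwidths (2 and 4) and equal-sized
# blocks are hypothetical choices for illustration.

def average_bitwidth(block_bits: list[int]) -> float:
    """Average bits per weight, assuming all blocks have equal size."""
    return sum(block_bits) / len(block_bits)

# Three blocks quantized to 4-bit and one to 2-bit average out
# to a fractional 3.5 bits per weight.
print(average_bitwidth([4, 4, 4, 2]))  # 3.5
```

The point of the sketch is that per-block assignment lets the effective compression rate be tuned continuously, rather than being locked to whole-number bitwidths.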