Quantization Compression
Convert high-precision floating-point weights to low-precision representations (e.g., FP32 to INT4), giving a theoretical compression ratio of up to 8x. UltraCompress may use fine-grained techniques such as group quantization, outlier-aware quantization, and learned quantization to balance compression ratio against quality; a sketch of group quantization follows.
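The snippet below is a minimal sketch of group quantization to INT4 written with NumPy; the function names (`quantize_int4_grouped`, `dequantize`) and the group size are illustrative assumptions, not UltraCompress's actual API. Each group of weights gets its own scale, which is what limits quantization error compared with a single per-tensor scale.

```python
import numpy as np

def quantize_int4_grouped(weights: np.ndarray, group_size: int = 64):
    """Quantize a 1-D weight vector to INT4 using one scale per group."""
    assert weights.size % group_size == 0
    groups = weights.reshape(-1, group_size)
    # One scale per group: map the group's max absolute value onto the INT4 range [-8, 7]
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP32 weights from INT4 values and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(256).astype(np.float32)
q, s = quantize_int4_grouped(w)
print("max reconstruction error:", np.abs(dequantize(q, s) - w).max())
```

The 8x figure refers to the 32-bit-to-4-bit payload; in practice the per-group scales add a small storage overhead.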
Sparsification and Pruning
Identify and remove redundant parameters, using either structured sparsity (removing whole neurons or channels) or unstructured sparsity (removing individual weights, typically those with the smallest magnitudes). A progressive pruning strategy may be used so the remaining weights can adapt to the increasingly compact structure.
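As a concrete illustration, here is a minimal sketch of unstructured magnitude pruning with a progressive sparsity schedule; the helper name `magnitude_prune` and the schedule values are assumptions for the example, and the fine-tuning step between pruning rounds is only indicated by a comment.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured pruning: zero out the smallest-magnitude weights."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value, then keep only larger weights
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Progressive schedule: raise the sparsity target in steps, fine-tuning in between
w = np.random.randn(512, 512).astype(np.float32)
for target in (0.3, 0.5, 0.7, 0.9):
    w = magnitude_prune(w, target)
    # ... fine-tune the model here so the surviving weights adapt to the sparser structure ...
    print(f"target {target:.0%}: nonzero fraction = {np.count_nonzero(w) / w.size:.3f}")
```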
Matrix Decomposition and Low-Rank Approximation
Leverage the approximate low-rank structure of weight matrices, decomposing them into products of smaller matrices via SVD or related methods. This is especially suitable for attention layers and fully connected layers, and the rank and decomposition strategy may be selected adaptively per layer.
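The sketch below shows truncated-SVD low-rank factorization of a single weight matrix; the function name `low_rank_approx` and the chosen rank are illustrative assumptions. Replacing a dense layer's matrix with the two factors reduces both parameters and multiply-accumulate cost when the rank is well below the matrix dimensions.

```python
import numpy as np

def low_rank_approx(W: np.ndarray, rank: int):
    """Factor W (out x in) into A (out x rank) @ B (rank x in) via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # absorb the singular values into the left factor
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 1024).astype(np.float32)
A, B = low_rank_approx(W, rank=128)
print(f"parameter ratio: {(A.size + B.size) / W.size:.2f}")  # 0.25 for rank 128
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

Choosing the rank per layer (e.g., from the singular-value spectrum) is one way the adaptive strategy selection mentioned above could work.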
Knowledge Distillation
Train a small student model to mimic the predictions, soft labels, and intermediate-layer representations of a large teacher model, inheriting the teacher's generalization ability while remaining compact.
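Below is a minimal sketch of a standard distillation loss in PyTorch that combines the teacher's temperature-softened soft labels with the ground-truth hard labels; the function name `distillation_loss` and the `temperature`/`alpha` values are assumptions for the example, and intermediate-layer matching is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Blend a soft-label KL term (teacher) with a hard-label cross-entropy term."""
    # Soft targets: match the teacher's temperature-smoothed distribution
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard supervised loss on the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Random tensors stand in for real student/teacher outputs in this example
student = torch.randn(8, 100)
teacher = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(distillation_loss(student, teacher, labels))
```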