Section 01
EdgeRazor Framework Overview: An Efficient Solution for Lightweight LLM Deployment on Edge Devices
The EdgeRazor framework, open-sourced by the Nanjing University team, enables efficient deployment of large language models (LLMs) on edge devices via mixed-precision quantization-aware distillation. It supports quantization precisions from 1.58-bit (ternary weights) to 4-bit and achieves a maximum compression ratio of 7.03x on the Qwen3-0.6B model, balancing extreme compression against model capability.
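To make the 1.58-bit end of the precision range concrete, here is a minimal sketch of ternary weight quantization in the style popularized by BitNet b1.58 (absmean scaling). This is an illustrative assumption, not EdgeRazor's documented scheme: each weight is scaled by the mean absolute value of the tensor, rounded, and clipped to {-1, 0, +1}, which costs log2(3) ≈ 1.58 bits per weight.

```python
import math

def ternary_quantize(weights):
    """Absmean ternary quantization (BitNet-b1.58-style sketch; an
    assumption — EdgeRazor's exact algorithm is not documented here)."""
    # Scale by the mean absolute weight, then round and clip to {-1, 0, +1}.
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate real-valued weights from the ternary codes."""
    return [v * scale for v in q]

# Toy weight vector standing in for one layer's parameters.
w = [0.42, -1.3, 0.05, 0.9, -0.02, -0.7]
q, s = ternary_quantize(w)

# Storage cost per weight drops from 16 bits (FP16) to log2(3) ≈ 1.58 bits.
bits_per_weight = math.log2(3)
```

In quantization-aware distillation, a dequantized forward pass like `dequantize(q, s)` would be used inside training so the student learns under the quantization error, with a full-precision teacher supplying the distillation targets.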