
Large Model Quantization Practice on Huawei Ascend NPU: Technical Analysis of vLLM-ascend-quant-hust

The Ascend NPU quantization tool open-sourced by the Huazhong University of Science and Technology team supports the deployment of large language models with W8A8 and W4A4 precision, providing an efficient model compression solution for the domestic AI chip ecosystem.

Huawei Ascend NPU, large models, quantization, post-training quantization, W8A8, W4A4, Qwen, model compression, domestic AI chips, msmodelslim

Published 2026-06-10 14:15 · Recent activity 2026-06-10 14:19 · Estimated read 7 min

Section 01

[Introduction] Large Model Quantization Practice on Huawei Ascend NPU: Technical Analysis of vLLM-ascend-quant-hust

The vLLM-ascend-quant-hust project open-sourced by the Huazhong University of Science and Technology team provides a post-training quantization solution specifically for Huawei Ascend NPUs, supporting the deployment of large language models with W8A8 and W4A4 precision. It fills the gap in model compression for domestic AI chips and facilitates the deployment of large models on localized computing infrastructure. Currently, the project mainly supports the Qwen (Tongyi Qianwen) series models, lowering the technical threshold for developers to deploy quantized models on the Ascend platform.

Section 02

Background: Computing Power Dilemma in Large Model Deployment and the Value of Quantization Technology

As the parameter counts of large language models grow, the compute and memory bandwidth required for inference have become deployment bottlenecks. Huawei's Ascend NPU, a representative domestic AI chip, needs quantization schemes tuned to its Da Vinci architecture. Quantization lowers the precision of weights and activations, which reduces memory usage and can accelerate computation: going from FP32 to INT8 shrinks weight storage by a theoretical 4x, and lower-precision schemes such as W4A4 (4-bit weights, 4-bit activations) are especially valuable on memory-constrained edge devices.
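The compression figures above follow from simple arithmetic on bytes per parameter. The sketch below is illustrative only: the function names are hypothetical, it counts weights alone, and it ignores the small overhead of quantization scales and zero points.

```python
# Bytes needed to store one parameter in each format (weights only).
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def compression_ratio(src: str, dst: str) -> float:
    """How many times smaller the weights become when moving from src to dst."""
    return BYTES_PER_PARAM[src] / BYTES_PER_PARAM[dst]


if __name__ == "__main__":
    # A 7B-parameter model in each format.
    for fmt in BYTES_PER_PARAM:
        print(f"{fmt}: {weight_memory_gb(7e9, fmt):.1f} GB")
    print("FP32 -> INT8 ratio:", compression_ratio("FP32", "INT8"))
```

Activations, the KV cache, and runtime buffers add to these numbers, so real memory usage during inference will be higher than the weight-only estimate.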

Section 03

Project Overview: Post-Training Quantization Solution Exclusive to Ascend NPU

vLLM-ascend-quant-hust is built on Huawei's msmodelslim toolchain and supports W8A8 and W4A4 quantization. The project mainly targets the Qwen2.5 and Qwen3 architectures and ships ready-to-use configuration files and scripts, lowering the threshold for deploying quantized models on the Ascend platform. It reflects the trend toward co-development of domestic large models and domestic computing chips.

Section 04

Core Technology: Details of W8A8 and W4A4 Quantization Schemes

W8A8 Quantization

W8A8 quantization is a one-command workflow built on the msmodelslim quant command-line tool. It supports loading Hugging Face models, specifying the NPU device, Qwen-specific architecture optimizations, and calibration configuration. Relative to 16-bit weights, it cuts weight memory by about 50% (a 7B model drops from roughly 14 GB to 7 GB) and is reported to raise inference throughput by 1.5-2x.
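To make the idea behind W8A8 concrete, here is a minimal NumPy sketch of symmetric 8-bit quantization applied to both weights and activations. It is not the msmodelslim implementation and does not use its API; the function names, the per-tensor activation scale, and the per-output-channel weight scales are assumptions chosen for illustration.

```python
import numpy as np


def quantize_int8(x: np.ndarray, axis=None):
    """Symmetric INT8 quantization so that x is approximately q * scale."""
    max_abs = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
    scale = np.maximum(max_abs, 1e-8) / 127.0  # guard against all-zero input
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def w8a8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Approximate x @ w.T using INT8 operands and INT32 accumulation."""
    qx, sx = quantize_int8(x)          # one scale for the whole activation tensor
    qw, sw = quantize_int8(w, axis=1)  # one scale per output channel, shape (out, 1)
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T  # integer matmul, shape (batch, out)
    return acc.astype(np.float32) * sx * sw.T          # rescale back to floating point


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 64)).astype(np.float32)
    w = rng.standard_normal((8, 64)).astype(np.float32)
    ref = x @ w.T
    out = w8a8_matmul(x, w)
    print("relative error:", np.linalg.norm(out - ref) / np.linalg.norm(ref))
```

In a post-training workflow the activation scale is normally estimated from calibration data ahead of time rather than computed from each input as done here.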

W4A4 Quantization

W4A4 quantization relies on custom scripts that implement group quantization and mixed precision. A dedicated calibration dataset (qwen_qwen3_cot_w4a4.json) is used to minimize quantization error. The scripts support Qwen3 and Qwen2.5 and run with a batch size of 1, which suits single-sample inference scenarios.
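Group quantization gives each small block of weights its own scale, so a single outlier only degrades its own group instead of a whole row. The NumPy sketch below illustrates that idea for 4-bit weights; it is not the project's script, the function names and the group size of 32 are assumptions, and it leaves out the mixed-precision part.

```python
import numpy as np


def quantize_int4_groups(w: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit quantization with one scale per group along each row."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    groups = w.reshape(out_features, in_features // group_size, group_size)
    max_abs = np.max(np.abs(groups), axis=-1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 7.0  # largest magnitude maps to +/-7
    # 4-bit signed range is [-8, 7]; values are held in int8 for simplicity.
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale


def dequantize_groups(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Rebuild a float matrix of shape (out_features, in_features)."""
    return (q.astype(np.float32) * scale).reshape(q.shape[0], -1)


def mean_abs_error(w: np.ndarray, group_size: int) -> float:
    q, scale = quantize_int4_groups(w, group_size)
    return float(np.mean(np.abs(w - dequantize_groups(q, scale))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((16, 128)).astype(np.float32)
    w[:, 0] = 50.0  # inject one outlier per row
    print("one group per row :", mean_abs_error(w, 128))
    print("groups of 32      :", mean_abs_error(w, 32))
```

Smaller groups lower the error at the cost of storing more scales, which is the trade-off a group size has to balance.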

Quality Evaluation

The project includes a perplexity (PPL) evaluation script for checking how well a quantized model still predicts text, so developers can confirm usability before deployment.
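Perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows only that formula; it is not the project's evaluation script, and it assumes the per-token natural-log probabilities have already been obtained from a model.

```python
import math


def perplexity(token_log_probs):
    """PPL = exp(-mean(log p(token_i | preceding tokens)))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    print(perplexity([math.log(0.25)] * 4))
```

A typical check compares the perplexity of the original and the quantized model on the same text; a small increase suggests the quantization preserved most of the model's predictive ability.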

Section 05

Technical Ecosystem: Dependent on Huawei's Full-Stack AI Software System

The project's core dependencies include:

CANN (Compute Architecture for Neural Networks): The low-level software stack and runtime for Ascend chips
msmodelslim: Huawei's official model compression toolkit
PyTorch Ascend-adapted version: Upper-layer deep learning framework

Users need to configure a complete Ascend development environment (driver, CANN toolkit, Python virtual environment). The project simplifies the process via requirements.txt and conda configurations.

Section 06

Application Scenarios: Localized Deployment and Ecosystem Value

Typical application scenarios:

Government/Finance: Meet security requirements for localized deployment
Edge Computing: Run large models on resource-constrained devices
High-concurrency Services: Increase single-card throughput and reduce hardware costs

The open-sourcing of the project helps break the monopoly of NVIDIA's CUDA ecosystem and promotes the diversified development of AI infrastructure.

Section 07

Limitations and Outlook: Future Optimization Directions

Current limitations: The project only supports the Qwen series of models, which limits its generality.

Future directions:

Expand support for mainstream models like LLaMA and Baichuan
Implement dynamic quantization strategies
Deeply integrate with the vLLM inference engine
Support KV Cache quantization to reduce memory pressure for long-sequence inference
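The case for KV Cache quantization comes from how the cache grows with sequence length. The sketch below estimates its size with the standard formula; the model dimensions in the example are hypothetical and are not taken from any specific Qwen configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2.0):
    """Keys and values: 2 tensors per layer, each [batch, kv_heads, seq_len, head_dim]."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 32K-token context.
    cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
    for name, nbytes in [("FP16", 2.0), ("INT8", 1.0)]:
        gib = kv_cache_bytes(**cfg, bytes_per_elem=nbytes) / 1024**3
        print(f"{name} KV cache: {gib:.1f} GiB")
```

Because the cache scales linearly with sequence length and batch size, halving the bytes per element halves the memory that long-sequence inference has to reserve.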

Section 08

Conclusion: Important Progress in Domestic AI Chip Ecosystem Construction

vLLM-ascend-quant-hust marks an important step forward for the domestic AI chip software ecosystem. It adapts state-of-the-art quantization techniques to the Ascend platform, providing a feasible path for deploying large models on localized infrastructure. As domestic large models and computing chips continue to iterate, the value of such low-level tools will become increasingly prominent.