Zing Forum


NVFP4 Quantization Breakthrough: Running the Qwen3.5-35B MoE Model on a Single RTX 5090

This article shows how to run the Qwen3.5-35B MoE model efficiently on a single RTX 5090 graphics card using NVIDIA's NVFP4 quantization format. Combining the vLLM inference engine with 4-bit floating-point quantization makes high-performance deployment of a large model practical on consumer-grade hardware.

Tags: NVFP4, Qwen3.5, MoE, vLLM, RTX 5090, model quantization, large model inference, Blackwell architecture, consumer-grade GPU, 4-bit quantization
Published 2026-04-23 20:46 | Recent activity 2026-04-23 20:53 | Estimated read 9 min

Section 01

[Main Floor] NVFP4 Quantization Breakthrough: Guide to Running the Qwen3.5-35B MoE Model on a Single RTX 5090

This article explains how to run the Qwen3.5-35B MoE model efficiently on a single RTX 5090 graphics card by combining NVIDIA's NVFP4 quantization format with the vLLM inference engine. The result is high-performance deployment of a large model on consumer-grade hardware, without the multiple high-end graphics cards or professional acceleration cards that models of this size have traditionally required.


Section 02

[Background] Challenges of Running Large Models on Consumer GPUs and the Breakthrough of NVFP4

The parameter scale of large language models continues to grow. Traditionally, running a 35-billion-parameter model has required multiple high-end graphics cards or professional AI acceleration cards: at FP16 (2 bytes per parameter) the weights alone occupy about 70GB, more than any single consumer card provides. With advances in model quantization, especially the NVFP4 format introduced with NVIDIA's Blackwell architecture, running models of this size on consumer-grade graphics cards has become practical.


Section 03

[Method 1] Introduction to the Qwen3.5-35B MoE Model

Qwen3.5 is a new generation of large language models from Alibaba Cloud's Tongyi Qianwen team. The 35B MoE version uses a sparsely activated architecture: it has 35 billion parameters in total, but only about 2-4 billion of them are activated for each token, which greatly reduces inference computation while maintaining strong capabilities. The core advantages of the MoE architecture include:
1. Parameter efficiency: through the expert routing mechanism, the model can hold more total parameters without a proportional increase in per-token compute. All experts must still be resident in memory, which is why quantization matters for single-card deployment.
2. Specialized capabilities: different expert networks tend to specialize in different kinds of inputs or knowledge during training.
3. Scalability: model capacity can be increased naturally by adding experts.
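
The routing idea can be illustrated with a minimal sketch. It is not Qwen3.5's actual router: the expert count, the top-k value of 2, the router scores, and the toy "experts" that merely scale their input are all assumptions chosen for illustration. It only shows that a token is processed by the few experts with the highest router scores, and that the remaining experts are never called.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the router probabilities renormalized over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts are skipped."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): expert i multiplies its input by i + 1.
NUM_EXPERTS = 8
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, NUM_EXPERTS + 1)]

token = [1.0, -2.0, 0.5]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]  # made-up router scores

selected = route_top_k(logits, k=2)
output = moe_forward(token, experts, logits, k=2)
print("selected experts:", selected)
print("output:", output)
```

With these made-up scores only two of the eight experts run for this token, which is the source of the gap between total and activated parameters.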


Section 04

[Method 2] Analysis of NVIDIA NVFP4 Quantization Technology

NVFP4 is a 4-bit floating-point quantization format that NVIDIA introduced with the Blackwell architecture. Compared with traditional INT4/INT8 quantization, it offers:
1. Dynamic range preservation: a floating-point format better represents weights and activations whose values span a wide range, reducing precision loss.
2. Native hardware support: Blackwell Tensor Cores, including those in RTX 50 series GPUs, execute NVFP4 operations directly.
3. Fine-grained scaling: each block of 16 values carries its own FP8 scaling factor, with an additional per-tensor scale, so the format adapts to different parameter distributions.
Comparison table:

Quantization Format | Bit Width | Precision Loss | Hardware Support | Application Scenario
FP16 | 16 bits | None | Wide | Training & Inference
INT8 | 8 bits | Low | Wide | General Inference
INT4 | 4 bits | Medium | Partial | Resource-Constrained Scenarios
NVFP4 | 4 bits | Lower | Blackwell+ | Next-Gen Inference
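
Block-wise scaling can be sketched in a few lines. This is a simplified illustration, not NVIDIA's implementation: it uses the eight magnitudes representable by a 4-bit E2M1 value and one scale per block of 16, but it keeps each scale as an ordinary Python float (real NVFP4 stores block scales in FP8 and adds a per-tensor scale), and the sample weights are random numbers generated for the example.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16

def quantize_block(block):
    """Quantize one block: choose a scale so the largest value maps to 6.0."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(nearest, v))
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate values from a block's scale and 4-bit codes."""
    return [scale * c for c in codes]

def quantize(values):
    """Split values into blocks of BLOCK_SIZE and quantize each independently."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]

# Two blocks with very different magnitudes (synthetic data for illustration).
rng = random.Random(0)
weights = ([rng.gauss(0, 0.02) for _ in range(BLOCK_SIZE)]
           + [rng.gauss(0, 2.0) for _ in range(BLOCK_SIZE)])

blocks = quantize(weights)
restored = [v for scale, codes in blocks for v in dequantize_block(scale, codes)]

for n, (scale, _) in enumerate(blocks):
    chunk = range(n * BLOCK_SIZE, (n + 1) * BLOCK_SIZE)
    err = max(abs(weights[i] - restored[i]) for i in chunk)
    print(f"block {n}: scale={scale:.5f} max_abs_error={err:.5f}")
```

Because each block gets its own scale, the block of small weights is not crushed to zero by the large values in the neighboring block, which is the practical benefit of fine-grained scaling.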

Section 05

[Technical Support] vLLM Inference Engine and RTX 5090 Hardware Capabilities

vLLM is a high-throughput LLM inference engine that originated at the Sky Computing Lab at the University of California, Berkeley. It uses the PagedAttention algorithm to optimize memory management. Relevant features include:
1. Continuous batching: dynamically combining multiple requests to maximize GPU utilization.
2. Paged attention cache: dividing the KV cache into fixed-size blocks to reduce memory fragmentation and redundant computation.
3. Quantized model support: loading 4-bit quantized checkpoints and running them through low-precision kernels, so that the NVFP4 hardware acceleration is actually used.

As the flagship consumer-grade graphics card of the Blackwell architecture, the RTX 5090 provides:
1. 32GB of GDDR7 memory: enough to hold the roughly 17-18GB of 4-bit weights of a 35B model while leaving room for the KV cache.
2. NVFP4 acceleration: Tensor Cores natively support 4-bit floating-point operations.
3. High memory bandwidth: model weights can be read quickly, which matters because token generation is largely memory-bound.
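
The paged KV cache idea can be shown with a toy allocator. This is a conceptual sketch and not vLLM's code: the class name `PagedKVCache`, its methods, and the block counts are hypothetical. It only models the bookkeeping, in which each sequence owns a block table mapping its tokens to fixed-size physical blocks that are allocated on demand and returned to a shared pool when the sequence finishes.

```python
class PagedKVCache:
    """Toy block-table bookkeeping for a paged KV cache (no tensors involved)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; allocate a block only when needed.

        Returns (physical_block_id, offset_within_block) for the new token.
        """
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):          # sequence "a" stores 20 tokens -> needs 2 blocks
    cache.append_token("a")
for _ in range(5):           # sequence "b" stores 5 tokens -> needs 1 block
    cache.append_token("b")

blocks_a = len(cache.block_tables["a"])
blocks_b = len(cache.block_tables["b"])
free_before = len(cache.free_blocks)
cache.free("a")              # finished request releases its blocks immediately
free_after = len(cache.free_blocks)
print(blocks_a, blocks_b, free_before, free_after)
```

At most one partially filled block per sequence is wasted, and freed blocks can be reused by any other request, which is what lets continuous batching pack more requests into the same memory.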


Section 06

[Deployment Practice] Environment and Optimization Points for Qwen3.5-35B MoE Model Deployment

Environment preparation requires:
1. CUDA 12.8+ (the Blackwell architecture needs a recent toolchain).
2. vLLM 0.11+ (supports Blackwell and NVFP4).
3. NVFP4-format weights, either downloaded pre-quantized or produced with a tool that can export NVFP4, such as NVIDIA TensorRT Model Optimizer (the quantization toolkit used with TensorRT-LLM) or LLM Compressor.

Performance optimization points:
1. Set a reasonable maximum sequence length (e.g., 8K/16K/32K).
2. Balance batch size against latency and memory use.
3. Configure vLLM's gpu_memory_utilization parameter, which sets the fraction of GPU memory vLLM may use; whatever remains after weights and activations becomes KV cache space.

Extending context length: longer context windows, supported through Rotary Position Embedding (RoPE) scaling and vLLM's memory optimizations, are crucial for tasks such as document analysis and code understanding.
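
The memory budget behind these settings can be estimated with simple arithmetic. The sketch below is a rough estimate under stated assumptions, not a measurement: it assumes every one of the 35 billion parameters is stored in 4 bits with one 8-bit scale per 16 values, treats 2GB of runtime overhead as a placeholder, and uses a hypothetical local model path. The `vllm serve` options `--max-model-len` and `--gpu-memory-utilization` are the ones the text refers to; the command is only assembled here, not executed.

```python
def weight_gb(num_params, bits_per_weight, block_size=16, scale_bits=8):
    """Size of block-quantized weights in GB (10^9 bytes), including block scales."""
    bits = num_params * (bits_per_weight + scale_bits / block_size)
    return bits / 8 / 1e9

def kv_budget_gb(total_gb, gpu_memory_utilization, weights_gb, overhead_gb):
    """Memory left for the KV cache inside the fraction vLLM is allowed to use."""
    return total_gb * gpu_memory_utilization - weights_gb - overhead_gb

def vllm_serve_args(model_path, max_model_len, gpu_memory_utilization):
    """Assemble a `vllm serve` command line as a list of arguments."""
    return ["vllm", "serve", model_path,
            "--max-model-len", str(max_model_len),
            "--gpu-memory-utilization", str(gpu_memory_utilization)]

NUM_PARAMS = 35e9
payload = weight_gb(NUM_PARAMS, 4, scale_bits=0)   # 4-bit values only
with_scales = weight_gb(NUM_PARAMS, 4)             # plus one 8-bit scale per 16 values

# Assumptions: 32 GB card, 90% usable by vLLM, 2 GB placeholder for activations etc.
budget = kv_budget_gb(total_gb=32, gpu_memory_utilization=0.90,
                      weights_gb=with_scales, overhead_gb=2.0)

print(f"4-bit payload:      {payload:.2f} GB")
print(f"with block scales:  {with_scales:.2f} GB")
print(f"KV cache budget:    {budget:.2f} GB")

# "./qwen3.5-35b-moe-nvfp4" is a hypothetical path to an NVFP4 checkpoint.
print(" ".join(vllm_serve_args("./qwen3.5-35b-moe-nvfp4", 16384, 0.90)))
```

The 4-bit payload alone matches the 17-18GB figure quoted earlier; block scales and any layers left in higher precision add to it, so the KV cache budget, and therefore the usable context length and batch size, should be checked against vLLM's startup log on the actual machine.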


Section 07

[Applications and Significance] Application Scenarios and Value of Single-Card Large Model Deployment

Application scenarios include:
1. Localized AI assistant: running the model locally protects privacy and reduces latency.
2. Development and testing environment: developers can iterate and debug on a personal workstation without expensive servers.
3. Edge inference nodes: a single card in place of a multi-GPU server simplifies system architecture in edge scenarios.
4. Model fine-tuning experiments: a lower hardware threshold for inference makes it easier to try out model variants and promotes application innovation.


Section 08

[Conclusion and Outlook] Development Trends of AI Inference Hardware and Quantization Technology

NVFP4 marks a new stage in AI inference hardware:
1. Quantization precision moves from barely usable to production-ready.
2. The gap between consumer-grade and professional-grade hardware narrows.
3. Model architecture and hardware are increasingly co-designed, with MoE and quantization reinforcing each other in a virtuous cycle.

In the future, model distillation, quantization, and hardware acceleration will advance together, and models with tens of billions of parameters will run on a wider range of devices, promoting the democratization of AI.