# NVFP4 Quantization Breakthrough: Running the Qwen3.5-35B MoE Model on a Single RTX 5090

> This article introduces how to efficiently run the Qwen3.5-35B MoE model on a single RTX 5090 graphics card using NVIDIA's latest NVFP4 quantization technology. Through the vLLM inference engine and 4-bit floating-point quantization, high-performance deployment of large models on consumer-grade hardware is achieved.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-23T12:46:44.000Z
- Last activity: 2026-04-23T12:53:07.294Z
- Popularity: 163.9
- Keywords: NVFP4, Qwen3.5, MoE, vLLM, RTX 5090, model quantization, large model inference, Blackwell architecture, consumer-grade GPU, 4-bit quantization
- Page link: https://www.zingnex.cn/en/forum/thread/rtx-5090qwen3-5-35b-moe-nvfp4vllm
- Canonical: https://www.zingnex.cn/forum/thread/rtx-5090qwen3-5-35b-moe-nvfp4vllm
- Markdown source: floors_fallback

---

## [Main Floor] NVFP4 Quantization Breakthrough: Guide to Running the Qwen3.5-35B MoE Model on a Single RTX 5090

This guide covers how to run the Qwen3.5-35B MoE model efficiently on a single RTX 5090 by combining NVIDIA's NVFP4 quantization with the vLLM inference engine. The result is a high-performance deployment on consumer-grade hardware for a model size that has traditionally required multiple high-end graphics cards or professional acceleration cards.

## [Background] Challenges of Running Large Models on Consumer GPUs and the Breakthrough of NVFP4

The parameter counts of large language models keep growing. Running a 35-billion-parameter model has traditionally required multiple high-end graphics cards or professional AI accelerators: at FP16, the weights alone take roughly 70GB. Advances in model quantization, in particular the NVFP4 format that NVIDIA introduced with the Blackwell architecture, now make it practical to run models of this size on a consumer-grade graphics card.

## [Method 1] Introduction to Qwen3.5-35B MoE Model

Qwen3.5 is a new generation of large language models from Alibaba Cloud's Tongyi Qianwen (Qwen) team. The 35B MoE version uses a sparsely activated architecture: it has 35 billion parameters in total, but only about 2-4 billion of them are activated for each token, which greatly reduces inference computation while preserving strong capabilities. Note that all 35 billion parameters still have to reside in memory, which is why weight quantization matters for single-card deployment. The core advantages of the MoE architecture include:

1. Parameter efficiency: through the expert routing mechanism, the model can have more total parameters without a proportional increase in per-token compute.
2. Specialized capabilities: different expert networks tend to specialize in different kinds of inputs or knowledge.
3. Scalability: model capacity can be increased by adding experts.
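The routing mechanism described above can be illustrated with a minimal top-k router in plain Python. This is only a sketch: the expert count, `k`, and the share of non-expert parameters are hypothetical values chosen for illustration, not the actual Qwen3.5 configuration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def active_fraction(num_experts, k, shared_fraction):
    """Fraction of all parameters used for one token.

    shared_fraction is the share of parameters outside the experts
    (attention, embeddings), which every token uses.
    """
    expert_fraction = 1.0 - shared_fraction
    return shared_fraction + expert_fraction * k / num_experts


# Hypothetical configuration, NOT the real Qwen3.5 layout.
NUM_EXPERTS, TOP_K, SHARED = 64, 4, 0.05

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route_top_k(logits, TOP_K)
print("selected experts:", sorted(weights))
print("active fraction: %.3f" % active_fraction(NUM_EXPERTS, TOP_K, SHARED))
```

With these assumed numbers, roughly 11% of the parameters are used per token, which for a 35B model lands inside the 2-4 billion range quoted above.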

## [Method 2] Analysis of NVIDIA NVFP4 Quantization Technology

NVFP4 is a 4-bit floating-point quantization format that NVIDIA introduced with the Blackwell architecture. Compared with traditional INT4/INT8 quantization, it offers:

- Dynamic range preservation: a floating-point format better represents weights and activations whose values span a wide range, reducing precision loss.
- Native hardware support: the Tensor Cores in RTX 50 series GPUs execute NVFP4 operations directly.
- Fine-grained scaling: each small block of values carries its own scaling factor, so the format adapts to different parameter distributions.

Comparison of quantization formats:

| Quantization Format | Bit Width | Precision Loss (vs. FP16) | Hardware Support | Application Scenario |
|---------------------|-----------|---------------------------|------------------|----------------------|
| FP16                | 16 bits   | None (baseline)           | Wide             | Training & Inference |
| INT8                | 8 bits    | Low                       | Wide             | General Inference    |
| INT4                | 4 bits    | Medium                    | Partial          | Resource-Constrained Scenarios |
| NVFP4               | 4 bits    | Lower than INT4           | Blackwell and later | Next-Gen Inference |
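To make the block-wise scaling idea concrete, here is a simplified sketch of FP4 block quantization in plain Python. It assumes the E2M1 value set (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a block size of 16. Unlike the real format, it keeps each block scale as an ordinary Python float rather than an FP8 value with an additional per-tensor scale, so treat it as an illustration of the principle, not a bit-exact implementation.

```python
import math

E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # FP4 magnitudes
BLOCK_SIZE = 16  # one shared scale per 16 elements


def quantize_block(block):
    """Return (scale, codes), where each code is a signed E2M1 value."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_VALUES[-1]  # map the largest magnitude to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its codes."""
    return [scale * c for c in codes]


def quantize(values):
    """Quantize a flat list block by block and return the reconstruction."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


# A block of small values followed by a block containing a large outlier.
small = [0.01 * i for i in range(16)]
outlier = [100.0] + [0.0] * 15
restored = quantize(small + outlier)
worst = max(abs(a - b) for a, b in zip(small, restored[:16]))
print("worst error in the small block:", worst)
```

Because each block has its own scale, the outlier in the second block does not degrade the resolution of the first one. With a single tensor-wide scale, every value in the small block would round to zero.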

## [Technical Support] vLLM Inference Engine and RTX 5090 Hardware Capabilities

vLLM is a high-throughput LLM inference engine that originated at the Sky Computing Lab at the University of California, Berkeley. It is built around the PagedAttention algorithm for managing KV cache memory. The features relevant here are:

1. Continuous batching: dynamically combines multiple requests to maximize GPU utilization.
2. Paged attention cache: divides the KV cache into fixed-size blocks, which reduces memory fragmentation and lets requests that share a prefix reuse cached blocks instead of recomputing them.
3. Quantized model support: loads 4-bit quantized checkpoints and dispatches to kernels that use NVFP4 hardware acceleration on supported GPUs.

As the flagship consumer-grade graphics card of the Blackwell architecture, the RTX 5090 provides:

- 32GB of GDDR7 memory: enough for the roughly 17-18GB of 4-bit weights of a 35B model (before block scales and any layers kept at higher precision) while leaving room for the KV cache.
- NVFP4 acceleration: the Tensor Cores natively support 4-bit floating-point operations.
- High memory bandwidth: model weights are read quickly, which matters because token generation is largely limited by memory bandwidth.
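The memory figures above can be checked with a short back-of-the-envelope calculation. The sketch below assumes one 8-bit scale per block of 16 weights, treats 32GB as 32e9 bytes, and uses a guessed 2GB allowance for activations and runtime overhead. It also assumes every parameter is quantized, which real checkpoints usually do not do, so the true footprint is likely somewhat higher.

```python
GB = 1e9  # decimal gigabytes, matching the 17-18GB figure in the text


def weight_bytes(num_params, bits_per_weight, block_size=None, scale_bits=0):
    """Bytes for the weights, plus one scale per block if block_size is set."""
    bits = num_params * bits_per_weight
    if block_size:
        bits += num_params / block_size * scale_bits
    return bits / 8


def kv_cache_budget(vram_bytes, gpu_memory_utilization, weights, overhead):
    """Bytes left for the KV cache inside the share of VRAM the engine may use."""
    return vram_bytes * gpu_memory_utilization - weights - overhead


params = 35e9
raw = weight_bytes(params, 4)                 # 4-bit payload only
with_scales = weight_bytes(params, 4, 16, 8)  # plus one 8-bit scale per 16 weights
# The 2GB overhead is an illustrative guess, not a measured value.
budget = kv_cache_budget(32 * GB, 0.90, with_scales, overhead=2 * GB)

print(f"4-bit payload:     {raw / GB:.1f} GB")          # 17.5 GB
print(f"with block scales: {with_scales / GB:.1f} GB")  # 19.7 GB
print(f"left for KV cache: {budget / GB:.1f} GB")       # 7.1 GB
```

Under these assumptions the scales add about 0.5 bits per weight, so the effective cost is closer to 4.5 bits per parameter, and several gigabytes still remain for the KV cache.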

## [Deployment Practice] Environment and Optimization Points for Qwen3.5-35B MoE Model Deployment

Environment preparation requires:

1. CUDA 12.8+ (the Blackwell architecture needs a recent toolchain).
2. vLLM 0.11+ (supports Blackwell and NVFP4).
3. Weights in NVFP4 format, produced with a tool that supports NVFP4 export, such as NVIDIA TensorRT Model Optimizer or LLM Compressor. GPTQ-based tools such as AutoGPTQ produce integer formats, not NVFP4.

Performance optimization points:

- Set a reasonable maximum sequence length (e.g., 8K/16K/32K); longer limits require more KV cache.
- Balance batch size against latency and memory.
- Configure vLLM's `gpu_memory_utilization` parameter so that enough memory remains for the KV cache.

Extending context length: longer context windows, which matter for tasks such as document analysis and code understanding, can be supported through Rotary Position Embedding (RoPE) scaling combined with vLLM's memory optimizations, at the cost of a larger KV cache.
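The tuning knobs above map onto vLLM engine arguments roughly as follows. This is a hedged sketch rather than a tested deployment recipe: the model path is a hypothetical placeholder, the numeric values are starting points to adjust, and it relies on vLLM detecting the quantization scheme from the checkpoint's own configuration.

```python
def build_engine_args(model_path, max_model_len=16384,
                      gpu_memory_utilization=0.90, max_num_seqs=8):
    """Collect the vLLM engine arguments discussed above in one place."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return {
        "model": model_path,
        "max_model_len": max_model_len,          # maximum sequence length
        "gpu_memory_utilization": gpu_memory_utilization,
        "max_num_seqs": max_num_seqs,            # cap on concurrent sequences
    }


def main():
    # Imported lazily so the helper above works without vLLM installed.
    from vllm import LLM, SamplingParams

    # Hypothetical local path to an NVFP4 checkpoint; replace with your own.
    args = build_engine_args("/models/qwen3.5-35b-moe-nvfp4")
    llm = LLM(**args)
    sampling = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain NVFP4 in one sentence."], sampling)
    print(outputs[0].outputs[0].text)
```

Calling `main()` requires a machine with vLLM and a supported GPU. If the engine reports that the KV cache is too small for the requested context, lowering `max_model_len` or `max_num_seqs` is usually the first thing to try.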

## [Applications and Significance] Application Scenarios and Value of Single-Card Large Model Deployment

Application scenarios include:

1. Localized AI assistant: running locally protects privacy and reduces latency.
2. Development and testing environment: personal workstations can iterate and debug without expensive servers.
3. Edge inference nodes: a single card simplifies system architecture in edge scenarios.
4. Model evaluation and fine-tuning experiments: a lower hardware threshold for running large models makes it easier to evaluate them and prototype new applications.

## [Conclusion and Outlook] Development Trends of AI Inference Hardware and Quantization Technology

NVFP4 marks a new stage in AI inference hardware:

1. 4-bit quantization moves from barely usable to production-ready precision.
2. The gap between consumer-grade and professional-grade hardware narrows.
3. Model architecture and hardware are increasingly co-designed, with MoE and quantization reinforcing each other in a virtuous cycle.

In the future, model distillation, quantization, and hardware acceleration will progress together, and models with tens of billions of parameters will run on a wider range of devices, promoting the democratization of AI.
