Running Qwen3.5-35B MoE on RTX 5090: Practice of NVFP4 Quantization and vLLM Optimization

This project demonstrates how to efficiently run the Qwen3.5-35B-A3B MoE large model on a single RTX 5090 graphics card using NVFP4 quantization technology, supporting a 256K context window and a generation speed of 200 tokens per second.

Tags: Qwen3.5, MoE, NVFP4 Quantization, RTX 5090, vLLM, LLM Inference, Long Context, Model Quantization
Published 2026-04-06 06:12 · Last activity 2026-04-06 06:19 · Estimated read: 5 min

Section 01

Introduction: A Breakthrough in Running Large Models on Consumer GPUs—Efficient Deployment of Qwen3.5-35B MoE on RTX 5090

By combining NVFP4 quantization with vLLM inference-engine optimizations, this project runs the Qwen3.5-35B-A3B MoE model efficiently on a single NVIDIA RTX 5090. The setup supports a 256K context window at a generation speed of about 200 tokens per second, offering a practical reference for deploying large models locally on consumer hardware.
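A deployment along these lines could be launched through vLLM's Python API roughly as sketched below. The checkpoint name, memory fraction, and the assumption that the weights are already NVFP4-quantized are all illustrative, not taken from the project itself:

```python
# Sketch of a vLLM engine configuration matching the setup described above.
# Assumes a pre-quantized NVFP4 checkpoint (the path is hypothetical) and a
# vLLM build with Blackwell/NVFP4 support; this is a configuration sketch,
# not the project's actual launch script.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3.5-35B-A3B-NVFP4",  # hypothetical local checkpoint name
    kv_cache_dtype="fp8",           # FP8 KV cache, as the article describes
    max_model_len=262144,           # 256K context window
    gpu_memory_utilization=0.92,    # leave some headroom on the 32 GB card
)

outputs = llm.generate(
    ["Summarize the idea of mixture-of-experts models in one sentence."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```

For serving over HTTP rather than in-process, the same knobs exist as command-line flags on `vllm serve` (e.g. `--kv-cache-dtype` and `--max-model-len`).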


Section 02

Background: Challenges of Running Large Models on Consumer Hardware and Advantages of MoE Architecture

As the parameter counts of large language models keep growing, running them efficiently on consumer hardware has become a key challenge. Qwen3.5-35B-A3B uses a Mixture of Experts (MoE) architecture: 35 billion total parameters, of which only about 3 billion are activated for each token generated. This preserves the knowledge capacity of the full model while keeping per-token compute close to that of a much smaller dense model, laying the groundwork for consumer-grade deployment.
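The compute saving implied by those figures can be checked in a few lines. The 35B-total / 3B-active numbers come from the article; the "forward pass costs about 2 FLOPs per active parameter per token" rule is a standard rough approximation, not a measurement of this model:

```python
# Rough arithmetic for the MoE compute saving described above.
# 35B total / 3B active are the article's figures; 2 * N FLOPs per token
# for a forward pass is a common back-of-the-envelope approximation.
total_params = 35e9
active_params = 3e9

active_fraction = active_params / total_params
flops_per_token_dense = 2 * total_params  # if every parameter ran per token
flops_per_token_moe = 2 * active_params   # only the routed experts run

print(f"active fraction:   {active_fraction:.1%}")   # ~8.6%
print(f"compute reduction: {flops_per_token_dense / flops_per_token_moe:.1f}x")
```

In other words, each generated token costs roughly a twelfth of what a dense 35B model would, while the full 35B parameters remain available as stored knowledge.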


Section 03

Key Technical Methods: NVFP4 Quantization and vLLM Inference Optimization

  1. NVFP4 Quantization: a 4-bit floating-point format introduced by NVIDIA. Relative to FP32 it cuts weight storage by a factor of 8, shrinking the Qwen3.5-35B weights from roughly 140 GB to about 17.5 GB, which fits within the RTX 5090's 32 GB of VRAM.
  2. vLLM Optimization: uses the PagedAttention algorithm (paged management of the KV cache) together with an FP8 KV cache to improve memory utilization and batching capability.
  3. Long Context Support: the combination of 4-bit weights, FP8 KV cache, and PagedAttention addresses the memory and compute pressure of a 256K context window.
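The weight-size figures above can be verified with simple arithmetic, and the same arithmetic shows why an FP8 KV cache matters at 256K context. The layer and head counts below are hypothetical placeholders for illustration, not Qwen3.5's real configuration:

```python
# Checks the weight-size figures quoted above and illustrates the FP8 KV
# cache saving at 256K context. Layer/head counts are HYPOTHETICAL
# placeholders, not Qwen3.5's actual architecture.
GB = 1e9
params = 35e9

fp32_weights = params * 4 / GB     # 4 bytes per parameter
nvfp4_weights = params * 0.5 / GB  # 4 bits per parameter (scale factors ignored)
print(f"FP32 weights:  {fp32_weights:.1f} GB")   # 140.0 GB
print(f"NVFP4 weights: {nvfp4_weights:.1f} GB")  # 17.5 GB

# Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim = 48, 8, 128  # hypothetical values
ctx = 256 * 1024
for name, bytes_per_elem in [("FP16", 2), ("FP8", 1)]:
    kv_gb = 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / GB
    print(f"{name} KV cache @ 256K tokens: {kv_gb:.1f} GB")
```

Halving KV-cache bytes per element halves the cache footprint, which is exactly the headroom a full 256K context needs on a 32 GB card once the 17.5 GB of weights are resident.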

Section 04

Hardware Configuration and Performance Evidence

Hardware requirements: an RTX 5090 (≥32 GB VRAM), an Intel Core i7 / AMD Ryzen 7 or better CPU, 32 GB+ system RAM, and at least 20 GB of free storage (the NVFP4 weights alone occupy about 17.5 GB). Performance data: a generation speed of roughly 200 tokens per second with support for a 256K context window.
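To put the throughput figure in concrete terms, it translates into per-token latency and wall time as follows; 200 tokens/s is the article's number, while the 2000-token answer length is an illustrative assumption:

```python
# Converts the quoted generation speed into per-token latency and wall time.
# 200 tok/s is the article's figure; the 2000-token answer is illustrative.
tokens_per_second = 200
latency_ms = 1000 / tokens_per_second
answer_tokens = 2000
wall_time_s = answer_tokens / tokens_per_second

print(f"per-token latency:   {latency_ms:.0f} ms")  # 5 ms
print(f"2000-token answer:   {wall_time_s:.0f} s")  # 10 s
```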


Section 05

Application Scenarios and Practical Value

Application scenarios include local AI assistants (offline operation, data-privacy protection), long-document processing (analyzing entire books or reports), and domain knowledge-base Q&A. Practical value: reduced reliance on cloud APIs, stronger data security, lower usage costs, and easier local experimentation for researchers.


Section 06

Optimization Suggestions and Future Outlook

Optimization suggestions: adjust the context length to the task (shorter contexts run faster), balance batch size (a throughput-versus-latency trade-off), and keep GPU drivers and CUDA versions up to date. Limitations: the setup depends on the RTX 5090's architecture, and quantization introduces some precision loss. Future outlook: advances in GPU architectures and quantization techniques will bring ever larger models to consumer hardware, furthering the democratization of AI.