Running Qwen3.5-35B MoE on RTX 5090: Practice of NVFP4 Quantization and vLLM Optimization

This project demonstrates how to efficiently run the Qwen3.5-35B-A3B MoE large model on a single RTX 5090 graphics card using NVFP4 quantization technology, supporting a 256K context window and a generation speed of 200 tokens per second.

Tags: Qwen3.5, MoE, NVFP4 Quantization, RTX 5090, vLLM, LLM Inference, Long Context, Model Quantization
Published 2026-04-06 06:12 · Last activity 2026-04-06 06:19 · Estimated read: 5 min

Section 01

Introduction: A Breakthrough in Running Large Models on Consumer GPUs—Efficient Deployment of Qwen3.5-35B MoE on RTX 5090

By combining NVFP4 quantization with vLLM inference-engine optimizations, this project runs the Qwen3.5-35B-A3B MoE model efficiently on a single NVIDIA RTX 5090. The setup supports a 256K context window at a generation speed of about 200 tokens per second, offering a practical reference for deploying large models locally on consumer hardware.
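A deployment along these lines could be launched through vLLM's Python API roughly as sketched below. The checkpoint name, memory fraction, and the assumption that the weights are already NVFP4-quantized are all illustrative, not taken from the project itself:

```python
# Sketch of a vLLM engine configuration matching the setup described above.
# Assumes a pre-quantized NVFP4 checkpoint (the path is hypothetical) and a
# vLLM build with Blackwell/NVFP4 support; this is a configuration sketch,
# not the project's actual launch script.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3.5-35B-A3B-NVFP4",  # hypothetical local checkpoint name
    kv_cache_dtype="fp8",           # FP8 KV cache, as the article describes
    max_model_len=262144,           # 256K context window
    gpu_memory_utilization=0.92,    # leave some headroom on the 32 GB card
)

outputs = llm.generate(
    ["Summarize the idea of mixture-of-experts models in one sentence."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```

For serving over HTTP rather than in-process, the same knobs exist as command-line flags on `vllm serve` (e.g. `--kv-cache-dtype` and `--max-model-len`).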


Section 02

Background: Challenges of Running Large Models on Consumer Hardware and Advantages of MoE Architecture

As the parameter counts of large language models keep growing, running them efficiently on consumer hardware has become a key challenge. Qwen3.5-35B-A3B uses a Mixture of Experts (MoE) architecture: 35 billion total parameters, of which only about 3 billion are activated for each token generated. This preserves the knowledge capacity of the full model while keeping per-token compute close to that of a much smaller dense model, laying the groundwork for consumer-grade deployment.
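The compute saving implied by those figures can be checked in a few lines. The 35B-total / 3B-active numbers come from the article; the "forward pass costs about 2 FLOPs per active parameter per token" rule is a standard rough approximation, not a measurement of this model:

```python
# Rough arithmetic for the MoE compute saving described above.
# 35B total / 3B active are the article's figures; 2 * N FLOPs per token
# for a forward pass is a common back-of-the-envelope approximation.
total_params = 35e9
active_params = 3e9

active_fraction = active_params / total_params
flops_per_token_dense = 2 * total_params  # if every parameter ran per token
flops_per_token_moe = 2 * active_params   # only the routed experts run

print(f"active fraction:   {active_fraction:.1%}")   # ~8.6%
print(f"compute reduction: {flops_per_token_dense / flops_per_token_moe:.1f}x")
```

In other words, each generated token costs roughly a twelfth of what a dense 35B model would, while the full 35B parameters remain available as stored knowledge.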


Section 03

Key Technical Methods: NVFP4 Quantization and vLLM Inference Optimization

  1. NVFP4 Quantization: a 4-bit floating-point format introduced by NVIDIA. Relative to FP32 it cuts weight storage by a factor of 8, shrinking the Qwen3.5-35B weights from roughly 140 GB to about 17.5 GB, which fits within the RTX 5090's 32 GB of VRAM.
  2. vLLM Optimization: uses the PagedAttention algorithm (paged management of the KV cache) together with an FP8 KV cache to improve memory utilization and batching capability.
  3. Long Context Support: the combination of 4-bit weights, FP8 KV cache, and PagedAttention addresses the memory and compute pressure of a 256K context window.
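The weight-size figures above can be verified with simple arithmetic, and the same arithmetic shows why an FP8 KV cache matters at 256K context. The layer and head counts below are hypothetical placeholders for illustration, not Qwen3.5's real configuration:

```python
# Checks the weight-size figures quoted above and illustrates the FP8 KV
# cache saving at 256K context. Layer/head counts are HYPOTHETICAL
# placeholders, not Qwen3.5's actual architecture.
GB = 1e9
params = 35e9

fp32_weights = params * 4 / GB     # 4 bytes per parameter
nvfp4_weights = params * 0.5 / GB  # 4 bits per parameter (scale factors ignored)
print(f"FP32 weights:  {fp32_weights:.1f} GB")   # 140.0 GB
print(f"NVFP4 weights: {nvfp4_weights:.1f} GB")  # 17.5 GB

# Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim = 48, 8, 128  # hypothetical values
ctx = 256 * 1024
for name, bytes_per_elem in [("FP16", 2), ("FP8", 1)]:
    kv_gb = 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / GB
    print(f"{name} KV cache @ 256K tokens: {kv_gb:.1f} GB")
```

Halving KV-cache bytes per element halves the cache footprint, which is exactly the headroom a full 256K context needs on a 32 GB card once the 17.5 GB of weights are resident.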

Section 04

Hardware Configuration and Performance Evidence

Hardware requirements: an RTX 5090 (≥32 GB VRAM), an Intel Core i7 / AMD Ryzen 7 or better CPU, 32 GB+ system RAM, and at least 20 GB of free storage (the NVFP4 weights alone occupy about 17.5 GB). Performance data: a generation speed of roughly 200 tokens per second with support for a 256K context window.
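To put the throughput figure in concrete terms, it translates into per-token latency and wall time as follows; 200 tokens/s is the article's number, while the 2000-token answer length is an illustrative assumption:

```python
# Converts the quoted generation speed into per-token latency and wall time.
# 200 tok/s is the article's figure; the 2000-token answer is illustrative.
tokens_per_second = 200
latency_ms = 1000 / tokens_per_second
answer_tokens = 2000
wall_time_s = answer_tokens / tokens_per_second

print(f"per-token latency:   {latency_ms:.0f} ms")  # 5 ms
print(f"2000-token answer:   {wall_time_s:.0f} s")  # 10 s
```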


Section 05

Application Scenarios and Practical Value

Application scenarios include local AI assistants (offline operation, data-privacy protection), long-document processing (analyzing entire books or reports), and domain knowledge-base Q&A. Practical value: reduced reliance on cloud APIs, stronger data security, lower usage costs, and easier local experimentation for researchers.


Section 06

Optimization Suggestions and Future Outlook

Optimization suggestions: adjust the context length to the task (shorter contexts run faster), balance batch size (a throughput-versus-latency trade-off), and keep GPU drivers and CUDA versions up to date. Limitations: the setup depends on the RTX 5090's architecture, and quantization introduces some precision loss. Future outlook: advances in GPU architectures and quantization techniques will bring ever larger models to consumer hardware, furthering the democratization of AI.