Section 01
Introduction: A Breakthrough in Running Large Models on Consumer GPUs: Efficient Deployment of Qwen3.5-35B MoE on the RTX 5090
Using NVFP4 quantization and vLLM inference-engine optimizations, this project runs the Qwen3.5-35B-A3B MoE model efficiently on a single NVIDIA RTX 5090. The deployment supports a 256K-token context window and a generation speed of 200 tokens per second, offering a practical reference for local deployment of large models on consumer hardware.
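As a rough sketch of what such a deployment can look like (the checkpoint ID and tuning values below are illustrative assumptions, not taken from this project), vLLM can serve a pre-quantized checkpoint directly, with the long context window and memory budget set via standard engine arguments:

```shell
# Hypothetical NVFP4-quantized checkpoint ID (illustrative only).
# vLLM reads the quantization scheme from the checkpoint's own config,
# so a pre-quantized model typically needs no explicit quantization flag.
# --max-model-len 262144 requests the 256K-token context window;
# --gpu-memory-utilization 0.95 caps KV-cache and weight memory to leave
# a little headroom on the card.
vllm serve some-org/Qwen3.5-35B-A3B-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95
```

The exact memory-utilization value usually needs tuning per GPU; too high a setting can cause out-of-memory errors at long context lengths.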