Section 01
Chimere: Guide to the Rust Inference Engine for Running 35-Billion-Parameter MoE Models on Consumer GPUs
Core Guide to the Chimere Project
Chimere is an inference runtime written entirely in Rust, optimized for hybrid State Space Model (SSM) and Mixture of Experts (MoE) architectures. Its core achievement: it runs the 35-billion-parameter Qwen3.5-35B-A3B model smoothly on a single consumer GPU with 16 GB of VRAM (e.g., an RTX 5060 Ti) at roughly 94 tokens per second, with no need for high-end data-center GPUs. The project exposes an OpenAI-compatible API, balancing performance, ease of deployment, and data-privacy requirements.
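Because the server speaks the OpenAI wire format, any standard client can talk to it. The sketch below shows, in Rust, what a request body for a hypothetical local Chimere endpoint might look like; the port, endpoint path, and model name are assumptions, and the JSON is assembled by hand only to keep the example dependency-free (a real client would use a crate such as serde_json).

```rust
/// Build a minimal JSON body for an OpenAI-compatible
/// /v1/chat/completions request. Hand-assembled JSON is a sketch;
/// it does not escape special characters in `prompt`.
fn chat_request_body(model: &str, prompt: &str) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}],"stream":false}}"#,
        model, prompt
    )
}

fn main() {
    // Model name taken from the text; endpoint below is an assumption.
    let body = chat_request_body("Qwen3.5-35B-A3B", "Hello");
    println!("{}", body);
    // Send with e.g.:
    //   curl -X POST http://localhost:8080/v1/chat/completions \
    //        -H "Content-Type: application/json" -d "$BODY"
}
```

Keeping to the OpenAI schema means existing SDKs and tools work unchanged: point them at the local base URL instead of api.openai.com and requests stay on your machine, which is what makes the data-privacy claim practical.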