Zing Forum

Chimere: A Rust Inference Engine for Running 35-Billion-Parameter MoE Models on Consumer GPUs

Chimere is a Rust inference runtime designed for local deployment of hybrid SSM + MoE architectures. It runs the Qwen3.5-35B-A3B model on a single 16 GB consumer GPU at 94 tokens per second, with no need for an H100 or a multi-GPU setup.

Tags: Rust · MoE · LLM inference · local deployment · Qwen3.5 · CUDA · Blackwell · quantization · consumer GPUs
Published 2026-04-24 18:13 · Recent activity 2026-04-24 18:19 · Estimated read: 7 min

Section 01

Core Guide to the Chimere Project

Chimere is an inference runtime written entirely in Rust and optimized for hybrid State Space Model (SSM) + Mixture of Experts (MoE) architectures. Its core breakthrough: it runs the 35-billion-parameter Qwen3.5-35B-A3B model smoothly on a single consumer GPU with 16 GB of VRAM (e.g., an RTX 5060 Ti) at roughly 94 tokens per second, with no need for high-end data center GPUs. The project exposes OpenAI-compatible APIs, balancing performance, deployment convenience, and data privacy.

Section 02

Project Background and Core Positioning

Large language model inference has long faced a core pain point: large-parameter models are hard to run on limited hardware. Chimere's goal is to break that barrier: it makes Qwen3.5-35B-A3B (35 billion parameters, Gated DeltaNet + MoE architecture) run efficiently on consumer GPUs, so ordinary developers and users can access large-model inference without relying on high-end hardware such as the H100.

Section 03

Technical Architecture and Core Optimizations

  1. Tech Stack Foundation: Built on a deeply customized fork of ik_llama.cpp (adding Mamba-2/Nemotron-H architecture support, submitted upstream as a PR), implemented end-to-end in Rust, compiled to a single binary, and serving OpenAI-compatible HTTP endpoints via the axum framework.
  2. Multi-Architecture Scheduling: Requests are routed automatically via the AppStateModel enum; adding a new architecture only requires extending the enum and its loaders.
  3. Engram Memory System: An n-gram log-bias mechanism with four pre-built domain tables (kine/code/cyber/general), indexed via an FNV-1a hash behind a Cuckoo filter, enabling token-level personalization.
  4. CUDA and Quantization Optimizations: Native support for the NVIDIA Blackwell architecture (sm_120) plus TurboQuant-style K-cache optimization (Hadamard-rotated keys with Q8_0/Q4_0 KV quantization), improving throughput by 8% with almost no quality loss.
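
The Engram indexing described above can be sketched in Rust. Only the FNV-1a hash and the n-gram-to-log-bias idea come from the description; the `EngramTable` type, the hashing of raw token ids, and all names are illustrative assumptions (the real implementation also fronts lookups with a Cuckoo filter, omitted here):

```rust
use std::collections::HashMap;

/// 64-bit FNV-1a hash over a token-id n-gram (standard FNV constants).
fn fnv1a_ngram(tokens: &[u32]) -> u64 {
    const OFFSET: u64 = 0xcbf29ce484222325;
    const PRIME: u64 = 0x100000001b3;
    let mut h = OFFSET;
    for t in tokens {
        for b in t.to_le_bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(PRIME);
        }
    }
    h
}

/// Hypothetical domain table mapping hashed n-grams to log-bias values
/// that would be added to the model's logits for matching continuations.
struct EngramTable {
    bias: HashMap<u64, f32>,
}

impl EngramTable {
    fn new() -> Self {
        Self { bias: HashMap::new() }
    }
    fn insert(&mut self, ngram: &[u32], log_bias: f32) {
        self.bias.insert(fnv1a_ngram(ngram), log_bias);
    }
    /// Bias for the current n-gram context; 0.0 if the n-gram is unseen.
    fn lookup(&self, ngram: &[u32]) -> f32 {
        self.bias.get(&fnv1a_ngram(ngram)).copied().unwrap_or(0.0)
    }
}

fn main() {
    let mut code_table = EngramTable::new();
    code_table.insert(&[101, 202, 303], 1.5); // boost a code-domain trigram
    println!("{}", code_table.lookup(&[101, 202, 303]));
    println!("{}", code_table.lookup(&[1, 2, 3]));
}
```

Hashing fixed-width n-grams to a u64 key keeps each domain table a flat map, which is what makes per-token lookups cheap enough to run inside the sampling loop.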

Section 04

Performance Benchmarks and Real-World Performance

According to the project's own benchmarks on an RTX 5060 Ti:

  • Qwen3.5-35B-A3B (custom IQK quantization): ~80 tokens/sec generation at 64K context, 789 tokens/sec prefill, 80 ms first-token latency, 15.3 GB VRAM usage;
  • Nemotron-3-Nano-30B-A3B (Q4_0 quantization): ~45 tokens/sec generation.

These figures show that consumer hardware can deliver a response experience close to that of cloud APIs.
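
The benchmark numbers translate into end-to-end latency with simple arithmetic. This is a back-of-envelope sketch using the reported 64K-context figures, not a measurement from the Chimere codebase:

```rust
/// Rough end-to-end latency: prompt processing at the prefill rate
/// plus response generation at the decode rate.
fn estimate_latency_secs(prompt_tokens: f64, output_tokens: f64,
                         prefill_tps: f64, gen_tps: f64) -> f64 {
    prompt_tokens / prefill_tps + output_tokens / gen_tps
}

fn main() {
    // e.g. a 1,000-token prompt with a 200-token reply,
    // at 789 tok/s prefill and ~80 tok/s generation:
    let t = estimate_latency_secs(1000.0, 200.0, 789.0, 80.0);
    println!("{t:.2} s"); // roughly 3.77 s end to end
}
```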

Section 05

Multi-Model Support and Deployment Ecosystem

  1. Multi-Model Compatibility: Beyond the Qwen3.5 series, support is verified for hybrid-architecture models such as Nemotron-3-Nano-30B-A3B, with plans to expand to Granite 4.0, Falcon-H1, and others.
  2. Deployment Process: Clone and build the ik_llama.cpp backend and chimere-server (requires CUDA 12.8+ and Rust 1.80+); configure parameters such as the model path via environment variables; once started, the server exposes OpenAI-compatible APIs (streaming chat, tool calls, etc.).
  3. Ecosystem: Part of the AIdevsmartdata ecosystem, with supporting projects including chimere-odo (Python orchestrator), chimere-studio (Tauri UI), and ramp-quant (quantization pipeline).
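
Environment-variable configuration as described in step 2 might look like the following sketch. The variable names `CHIMERE_MODEL_PATH` and `CHIMERE_PORT`, and the default values, are hypothetical placeholders, not documented chimere-server settings:

```rust
use std::env;

struct ServerConfig {
    model_path: String,
    port: u16,
}

/// Build a config from optional raw values, falling back to defaults
/// when a variable is unset or unparsable.
fn config_from(model: Option<String>, port: Option<String>) -> ServerConfig {
    ServerConfig {
        model_path: model
            .unwrap_or_else(|| "models/qwen3.5-35b-a3b.gguf".into()),
        port: port.and_then(|p| p.parse().ok()).unwrap_or(8080),
    }
}

fn main() {
    // Hypothetical variable names for illustration only.
    let cfg = config_from(env::var("CHIMERE_MODEL_PATH").ok(),
                          env::var("CHIMERE_PORT").ok());
    println!("serving {} on port {}", cfg.model_path, cfg.port);
}
```

Keeping the fallback logic in a pure function (rather than reading `env::var` inline) makes the defaults easy to test without mutating process state.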

Section 06

Conclusion and Future Outlook

Through system-level optimization (Rust performance, CUDA kernels, quantization strategies), Chimere demonstrates that consumer hardware can handle large-model inference, advancing AI democratization and edge computing. It will continue to expand model support and is positioned to become one of the preferred runtimes for local LLM deployment, offering a reliable option for data-privacy-sensitive scenarios.