Zing Forum

Chimere: A Rust Inference Engine for Running 35-Billion-Parameter MoE Models on Consumer GPUs

Chimere is a Rust inference runtime designed for local deployment of hybrid SSM + MoE architectures. It runs the Qwen3.5-35B-A3B model on a single 16 GB consumer GPU at 94 tokens per second, with no need for an H100 or a multi-GPU setup.

Tags: Rust · MoE · LLM inference · local deployment · Qwen3.5 · CUDA · Blackwell · quantization · consumer GPUs
Published 2026-04-24 18:13 · Recent activity 2026-04-24 18:19 · Estimated read: 7 min

Section 01

Core Guide to the Chimere Project

Chimere is an inference runtime written entirely in Rust and optimized for hybrid State Space Model (SSM) + Mixture of Experts (MoE) architectures. Its core breakthrough: it runs the 35-billion-parameter Qwen3.5-35B-A3B model smoothly on a single consumer GPU with 16 GB of VRAM (e.g., an RTX 5060 Ti) at roughly 94 tokens per second, with no need for high-end data center GPUs. The project exposes OpenAI-compatible APIs, balancing performance, deployment convenience, and data privacy.

Section 02

Project Background and Core Positioning

Large language model inference has long faced a core pain point: large-parameter models are hard to run on limited hardware. Chimere's goal is to break that barrier: it makes Qwen3.5-35B-A3B (35 billion parameters, Gated DeltaNet + MoE architecture) run efficiently on consumer GPUs, so ordinary developers and users can access large-model inference without relying on high-end hardware such as the H100.

Section 03

Technical Architecture and Core Optimizations

  1. Tech Stack Foundation: Built on a deeply customized fork of ik_llama.cpp (adding Mamba-2/Nemotron-H architecture support, submitted upstream as a PR), implemented end-to-end in Rust, compiled to a single binary, and serving OpenAI-compatible HTTP endpoints via the axum framework.
  2. Multi-Architecture Scheduling: Requests are routed automatically via the AppStateModel enum; adding a new architecture only requires extending the enum and its loaders.
  3. Engram Memory System: An n-gram log-bias mechanism with four pre-built domain tables (kine/code/cyber/general), indexed via an FNV-1a hash behind a Cuckoo filter, enabling token-level personalization.
  4. CUDA and Quantization Optimizations: Native support for the NVIDIA Blackwell architecture (sm_120) plus TurboQuant-style K-cache optimization (Hadamard-rotated keys with Q8_0/Q4_0 KV quantization), improving throughput by 8% with almost no quality loss.
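
The Engram indexing described above can be sketched in Rust. Only the FNV-1a hash and the n-gram-to-log-bias idea come from the description; the `EngramTable` type, the hashing of raw token ids, and all names are illustrative assumptions (the real implementation also fronts lookups with a Cuckoo filter, omitted here):

```rust
use std::collections::HashMap;

/// 64-bit FNV-1a hash over a token-id n-gram (standard FNV constants).
fn fnv1a_ngram(tokens: &[u32]) -> u64 {
    const OFFSET: u64 = 0xcbf29ce484222325;
    const PRIME: u64 = 0x100000001b3;
    let mut h = OFFSET;
    for t in tokens {
        for b in t.to_le_bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(PRIME);
        }
    }
    h
}

/// Hypothetical domain table mapping hashed n-grams to log-bias values
/// that would be added to the model's logits for matching continuations.
struct EngramTable {
    bias: HashMap<u64, f32>,
}

impl EngramTable {
    fn new() -> Self {
        Self { bias: HashMap::new() }
    }
    fn insert(&mut self, ngram: &[u32], log_bias: f32) {
        self.bias.insert(fnv1a_ngram(ngram), log_bias);
    }
    /// Bias for the current n-gram context; 0.0 if the n-gram is unseen.
    fn lookup(&self, ngram: &[u32]) -> f32 {
        self.bias.get(&fnv1a_ngram(ngram)).copied().unwrap_or(0.0)
    }
}

fn main() {
    let mut code_table = EngramTable::new();
    code_table.insert(&[101, 202, 303], 1.5); // boost a code-domain trigram
    println!("{}", code_table.lookup(&[101, 202, 303]));
    println!("{}", code_table.lookup(&[1, 2, 3]));
}
```

Hashing fixed-width n-grams to a u64 key keeps each domain table a flat map, which is what makes per-token lookups cheap enough to run inside the sampling loop.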

Section 04

Performance Benchmarks and Real-World Performance

According to the project's own benchmarks on an RTX 5060 Ti:

  • Qwen3.5-35B-A3B (custom IQK quantization): ~80 tokens/sec generation at 64K context, 789 tokens/sec prefill, 80 ms first-token latency, 15.3 GB VRAM usage;
  • Nemotron-3-Nano-30B-A3B (Q4_0 quantization): ~45 tokens/sec generation.

These figures show that consumer hardware can deliver a response experience close to that of cloud APIs.
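
The benchmark numbers translate into end-to-end latency with simple arithmetic. This is a back-of-envelope sketch using the reported 64K-context figures, not a measurement from the Chimere codebase:

```rust
/// Rough end-to-end latency: prompt processing at the prefill rate
/// plus response generation at the decode rate.
fn estimate_latency_secs(prompt_tokens: f64, output_tokens: f64,
                         prefill_tps: f64, gen_tps: f64) -> f64 {
    prompt_tokens / prefill_tps + output_tokens / gen_tps
}

fn main() {
    // e.g. a 1,000-token prompt with a 200-token reply,
    // at 789 tok/s prefill and ~80 tok/s generation:
    let t = estimate_latency_secs(1000.0, 200.0, 789.0, 80.0);
    println!("{t:.2} s"); // roughly 3.77 s end to end
}
```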

Section 05

Multi-Model Support and Deployment Ecosystem

  1. Multi-Model Compatibility: Beyond the Qwen3.5 series, support is verified for hybrid-architecture models such as Nemotron-3-Nano-30B-A3B, with plans to expand to Granite 4.0, Falcon-H1, and others.
  2. Deployment Process: Clone and build the ik_llama.cpp backend and chimere-server (requires CUDA 12.8+ and Rust 1.80+); configure parameters such as the model path via environment variables; once started, the server exposes OpenAI-compatible APIs (streaming chat, tool calls, etc.).
  3. Ecosystem: Part of the AIdevsmartdata ecosystem, with supporting projects including chimere-odo (Python orchestrator), chimere-studio (Tauri UI), and ramp-quant (quantization pipeline).
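
Environment-variable configuration as described in step 2 might look like the following sketch. The variable names `CHIMERE_MODEL_PATH` and `CHIMERE_PORT`, and the default values, are hypothetical placeholders, not documented chimere-server settings:

```rust
use std::env;

struct ServerConfig {
    model_path: String,
    port: u16,
}

/// Build a config from optional raw values, falling back to defaults
/// when a variable is unset or unparsable.
fn config_from(model: Option<String>, port: Option<String>) -> ServerConfig {
    ServerConfig {
        model_path: model
            .unwrap_or_else(|| "models/qwen3.5-35b-a3b.gguf".into()),
        port: port.and_then(|p| p.parse().ok()).unwrap_or(8080),
    }
}

fn main() {
    // Hypothetical variable names for illustration only.
    let cfg = config_from(env::var("CHIMERE_MODEL_PATH").ok(),
                          env::var("CHIMERE_PORT").ok());
    println!("serving {} on port {}", cfg.model_path, cfg.port);
}
```

Keeping the fallback logic in a pure function (rather than reading `env::var` inline) makes the defaults easy to test without mutating process state.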

Section 06

Conclusion and Future Outlook

Through system-level optimization (Rust performance, CUDA kernels, quantization strategies), Chimere demonstrates that consumer hardware can handle large-model inference, advancing AI democratization and edge computing. It will continue to expand model support and is positioned to become one of the preferred runtimes for local LLM deployment, offering a reliable option for data-privacy-sensitive scenarios.