Zing Forum

Chitu: A Production-Grade Large Model Inference Engine Open-Sourced by Tsinghua Team, Fully Supporting Domestic Chips

The Chitu inference framework, open-sourced by Tsinghua University's PACMAN Lab, not only supports the full range of NVIDIA GPUs but also deeply adapts to domestic chips such as Huawei Ascend, Moore Threads, Muxi, and Hygon, enabling full-scenario deployment from single-card to cluster.

Tags: Chitu (赤兔), large model inference, Tsinghua PACMAN, domestic chips, Ascend, Moore Threads, Muxi, DeepSeek, Qwen
Published 2026-04-01 12:14 · Last activity 2026-04-01 12:17 · Estimated read: 7 min
Section 01

Introduction: Tsinghua Open-Sources Chitu Inference Engine, Fully Supporting Domestic Chips and Full-Scenario Deployment

The Chitu (赤兔) inference framework, open-sourced by Tsinghua University's PACMAN Lab, is positioned as a production-grade large model inference engine that combines high performance with stability. Its core advantages: support for the full range of NVIDIA GPUs as well as domestic chips such as Huawei Ascend, Moore Threads, Muxi, and Hygon; full-scenario deployment from pure-CPU and single-GPU setups to large-scale clusters; compatibility with mainstream large models such as DeepSeek, Qwen, and GLM; and technical highlights including FP4/FP8 quantization and CPU+GPU heterogeneous hybrid inference, allowing it to handle real concurrent business traffic.

Section 02

Project Background and Positioning

The Chinese name Chitu ('赤兔') connotes speed and power. Its design goal is an efficient, flexible, high-performance inference framework that is practical to use. Unlike engines optimized for a single hardware platform, it was designed from the outset around the progressive needs of enterprise AI adoption, offering solutions that scale from laboratory experiments to large-scale production. It is explicitly positioned as 'production-grade': it pursues not only peak performance but also long-term operational stability and reliability, so that it can handle real concurrent business traffic.

Section 03

Multi-Computing Power Adaptation: Deep Support for Domestic Chips

Chitu's comprehensive support for multi-computing power is one of its core features:

  • Full NVIDIA range: covers products from the Blackwell architecture back to older series;
  • Huawei Ascend: v0.3.5 supports native deployment on Ascend 910B, and v0.3.9 was the first to bring GLM-4.5 MoE model inference to Ascend;
  • Moore Threads: adaptation completed in v0.5.1;
  • Muxi and Hygon: performance and stability improved in v0.4.0.

This breadth lets enterprises choose computing platforms flexibly and avoid single-vendor lock-in.

Section 04

Full-Scenario Scalable Deployment Solutions

Chitu supports full-scenario deployment:

  • Pure-CPU deployment: lowers the hardware threshold, suited to lightweight inference scenarios;
  • Single-GPU deployment: through CPU+GPU heterogeneous hybrid inference (introduced in v0.2.2), a single card can run the DeepSeek-R1 671B super-large model; v0.3.0 added operators that convert FP4 online to FP8/BF16, supporting the FP4-quantized version of that model;
  • Large-scale cluster deployment: v0.5.0 improved cluster performance to meet enterprises' high-concurrency needs.
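To see why FP4 quantization plus CPU+GPU heterogeneous inference is what makes single-GPU deployment of a 671B-parameter model plausible, a back-of-the-envelope memory estimate helps. The parameter count comes from the article; the bytes-per-parameter figures are the standard sizes of those formats, and the calculation ignores KV cache, activations, and quantization metadata, so it is illustrative only:

```python
# Rough weight-memory footprint of a 671B-parameter model at
# different precisions (weights only; ignores KV cache, activations,
# and quantization scale metadata -- a rough illustration).
PARAMS = 671e9  # DeepSeek-R1 parameter count, per the article

BYTES_PER_PARAM = {
    "BF16": 2.0,   # 16-bit brain float
    "FP8": 1.0,    # 8-bit float
    "FP4": 0.5,    # 4-bit float
}

def weight_gib(precision: str) -> float:
    """Weight memory in GiB for the given precision."""
    return PARAMS * BYTES_PER_PARAM[precision] / 2**30

for p in BYTES_PER_PARAM:
    print(f"{p}: ~{weight_gib(p):.0f} GiB of weights")
# BF16 ~1250 GiB, FP8 ~625 GiB, FP4 ~312 GiB
```

Even at FP4, roughly 312 GiB of weights far exceeds any single GPU's memory, which is presumably why Chitu pairs quantization with heterogeneous inference: the portion that does not fit on the GPU stays in host memory and runs on, or is streamed from, the CPU.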
Section 05

Model Ecosystem and Core Technical Highlights

Model ecosystem: Chitu supports mainstream large models such as DeepSeek, Qwen, GLM, and Kimi. v0.3.5 provides a high-performance solution for the Qwen3 series, and v0.3.9 was the first to bring GLM-4.5 MoE deployment to Ascend.

Technical highlights:

  1. Quantization Support: v0.1.0 supports FP8 to BF16 conversion; v0.3.0 added FP4 to FP8/BF16 conversion, reducing memory and computing overhead;
  2. Heterogeneous Hybrid Inference: Intelligently distributes CPU/GPU tasks, enabling single-card operation of super-large models;
  3. Production-Grade Stability: Emphasizes long-term stable operation and adapts to real business scenarios.
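The FP4-to-FP8/BF16 online conversion mentioned above can be pictured as a scaled table lookup: each 4-bit code indexes one of 16 representable values, and a per-block scale restores the original magnitude. The sketch below illustrates that general idea; the E2M1 codebook and block size of 32 are common conventions for 4-bit floats, not Chitu's actual kernel, and the helper `dequantize_fp4` is hypothetical:

```python
import numpy as np

# The 16 representable values of an E2M1-style 4-bit float
# (1 sign, 2 exponent, 1 mantissa bit) -- an illustrative codebook.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
                dtype=np.float32)

def dequantize_fp4(codes: np.ndarray, scales: np.ndarray,
                   block: int = 32) -> np.ndarray:
    """Expand 4-bit codes to float32 using one scale per block.

    codes:  uint8 array of 4-bit indices, one code per weight
    scales: float32 array, one scale per `block` consecutive codes
    """
    vals = E2M1[codes]                       # codebook lookup
    vals = vals.reshape(-1, block)           # group into blocks
    return (vals * scales[:, None]).ravel()  # apply per-block scale

# Example: 64 codes in 2 blocks of 32, each block with its own scale
codes = np.random.randint(0, 16, size=64).astype(np.uint8)
scales = np.array([0.1, 0.25], dtype=np.float32)
weights = dequantize_fp4(codes, scales)
```

In a real engine this expansion runs inside fused GPU operators at matmul time, so the weights are stored at 4 bits but computed in FP8/BF16, which is what cuts memory and compute overhead without retraining.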
Section 06

Rapid Deployment and Open-Source Ecosystem

Rapid deployment: multi-platform Docker images are provided for NVIDIA (arch 8.0/8.9/9.0), Muxi, Ascend (A2/A3), and others, lowering the entry threshold.

Open-source ecosystem: Chitu is released under the Apache License 2.0, with code hosted on GitHub. The team actively draws on projects such as DeepSeek and FlashAttention, and welcomes community contributions, providing detailed contribution guidelines.

Section 07

Application Value and Future Outlook

Chitu's value to enterprises:

  • Application value: adaptation to domestic chips carries strategic significance, and production-grade stability reduces technical risk;
  • Outlook: as large-model application scenarios expand, the inference engine grows in importance, and Chitu is well placed to play a key role in the domestic ecosystem;
  • Suggestion: teams that need to reduce inference costs, improve performance, or deploy large models on domestic chips should evaluate Chitu.