Zing Forum

Chitu: In-depth Analysis of a High-Performance Inference Framework for Large Models

This article introduces Chitu, an open-source large model inference framework developed by Tsinghua University's PACMAN Lab, and analyzes its technical innovations and architectural design in terms of efficiency, flexibility, and usability.

Tags: LLM Inference · Chitu · Transformer · Quantization · PagedAttention · High-Performance Computing
Published 2026-04-28 09:42 · Last activity 2026-04-28 10:00 · Estimated read: 9 min

Section 01

[Main Floor] Introduction

Chitu is an open-source large model inference framework developed by Tsinghua University's PACMAN Lab. It targets the core challenges of deploying large language models for inference: ultra-long context processing, massive memory usage, complex parallel strategies, and diverse quantization requirements. Its advantages lie in three dimensions: efficiency-first architectural design, a flexible and extensible modular structure, and a complete production-grade serving solution. Chitu is also deeply adapted to domestic hardware, making it a strong fit for enterprise private deployment and long document processing, and it stands among the leading projects in China's large model inference infrastructure.


Section 02

Project Background and R&D Motivation

As the parameter scale of large language models grows past 100 billion and toward one trillion, inference deployment has become a core challenge in AI engineering. Traditional serving frameworks (such as TensorFlow Serving and TorchServe) fall short of Transformer-specific needs: ultra-long context, heavy memory usage, complex parallel strategies, and diverse quantization requirements. Tsinghua University's PACMAN Lab developed the Chitu framework to address these pain points, pursuing extreme inference performance while emphasizing flexibility and usability in engineering practice.


Section 03

Core Technical Features and Architectural Design

Core Design Philosophy

  • Efficiency First: Optimize memory access, computation graphs, and parallel strategies for Transformer inference;
  • Flexible Expansion: Modular architecture supports multiple models (GPT, LLaMA, etc.), precisions (FP16/INT8/GPTQ, etc.), and hardware (NVIDIA/AMD/domestic chips);
  • Production-Grade Usability: Provides complete serving functions such as dynamic batching, streaming generation, and request scheduling.

Key Technical Features

  • Attention Calculation: Integrates FlashAttention (O(N) memory complexity instead of O(N²)), PagedAttention (block-based KV Cache management; see the sketch after this list), and MQA/GQA (fewer KV heads, reduced memory usage);
  • Quantization Support: Weight quantization (INT8/INT4/GPTQ/AWQ/SmoothQuant), activation quantization, and mixed precision;
  • Parallel Strategies: Tensor/pipeline/sequence/expert parallelism;
  • Inference Optimization: Speculative decoding (2-3x speedup), continuous batching (dynamic request management), and prefix reuse (KV Cache reuse).
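To make the block management idea concrete, here is a minimal sketch of a PagedAttention-style block table. It illustrates the general technique only, not Chitu's actual implementation; the block size and class names are hypothetical.

```python
# Minimal sketch of PagedAttention-style KV Cache block management.
# BLOCK_SIZE and all names are hypothetical, not Chitu's real API.
BLOCK_SIZE = 16  # tokens per physical cache block

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""

    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks   # shared pool of physical block IDs
        self.blocks: list[int] = []      # logical block index -> physical block ID

    def append_token(self, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) for the token at `position`."""
        if position // BLOCK_SIZE >= len(self.blocks):
            # Allocate a new physical block only when the previous one fills,
            # so memory grows in BLOCK_SIZE steps, not one big preallocation.
            self.blocks.append(self.free_blocks.pop())
        return self.blocks[position // BLOCK_SIZE], position % BLOCK_SIZE

# Two sequences share one physical pool; internal fragmentation is
# bounded by one partially filled block per sequence.
pool = list(range(1024))
seq_a, seq_b = BlockTable(pool), BlockTable(pool)
for pos in range(40):
    block, offset = seq_a.append_token(pos)
```

Because blocks are fixed-size and allocated on demand, sequences of very different lengths can share one physical pool without the large contiguous reservations that cause fragmentation in naive KV Cache layouts.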

Architecture and Memory Management

  • Layered Architecture: Computation layer (optimized operators), graph engine layer (computation graph scheduling), model layer (model definition), and service layer (API/scheduling);
  • Memory Management: Pre-allocation, memory reuse, and KV Cache offloading to CPU/SSD (a minimal offloading sketch follows).
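The offloading idea can be sketched with plain PyTorch: a cold block's K/V tensor is copied into pinned host memory and the device copy is freed, then restored on demand. This is a minimal sketch under assumed names (OffloadableBlock, etc.), not Chitu's actual offloading path, and it omits the SSD tier.

```python
import torch

# Minimal sketch of CPU offloading for one KV Cache block.
device = "cuda" if torch.cuda.is_available() else "cpu"

class OffloadableBlock:
    def __init__(self, shape):
        self.gpu = torch.zeros(shape, dtype=torch.float16, device=device)
        # Pinned host memory enables asynchronous H2D/D2H copies.
        self.cpu = torch.empty(shape, dtype=torch.float16,
                               pin_memory=(device == "cuda"))

    def offload(self):
        self.cpu.copy_(self.gpu, non_blocking=True)
        self.gpu = None  # release the device copy

    def fetch(self):
        if self.gpu is None:
            self.gpu = self.cpu.to(device, non_blocking=True)
        return self.gpu

block = OffloadableBlock((16, 8, 128))  # (tokens, kv_heads, head_dim)
block.offload()     # evict when the sequence goes idle
kv = block.fetch()  # restore before the next decode step
```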

Section 04

Performance Benchmarks and Framework Comparison

Performance Benchmarks

  • Throughput: Industry-leading throughput on the LLaMA2-70B model, with significant advantages in high-concurrency scenarios;
  • Latency: Optimizes TTFT (Time To First Token) and ITL (Inter-Token Latency), suitable for interactive applications;
  • Memory Efficiency: PagedAttention plus quantization supports longer context windows (see the sizing sketch below).
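A quick back-of-the-envelope calculation shows why KV Cache size dominates long-context memory and why quantizing it helps. The figures below assume LLaMA2-70B's published configuration (80 layers, GQA with 8 KV heads, head dimension 128); they are illustrative arithmetic, not Chitu benchmark output.

```python
# KV Cache sizing for LLaMA2-70B (80 layers, 8 KV heads via GQA, head_dim 128).
layers, kv_heads, head_dim = 80, 8, 128

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # K and V are each cached once per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

fp16 = kv_bytes_per_token(2)   # 327,680 B, about 320 KiB per token
int8 = kv_bytes_per_token(1)   # half of that
for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: FP16 {ctx * fp16 / 2**30:6.1f} GiB, "
          f"INT8 {ctx * int8 / 2**30:6.1f} GiB")
# 4K context already costs ~1.25 GiB per sequence in FP16; at 128K it is
# ~40 GiB, which is why paging plus quantized cache decides the usable window.
```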

Framework Comparison

| Feature | Chitu | vLLM | TensorRT-LLM | llama.cpp |
| --- | --- | --- | --- | --- |
| PagedAttention | ✓ | ✓ | ✓ | ✗ |
| Speculative Decoding | ✓ | ✓ | ✓ | ✓ |
| Domestic Chip Support | Deep | Partial | Partial | Partial |
| Open Source License | Apache 2.0 | Apache 2.0 | Commercial-friendly | MIT |
| Community Activity | Growing | High | Medium | High |

Chitu's unique advantages: Deep support for the domestic hardware ecosystem, and close integration of academic research and industrial practice.


Section 05

Application Scenarios and Ecosystem Community

Application Scenarios

  • Enterprise Private Deployment: Supports Hugging Face model loading, adapts to domestic GPUs, and meets information technology innovation (Xinchuang) requirements;
  • Long Document Processing: Sequence parallelism + offloading technology allows consumer-grade hardware to handle tens of thousands to hundreds of thousands of tokens;
  • High-Concurrency Services: Continuous batching + efficient scheduling to maximize hardware utilization and reduce costs (a minimal scheduling sketch follows this list).
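Here is a minimal sketch of the continuous batching idea referenced above: finished sequences leave the running batch immediately and waiting requests are admitted at every decode step, instead of draining a static batch. The model_step function is a stand-in for one forward pass of the engine; all names are hypothetical, not Chitu's scheduler API.

```python
from collections import deque

MAX_BATCH = 8  # hypothetical slot limit for the running batch

def model_step(batch):
    # Stand-in for one decode step: generate one token per sequence.
    for req in batch:
        req["generated"] += 1
        req["done"] = req["generated"] >= req["max_tokens"]

def serve(requests):
    waiting, running = deque(requests), []
    while waiting or running:
        # Admit new requests into any free batch slots before each step.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        model_step(running)
        # Retire finished sequences immediately, freeing their slots.
        running = [r for r in running if not r["done"]]

serve([{"generated": 0, "max_tokens": n, "done": False} for n in (3, 7, 5)])
```

The payoff is that short requests never wait for the longest sequence in their batch to finish, which is what keeps hardware utilization high under mixed-length, high-concurrency traffic.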

Ecosystem and Community

  • Model Support: Integrates the latest open-source models such as LLaMA, Qwen, ChatGLM, and Baichuan;
  • Hardware Adaptation: Collaborates with domestic chip manufacturers such as Ascend, Cambricon, and Hygon;
  • Toolchain Integration: Compatible with ecosystem tools like vLLM and Text Generation Inference.

Section 06

Future Development Directions and Summary

Future Development Directions

  • Multimodal Expansion: Support for inference optimization of vision-language models;
  • Edge Deployment: Lightweight solutions for mobile/embedded devices (model compression, heterogeneous computing);
  • Automatic Optimization: Workload-based automatic parallel strategy selection and parameter tuning;
  • Training Collaboration: Integrated training-inference design, supporting online learning and hot updates.

Summary

Chitu stands among the leading domestic large model inference infrastructure projects. Through systematic architectural design and engineering optimization, it meets production-grade requirements for efficiency, flexibility, and usability. It is well suited to private deployment, domestic hardware adaptation, and scenarios demanding extreme performance, and its continued development is worth watching.