Zing Forum


Pegainfer: An LLM Inference Engine Built with Pure Rust + CUDA

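Pegainfer is a large language model inference engine built from scratch. With only about 7,000 lines of Rust and 3,400 lines of handwritten CUDA kernels, and without PyTorch or any other framework, it delivers high-performance local LLM inference.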

Tags: Rust, CUDA, LLM inference, large language models, Qwen, GPU programming, inference engine, Transformer
Published 2026/03/28 12:15 | Last activity 2026/03/28 12:20 | Estimated reading time: 7 minutes
Section 01

Pegainfer: Pure Rust+CUDA LLM Inference Engine (Main Guide)

Pegainfer is a framework-free large language model (LLM) inference engine built from scratch in about 7,000 lines of Rust and about 3,400 lines of handwritten CUDA kernels, with no reliance on PyTorch or any other heavy framework for inference. Its core philosophy is "No PyTorch. No frameworks. Just metal", and its goal is high-performance local LLM inference. It currently supports the Qwen3 series of models and reaches roughly 80 to 90 tokens/sec on a consumer RTX 5070 Ti.

Section 02

Background & Project Positioning

Most existing LLM inference engines depend on heavy frameworks such as PyTorch or ONNX Runtime, which bring complex dependency chains and hard-to-control "black box" components. Pegainfer takes a different approach and builds everything from scratch, with three core goals: 1) understand the full LLM inference stack in depth; 2) explore Rust's potential for AI inference, in particular its memory safety and concurrency; 3) implement complete inference functionality with minimal code and avoid framework redundancy.

Section 03

Technical Architecture & Key Optimizations

Pegainfer combines Rust's memory safety with CUDA's parallel computing capabilities. Its core components are a CLI entry point, an HTTP server with an OpenAI-compatible API, the model implementations (Qwen3 and Qwen3.5), tensor operations, KV cache management, a weight loader, and CUDA kernel bindings. It supports Qwen3-4B/8B (full attention with GQA) and Qwen3.5-4B (a hybrid architecture that mixes linear and full attention).

Key optimizations:

  • Grouped Query Attention (GQA): several query heads share one key/value head, which shrinks the KV cache and improves memory efficiency.
  • CUDA Graph: reduces CPU-side scheduling overhead.
  • Kernel fusion (fused MLP, fused attention): minimizes memory access.
  • Triton AOT compilation: generates optimized CUDA kernels at build time.
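To make the GQA memory saving concrete, here is a minimal sketch that estimates KV cache size for one sequence. The layer count, head counts, head dimension, and sequence length are illustrative assumptions, not values stated in the post, and `kv_cache_bytes` is a hypothetical helper rather than part of Pegainfer.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    bytes_per_elem=2 corresponds to BF16.
    """
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Illustrative configuration (assumed for this sketch, not taken from the post).
NUM_LAYERS = 36
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 4096

# Without GQA, every query head would need its own key/value head.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# With GQA, only the shared key/value heads are cached.
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

mib = 1024 * 1024
print(f"Cache without GQA: {mha_bytes / mib:.0f} MiB")
print(f"Cache with GQA:    {gqa_bytes / mib:.0f} MiB")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads, which is why GQA matters on a card with limited memory.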

Section 04

Performance Metrics on Consumer GPU

On an RTX 5070 Ti (16 GB VRAM, BF16 precision, CUDA Graph enabled):

  • Qwen3-4B: TTFT ~14ms, TPOT ~11ms/token, throughput ~91 tokens/sec
  • Qwen3.5-4B: TTFT ~22ms, TPOT ~12.2ms/token, throughput ~82 tokens/sec

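TTFT (Time To First Token) measures the delay between submitting the prompt and receiving the first generated token. TPOT (Time Per Output Token) is the average time per generated token during decoding. Throughput is tokens per second during decoding, which is approximately 1000 divided by the TPOT in milliseconds.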
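The following sketch shows one way these three metrics could be derived from token arrival timestamps. The function names are hypothetical and this is not how Pegainfer itself measures them; it only illustrates the definitions above.

```python
def decode_metrics(request_time, token_times):
    """Return (ttft_ms, tpot_ms, tokens_per_sec) from timestamps in seconds.

    request_time is when the prompt was submitted; token_times holds the
    arrival time of each generated token, in order.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    # TTFT: delay from prompt submission to the first generated token.
    ttft_ms = (token_times[0] - request_time) * 1000.0
    # TPOT: average gap between consecutive tokens during decoding.
    decode_span_ms = (token_times[-1] - token_times[0]) * 1000.0
    tpot_ms = decode_span_ms / (len(token_times) - 1)
    return ttft_ms, tpot_ms, 1000.0 / tpot_ms


def throughput_from_tpot(tpot_ms):
    """Decode throughput in tokens per second for a given TPOT in ms."""
    return 1000.0 / tpot_ms


# The reported TPOT values are consistent with the reported throughput.
print(round(throughput_from_tpot(11.0)))
print(round(throughput_from_tpot(12.2)))
```

Applied to the figures above, a TPOT of about 11 ms corresponds to about 91 tokens/sec and 12.2 ms to about 82 tokens/sec, matching the reported numbers.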

Section 05

Usage Guide & Engineering Highlights

  • Environment setup: create a virtual environment and install dependencies such as torch and transformers.
  • Model download: use huggingface-cli to download the Qwen3 models.
  • Build and run: set CUDA_HOME and PEGAINFER_TRITON_PYTHON, build with cargo build --release, then run with options for the model path, the CUDA Graph toggle, and trace output.
  • API calls: the server exposes an OpenAI-compatible /v1/completions endpoint, with streaming and non-streaming modes and sampling parameters such as temperature and top-p.
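As a rough illustration of the API call, the sketch below builds a request for the /v1/completions endpoint using only the Python standard library. The base URL, port, model name, and the `max_tokens` field are assumptions based on the general OpenAI completions format, not details confirmed by the post. The response parsing likewise assumes the usual `choices[0].text` shape.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, model="Qwen3-4B", max_tokens=64,
                             temperature=0.7, top_p=0.9, stream=False):
    """Build a POST request for an OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,            # hypothetical model identifier
        "prompt": prompt,
        "max_tokens": max_tokens,  # assumed to be supported, as in the OpenAI API
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, **kwargs):
    """Send a non-streaming request and return the first completion's text."""
    request = build_completion_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumes the standard OpenAI completions response layout.
    return body["choices"][0]["text"]
```

With a server running locally, a call such as `complete("http://localhost:8000", "Hello")` would return the generated text; the address here is a placeholder and should be replaced with wherever the server actually listens.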

Engineering highlights: Rust's memory safety guarantees help keep GPU memory management safe; a modular design separates the components (tensor, ops, model, server); the test suite covers both unit and end-to-end (E2E) tests; and built-in Chrome Trace output supports performance analysis.
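If the trace output follows the standard Chrome Trace Event format, it can be summarized with a few lines of Python. This is a sketch under that assumption: the event names in the sample are invented for illustration, and Pegainfer's actual trace contents may differ.

```python
from collections import defaultdict


def summarize_trace(trace):
    """Total duration in ms per event name, largest first.

    Accepts either the object form ({"traceEvents": [...]}) or the bare
    array form of the Chrome Trace Event format. Only complete events
    ("ph": "X") are counted; their "dur" field is in microseconds.
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0) / 1000.0
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical sample trace; real event names will differ.
SAMPLE_TRACE = {
    "traceEvents": [
        {"name": "attention", "ph": "X", "ts": 0, "dur": 300, "pid": 1, "tid": 1},
        {"name": "mlp", "ph": "X", "ts": 300, "dur": 400, "pid": 1, "tid": 1},
        {"name": "attention", "ph": "X", "ts": 700, "dur": 200, "pid": 1, "tid": 1},
    ]
}

for name, total_ms in summarize_trace(SAMPLE_TRACE).items():
    print(f"{name}: {total_ms:.3f} ms")
```

For a trace saved to disk, the file can be loaded with `json.load` and passed to `summarize_trace` directly.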

Section 06

Technical Value & Future Directions

Technical Value: Pegainfer shows that pure Rust+CUDA can reach production-level inference performance without heavy frameworks, which suits resource-constrained scenarios. It also serves as an educational resource for learning LLM inference, demonstrates Rust's potential in AI infrastructure, and shows how Triton AOT enables progressive optimization.

Limitations: only the Qwen3 series of models is supported; quantization is limited (BF16 only); batch processing needs improvement; Windows support is experimental.

Future Directions: support more models (Llama, Mistral), introduce INT8/INT4 quantization, add multi-GPU parallelism, and improve scheduling strategies.
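To see why INT8/INT4 quantization is attractive on a 16 GB card, the sketch below estimates weight storage at different precisions. It treats "4B" as a nominal 4 billion parameters and ignores quantization metadata such as scales and zero points, so the numbers are approximations rather than measurements.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate bytes needed to store the model weights."""
    return num_params * bits_per_weight // 8


# Nominal parameter count for a "4B" model (an assumption for illustration).
NOMINAL_PARAMS = 4_000_000_000

for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = weight_bytes(NOMINAL_PARAMS, bits) / (1024 ** 3)
    print(f"{label}: {gib:.2f} GiB")
```

Halving the bits per weight halves the weight footprint, which leaves more memory for the KV cache and larger batches.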

Section 07

Conclusion & Open Source Info

Pegainfer is a technical purist's project with about 10k lines of Rust+CUDA code. While it is not as feature-rich as mature solutions such as vLLM or TensorRT-LLM, it is a valuable reference for understanding LLM inference principles, GPU kernel optimization, and Rust's application in AI. It is well suited to developers who want to dig into Transformer inference or Rust-based AI infrastructure. The project is MIT-licensed, with code and documentation available on GitHub.