Zing Forum


Pegainfer: An LLM Inference Engine Built with Pure Rust + CUDA

Pegainfer is an LLM inference engine built from scratch in approximately 7,000 lines of Rust and 3,400 lines of handwritten CUDA kernels. It relies on neither PyTorch nor any other framework, and achieves high-performance local LLM inference.

Rust · CUDA · LLM Inference · Large Language Models · Qwen · GPU Programming · Inference Engine · Transformer
Published 2026-03-28 12:15 · Recent activity 2026-03-28 12:20 · Estimated read: 7 min

Section 01

Pegainfer: Pure Rust+CUDA LLM Inference Engine (Main Guide)

Pegainfer is a zero-dependency large language model (LLM) inference engine built from scratch using ~7000 lines of Rust code and ~3400 lines of handwritten CUDA kernels, with no reliance on PyTorch or any other heavy frameworks. Its core philosophy is "No PyTorch. No frameworks. Just metal", aiming to achieve high-performance local LLM inference. It currently supports Qwen3 series models and delivers excellent performance on consumer GPUs.

Section 02

Background & Project Positioning

Most existing LLM inference engines depend on heavy frameworks such as PyTorch or ONNX Runtime, which introduce complex dependency chains and hard-to-control "black box" components. Pegainfer takes a different approach, building from scratch with three core goals: 1) deeply understand the full LLM inference stack; 2) explore Rust's potential in AI inference (memory safety and concurrency); 3) implement complete inference functionality with minimal code, avoiding framework redundancy.

Section 03

Technical Architecture & Key Optimizations

Pegainfer combines Rust's memory safety with CUDA's parallel computing capabilities. Its core components include the CLI entry point, an HTTP server (OpenAI-compatible API), model implementations (Qwen3/Qwen3.5), tensor ops, KV cache management, a weight loader, and CUDA kernel bindings. It supports Qwen3-4B/8B (full attention with GQA) and Qwen3.5-4B (a hybrid of linear and full attention). Key optimizations:

  • Grouped Query Attention (GQA) for memory efficiency
  • CUDA Graph to reduce CPU scheduling overhead
  • Kernel fusion (fused MLP, fused attention) to minimize memory traffic
  • Triton AOT compilation to generate optimized CUDA kernels at build time
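To illustrate the GQA memory saving, the mapping from query heads to shared KV heads can be sketched as follows. This is a minimal sketch: the head counts, head dimension, and sequence length below are illustrative values, not Pegainfer's actual configuration.

```rust
// Minimal sketch of Grouped Query Attention (GQA) head mapping.
// With GQA, several query heads share one KV head, shrinking the KV
// cache by a factor of num_q_heads / num_kv_heads compared to MHA.

/// Map a query-head index to the KV head it reads from.
fn kv_head_for(q_head: usize, num_q_heads: usize, num_kv_heads: usize) -> usize {
    assert!(num_q_heads % num_kv_heads == 0);
    let group_size = num_q_heads / num_kv_heads;
    q_head / group_size
}

/// KV-cache size in bytes for one layer in BF16 (2 bytes/element),
/// counting both keys and values.
fn kv_cache_bytes(num_kv_heads: usize, head_dim: usize, seq_len: usize) -> usize {
    2 /* K and V */ * num_kv_heads * head_dim * seq_len * 2 /* BF16 bytes */
}

fn main() {
    // Illustrative config: 32 query heads sharing 8 KV heads (4:1 groups).
    let (q_heads, kv_heads, head_dim, seq_len) = (32, 8, 128, 4096);
    assert_eq!(kv_head_for(0, q_heads, kv_heads), 0);
    assert_eq!(kv_head_for(7, q_heads, kv_heads), 1);

    let gqa = kv_cache_bytes(kv_heads, head_dim, seq_len);
    let mha = kv_cache_bytes(q_heads, head_dim, seq_len);
    println!("KV cache per layer: GQA {} MiB vs MHA {} MiB", gqa >> 20, mha >> 20);
}
```

With this illustrative 4:1 grouping, the per-layer KV cache shrinks to a quarter of the full multi-head-attention size, which is what makes 16 GB consumer GPUs workable at longer contexts.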

Section 04

Performance Metrics on Consumer GPU

On an RTX 5070 Ti (16 GB VRAM, BF16 precision, CUDA Graph enabled):

  • Qwen3-4B: TTFT ~14ms, TPOT ~11ms/token, throughput ~91 tokens/sec
  • Qwen3.5-4B: TTFT ~22ms, TPOT ~12.2ms/token, throughput ~82 tokens/sec

TTFT (Time To First Token) is the delay from prompt submission to the first generated token; TPOT (Time Per Output Token) is the average time per token during decoding; throughput is tokens generated per second.
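These three metrics are related: steady-state throughput is roughly 1000 / TPOT (in ms), and end-to-end latency for N output tokens is roughly TTFT + (N−1)·TPOT. A quick sanity check of the numbers above:

```rust
// Sanity-check the relationship between the reported metrics:
// throughput (tokens/sec) ≈ 1000 / TPOT (ms), and end-to-end latency
// for n tokens ≈ TTFT + (n - 1) * TPOT.

fn throughput_tok_per_sec(tpot_ms: f64) -> f64 {
    1000.0 / tpot_ms
}

fn latency_ms(ttft_ms: f64, tpot_ms: f64, n_tokens: u32) -> f64 {
    ttft_ms + n_tokens.saturating_sub(1) as f64 * tpot_ms
}

fn main() {
    // Qwen3-4B figures from above: TTFT ~14 ms, TPOT ~11 ms.
    let tp = throughput_tok_per_sec(11.0);
    assert!((tp - 90.9).abs() < 0.1); // matches the reported ~91 tokens/sec

    // Generating 256 tokens: 14 + 255 * 11 = 2819 ms, i.e. ~2.8 s.
    let lat = latency_ms(14.0, 11.0, 256);
    println!("256-token generation ≈ {:.1} s", lat / 1000.0);
}
```

The same arithmetic on the Qwen3.5-4B figures (TPOT ~12.2 ms) gives ~82 tokens/sec, consistent with the table.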

Section 05

Usage Guide & Engineering Highlights

  • Environment setup: create a virtual environment and install dependencies such as torch and transformers.
  • Model download: use huggingface-cli to download Qwen3 models.
  • Build & run: set CUDA_HOME and PEGAINFER_TRITON_PYTHON, build with cargo build --release, then run with options for model path, CUDA Graph toggle, and trace output.
  • API calls: OpenAI-compatible /v1/completions endpoint (non-streaming/streaming, sampling params such as temperature and top-p).
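A request to the /v1/completions endpoint might be assembled like this. This is a hedged sketch: the field names follow the standard OpenAI completions schema mentioned above, the JSON is built by hand to stay dependency-free, and the model name and sampling values are illustrative.

```rust
// Build a request body for the OpenAI-compatible /v1/completions
// endpoint. JSON is assembled by hand here to avoid dependencies;
// a real client would use a JSON crate and an HTTP client.

fn completion_request(model: &str, prompt: &str, temperature: f64,
                      top_p: f64, max_tokens: u32, stream: bool) -> String {
    format!(
        r#"{{"model":"{}","prompt":"{}","temperature":{},"top_p":{},"max_tokens":{},"stream":{}}}"#,
        model, prompt, temperature, top_p, max_tokens, stream
    )
}

fn main() {
    let body = completion_request("qwen3-4b", "Hello", 0.7, 0.9, 64, false);
    assert!(body.contains(r#""temperature":0.7"#));
    println!("{}", body);
    // POST this body with Content-Type: application/json to the
    // server's /v1/completions route (the address depends on how
    // pegainfer was launched).
}
```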

Engineering highlights: Rust's ownership model makes GPU memory management safe; modular design cleanly separates components (tensor, ops, model, server); a complete test suite (unit and end-to-end); built-in Chrome Trace output for performance analysis.
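The Chrome Trace output mentioned above follows Chrome's Trace Event format, viewable in chrome://tracing or Perfetto. A single "complete" event can be emitted like this; this is a minimal sketch of the format, not Pegainfer's actual trace writer, and the event names and timings are illustrative.

```rust
// Emit one "complete" event (ph = "X") in Chrome's Trace Event format.
// A trace file is a JSON array of such events; timestamps and
// durations are in microseconds.

fn trace_event(name: &str, ts_us: u64, dur_us: u64, tid: u64) -> String {
    format!(
        r#"{{"name":"{}","ph":"X","ts":{},"dur":{},"pid":1,"tid":{}}}"#,
        name, ts_us, dur_us, tid
    )
}

fn main() {
    // Hypothetical events mirroring the benchmark numbers above.
    let events = vec![
        trace_event("prefill", 0, 14_000, 0),          // ~14 ms TTFT
        trace_event("decode_step", 14_000, 11_000, 0), // ~11 ms TPOT
    ];
    let json = format!("[{}]", events.join(","));
    println!("{}", json);
}
```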

Section 06

Technical Value & Future Directions

Technical value: proves that pure Rust+CUDA can reach production-level inference performance without heavy frameworks (ideal for resource-constrained scenarios); serves as an educational resource for learning LLM inference; demonstrates Rust's potential in AI infrastructure; shows progressive optimization via Triton AOT.

Limitations: only the Qwen3 series is supported; quantization is limited (BF16 only); batch processing needs improvement; Windows support is experimental.

Future directions: support more models (Llama, Mistral), introduce INT8/INT4 quantization, add multi-GPU parallelism, and improve scheduling strategies.

Section 07

Conclusion & Open Source Info

Pegainfer is a technical purist's project, totaling ~10k lines of Rust+CUDA code. While not as feature-rich as mature solutions like vLLM or TensorRT-LLM, it provides a valuable reference for understanding LLM inference principles, GPU kernel optimization, and Rust's application in AI. It is ideal for developers wanting to deep-dive into Transformer inference or Rust-based AI infrastructure. The project is MIT-licensed, with code and documentation available on GitHub.