Section 01
Pegainfer: Pure Rust+CUDA LLM Inference Engine (Main Guide)
Pegainfer is a zero-dependency large language model (LLM) inference engine built from scratch in ~7000 lines of Rust and ~3400 lines of handwritten CUDA kernels, with no reliance on PyTorch or any other heavy framework. Its core philosophy is "No PyTorch. No frameworks. Just metal," and its goal is high-performance local LLM inference. It currently supports the Qwen3 series of models and delivers strong performance on consumer GPUs.