Zing Forum


OpenInfer: A Pure Rust + CUDA Large Model Inference Engine Built From Scratch

OpenInfer is an LLM inference engine built entirely from scratch, implemented using only Rust and CUDA, with no dependencies on PyTorch or any model framework runtime.

Tags: Rust, CUDA, LLM, Inference Engine, PyTorch, Triton, Qwen, DeepSeek, Kimi, Open Source
Published 2026-06-09 22:11 | Recent activity 2026-06-09 22:24 | Estimated read 6 min

Section 01

OpenInfer: An Overview of the Framework-Free LLM Inference Engine Built with Pure Rust + CUDA

OpenInfer is an LLM inference engine built from scratch in Rust and CUDA, with no runtime dependency on PyTorch or any other model framework. The project aims for simplicity and controllability: the codebase is approximately 9,600 lines of Rust, 2,600 lines of CUDA, and 1,400 lines of Triton kernel code. It gives researchers and engineers a clean reference for understanding the underlying mechanisms of LLM inference, while also offering production-grade performance and an OpenAI-compatible API.

Section 02

The State of LLM Inference Deployment and the Motivation Behind OpenInfer

LLM inference deployment has long been dominated by frameworks such as PyTorch and TensorFlow. These frameworks are powerful, but they bring complex dependency chains and low-level behavior that is difficult to control fully. OpenInfer takes a harder path: it implements the inference engine from scratch using only Rust and CUDA, with the goal of understanding every layer of the inference stack and exploring how far a Rust-native inference engine can go.

Section 03

Technical Architecture and Core Features of OpenInfer

  1. Pure Rust + CUDA Integration: Rust's memory safety is combined with CUDA's parallel computing through the cudarc library, balancing safety and native performance.
  2. Triton AOT Kernel Compilation: Kernels are optimized and generated ahead of time during the build phase, so no Python environment is needed at runtime, which simplifies deployment.
  3. Modular Model Support: Each model is implemented as an independent crate (e.g., openinfer-qwen3-4b), making it easy to add new models and apply targeted optimizations.
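
The modular model support described above can be pictured as a shared trait that each model crate implements, with the engine written only against that trait. The sketch below is a minimal illustration under that assumption. It does not reproduce OpenInfer's actual interfaces: `CausalModel`, `generate`, and `CountingModel` are all hypothetical names, and the toy model stands in for a real crate such as openinfer-qwen3-4b.

```rust
// Hypothetical sketch: none of these names are taken from OpenInfer.

/// Shared interface that each model crate would implement (assumed design).
pub trait CausalModel {
    /// Human-readable model identifier.
    fn name(&self) -> &str;
    /// Given all tokens so far, return the id of the next token.
    fn next_token(&self, tokens: &[u32]) -> u32;
    /// Token id that terminates generation.
    fn eos_token(&self) -> u32;
}

/// Model-agnostic decode loop: it only depends on the trait, not on any model.
pub fn generate<M: CausalModel>(model: &M, prompt: &[u32], max_new_tokens: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new_tokens {
        let next = model.next_token(&tokens);
        if next == model.eos_token() {
            break; // stop without appending the end-of-sequence token
        }
        tokens.push(next);
    }
    tokens
}

/// Toy stand-in for a real model crate: emits last token + 1 until it reaches EOS.
pub struct CountingModel {
    pub eos: u32,
}

impl CausalModel for CountingModel {
    fn name(&self) -> &str {
        "counting-toy"
    }

    fn next_token(&self, tokens: &[u32]) -> u32 {
        // An empty context starts counting from 0.
        tokens.last().map_or(0, |t| t + 1)
    }

    fn eos_token(&self) -> u32 {
        self.eos
    }
}
```

With `CountingModel { eos: 5 }`, calling `generate(&model, &[1, 2], 10)` extends the prompt until the model proposes the EOS token. Adding a new model under this design means adding another `impl CausalModel` in its own crate, leaving `generate` untouched.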

Section 04

Performance and Supported Models of OpenInfer

Performance Data (RTX 5070 Ti, 16 GB), where TTFT is time to first token and TPOT is time per output token: Qwen3-4B reaches TTFT ~14 ms, TPOT ~11 ms/tok, and throughput ~91 tok/s; Qwen3.5-4B reaches TTFT ~22 ms, TPOT ~11.8 ms/tok, and throughput ~85 tok/s.

Supported Models: the Qwen series (Qwen3-4B, Qwen3-8B, Qwen3.5-4B), the DeepSeek series (V2-Lite, V4-Flash), Kimi K2-Instruct, and others. Some models require feature flags and NCCL support.
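
The reported throughput figures line up with the reciprocal of TPOT, which suggests (as an assumption, since the source does not state the batch size) that they describe single-stream decoding. The sketch below works through that arithmetic. The function names are illustrative, and the latency estimate is a simple model rather than a measured result.

```rust
/// Single-stream decode throughput implied by a TPOT given in milliseconds.
pub fn tokens_per_second(tpot_ms: f64) -> f64 {
    1000.0 / tpot_ms
}

/// Rough end-to-end latency estimate: TTFT covers the first token,
/// and each remaining token costs one TPOT. Assumes a single request.
pub fn estimated_latency_ms(ttft_ms: f64, tpot_ms: f64, output_tokens: u32) -> f64 {
    if output_tokens == 0 {
        return 0.0;
    }
    ttft_ms + tpot_ms * f64::from(output_tokens - 1)
}
```

For the figures above, `tokens_per_second(11.0)` is about 90.9 and `tokens_per_second(11.8)` is about 84.7, matching the reported ~91 tok/s and ~85 tok/s after rounding. Under the same model, `estimated_latency_ms(14.0, 11.0, 256)` gives 2819 ms for a 256-token reply from Qwen3-4B.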

Section 05

Practical Significance and Application Scenarios of OpenInfer

  1. Research and Teaching: A codebase with no framework abstraction layers is a useful resource for understanding how LLM inference works.
  2. Production Environment Optimization: With no external frameworks in the way, the engine allows precise control over memory allocation, computation graph optimization, and similar concerns.
  3. Edge Deployment: Minimal runtime dependencies and a compact deployment package suit resource-constrained scenarios.

Section 06

Limitations and Future Outlook of OpenInfer

Current Limitations: Some models (DeepSeek V4, Kimi K2) require specific feature flags and hardware configurations; sampling and logprob support varies by model; Windows support is relatively new and requires additional configuration.

Future Outlook: The project plans to keep expanding model support, optimizing performance, and improving cross-platform compatibility. It is a direction worth watching in low-level LLM inference technology.

Section 07

Build and Deployment Guide for OpenInfer

Environment Requirements: a Rust toolchain with 2024 edition support, the CUDA Toolkit (nvcc, cuBLAS), NVIDIA driver R535 or newer, and Python 3 + Triton (build time only).

Build Process:

  1. Set up the Python environment (create a virtual environment with uv venv and install torch).
  2. Download the models (using huggingface-cli).
  3. Configure environment variables (CUDA_HOME, etc.).
  4. Start the service with cargo run --release.
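
Before running the build, it can help to confirm that CUDA_HOME actually points at a toolkit containing nvcc. The following is a minimal preflight sketch, not part of OpenInfer: the function names are hypothetical, it checks only CUDA_HOME (the other variables the project needs are not listed here), and it assumes the conventional toolkit layout where nvcc lives under `bin/`.

```rust
use std::path::{Path, PathBuf};

/// Expected nvcc location inside a CUDA Toolkit root (conventional layout, assumed).
pub fn nvcc_path(cuda_home: &Path) -> PathBuf {
    let exe = if cfg!(windows) { "nvcc.exe" } else { "nvcc" };
    cuda_home.join("bin").join(exe)
}

/// Returns the nvcc path if CUDA_HOME is set and nvcc exists there,
/// otherwise a human-readable description of what is missing.
pub fn check_cuda_home(cuda_home: Option<&str>) -> Result<PathBuf, String> {
    let home = cuda_home
        .filter(|v| !v.trim().is_empty())
        .ok_or_else(|| "CUDA_HOME is not set".to_string())?;
    let nvcc = nvcc_path(Path::new(home));
    if nvcc.is_file() {
        Ok(nvcc)
    } else {
        Err(format!("nvcc not found at {}", nvcc.display()))
    }
}
```

A caller would typically pass the live environment value, for example `check_cuda_home(std::env::var("CUDA_HOME").ok().as_deref())`, and print the error before invoking cargo. Taking the value as a parameter keeps the check easy to test without mutating the process environment.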