
OpenInfer: A Pure Rust + CUDA Large Model Inference Engine Built From Scratch

OpenInfer is an LLM inference engine built entirely from scratch, implemented using only Rust and CUDA, with no dependencies on PyTorch or any model framework runtime.

Tags: Rust, CUDA, LLM, inference engine, PyTorch, Triton, Qwen, DeepSeek, Kimi, open source

Published 2026-06-09 22:11 | Recent activity 2026-06-09 22:24 | Estimated read 6 min


Section 01

OpenInfer: Guide to the Zero-Dependency LLM Inference Engine Built with Pure Rust + CUDA

OpenInfer is an LLM inference engine written from scratch in Rust and CUDA, with no dependency on PyTorch or any other model framework at runtime. The project aims for simplicity and control: the codebase is roughly 9,600 lines of Rust, 2,600 lines of CUDA, and 1,400 lines of Triton kernel code. That makes it a compact reference for researchers and engineers who want to understand how LLM inference works under the hood, while still offering production-grade performance and an OpenAI-compatible API.

Section 02

The State of LLM Inference Deployment and the Origins of OpenInfer

LLM inference deployment has long been dominated by frameworks such as PyTorch and TensorFlow. These frameworks are powerful, but they bring complex dependency chains and low-level behavior that is hard to control fully. OpenInfer takes a harder path: it implements the whole inference engine from scratch in Rust and CUDA, with the goals of understanding every layer of the inference stack and exploring how far a Rust-native inference engine can go.

Section 03

Technical Architecture and Core Features of OpenInfer

1. Pure Rust + CUDA integration: Combines Rust's memory safety with CUDA's parallel computing, bridged through the cudarc library, to balance safety and native performance.
2. Triton AOT kernel compilation: Kernels are optimized and generated ahead of time during the build, so no Python environment is needed at runtime, which simplifies deployment.
3. Modular model support: Each model is implemented as an independent crate (e.g., openinfer-qwen3-4b), which makes it easy to add new models and apply model-specific optimizations.
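To make the modular design concrete, here is a minimal sketch of how per-model crates could plug into a shared engine through a common trait. Every name in it (InferenceModel, ToyModel, Registry, make_toy, generate) is a hypothetical illustration and is not taken from the OpenInfer codebase; the toy model stands in for a real model crate so the sketch runs without a GPU.

```rust
// Hypothetical sketch: none of these names come from the OpenInfer codebase.
use std::collections::HashMap;

/// Interface that each model crate would implement (assumed shape).
pub trait InferenceModel {
    /// Identifier used to select the model.
    fn name(&self) -> &'static str;
    /// Produce the next token id given all tokens seen so far.
    fn next_token(&mut self, context: &[u32]) -> u32;
}

/// Stand-in for a real model crate: returns the last token id plus one.
pub struct ToyModel;

impl InferenceModel for ToyModel {
    fn name(&self) -> &'static str {
        "toy"
    }

    fn next_token(&mut self, context: &[u32]) -> u32 {
        context.last().map_or(0, |t| t.wrapping_add(1))
    }
}

/// Constructor function that a model crate would export.
pub fn make_toy() -> Box<dyn InferenceModel> {
    Box::new(ToyModel)
}

type Constructor = fn() -> Box<dyn InferenceModel>;

/// Maps model names to constructors so the engine never names a concrete type.
pub struct Registry {
    constructors: HashMap<&'static str, Constructor>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { constructors: HashMap::new() }
    }

    pub fn register(&mut self, name: &'static str, ctor: Constructor) {
        self.constructors.insert(name, ctor);
    }

    /// Returns None when no model crate registered under `name`.
    pub fn build(&self, name: &str) -> Option<Box<dyn InferenceModel>> {
        self.constructors.get(name).map(|ctor| ctor())
    }
}

/// Model-agnostic decode loop: appends `max_new` generated tokens to the prompt.
pub fn generate(model: &mut dyn InferenceModel, prompt: &[u32], max_new: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let next = model.next_token(&tokens);
        tokens.push(next);
    }
    tokens
}
```

With this shape, adding a model means adding one crate that implements the trait and registering its constructor; the decode loop itself does not change.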

Section 04

Performance and Supported Models of OpenInfer

Performance data (RTX 5070 Ti, 16 GB), where TTFT is time to first token and TPOT is time per output token:
Qwen3-4B: TTFT ~14 ms, TPOT ~11 ms/token, throughput ~91 tokens/s.
Qwen3.5-4B: TTFT ~22 ms, TPOT ~11.8 ms/token, throughput ~85 tokens/s.

Supported models: the Qwen series (Qwen3-4B, Qwen3-8B, Qwen3.5-4B), the DeepSeek series (V2-Lite, V4-Flash), Kimi K2-Instruct, and others. Some models require feature flags and NCCL support.
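The throughput figures follow directly from the per-token latency: decode throughput is roughly 1000 divided by the TPOT in milliseconds. The sketch below shows that arithmetic, plus a common simplified estimate of end-to-end latency (TTFT for the first token, TPOT for each later one). The function names are illustrative, and the estimate assumes a single request with constant TPOT, which real workloads only approximate.

```rust
/// Decode throughput in tokens per second for a TPOT given in milliseconds.
pub fn tokens_per_second(tpot_ms: f64) -> f64 {
    1000.0 / tpot_ms
}

/// Simplified end-to-end latency in milliseconds for `output_tokens` tokens:
/// TTFT covers the first token, and each later token costs one TPOT.
/// Assumes a single request and a constant TPOT.
pub fn total_latency_ms(ttft_ms: f64, tpot_ms: f64, output_tokens: u32) -> f64 {
    if output_tokens == 0 {
        return 0.0;
    }
    ttft_ms + tpot_ms * f64::from(output_tokens - 1)
}
```

For the reported Qwen3-4B numbers, tokens_per_second(11.0) is about 90.9, matching the stated ~91 tokens/s, and tokens_per_second(11.8) is about 84.7, matching ~85 tokens/s for Qwen3.5-4B. Under the same assumptions, a 256-token reply from Qwen3-4B would take about 14 + 11 × 255 = 2,819 ms.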

Section 05

Practical Significance and Application Scenarios of OpenInfer

1. Research and teaching: With no framework abstractions in the way, the codebase is a good resource for studying how LLM inference works.
2. Production optimization: Without external frameworks, engineers get precise control over memory allocation, computation graph optimization, and similar details.
3. Edge deployment: Minimal runtime dependencies and a compact deployment package suit resource-constrained environments.

Section 06

Limitations and Future Outlook of OpenInfer

Current limitations: Some models (DeepSeek V4, Kimi K2) require specific feature flags and hardware configurations; sampling and logprob support varies by model; Windows support is relatively new and requires additional configuration. Future outlook: Future work includes expanding model support, optimizing performance, and improving cross-platform compatibility. It is a direction in low-level LLM inference technology worth watching.

Section 07

Build and Deployment Guide for OpenInfer

Environment requirements: Rust 2024 edition, CUDA Toolkit (nvcc, cuBLAS), NVIDIA driver R535 or newer, and Python 3 with Triton (build time only). Build process:
1. Set up the Python environment (create a virtual environment with uv venv and install torch).
2. Download the model weights with huggingface-cli.
3. Configure environment variables (CUDA_HOME, etc.).
4. Start the service with cargo run --release.
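Before running the build, it can help to confirm that the CUDA toolchain is discoverable. The following is a minimal preflight sketch, not part of OpenInfer: the function names are hypothetical, and it assumes the conventional CUDA Toolkit layout in which nvcc lives under the bin directory of CUDA_HOME. The file-existence check is passed in as a closure so the logic can be exercised without a CUDA installation.

```rust
use std::path::{Path, PathBuf};

/// Expected location of nvcc under a CUDA_HOME value.
/// Assumes the conventional toolkit layout: <CUDA_HOME>/bin/nvcc.
pub fn nvcc_path(cuda_home: &str) -> PathBuf {
    let exe = if cfg!(windows) { "nvcc.exe" } else { "nvcc" };
    Path::new(cuda_home).join("bin").join(exe)
}

/// Hypothetical preflight check: verifies that CUDA_HOME is set and non-empty
/// and that nvcc exists where expected. Returns the nvcc path on success.
pub fn preflight(
    cuda_home: Option<&str>,
    exists: impl Fn(&Path) -> bool,
) -> Result<PathBuf, String> {
    let home = cuda_home
        .filter(|value| !value.is_empty())
        .ok_or_else(|| "CUDA_HOME is not set".to_string())?;
    let nvcc = nvcc_path(home);
    if exists(&nvcc) {
        Ok(nvcc)
    } else {
        Err(format!("nvcc not found at {}", nvcc.display()))
    }
}
```

In a real check, the inputs would come from the environment, for example preflight(std::env::var("CUDA_HOME").ok().as_deref(), |p| p.exists()).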
