OxiLLaMa: A Pure Rust LLM Inference Engine, Memory-Safe Alternative to llama.cpp

OxiLLaMa is an LLM inference engine fully rewritten in Rust, with zero dependencies on C/C++/Fortran. It supports 20 model architectures and 25 quantization formats, provides OpenAI-compatible API services, and aims to build cross-platform, auditable, memory-safe AI inference infrastructure.

Tags: Rust · LLM inference engine · llama.cpp · quantization · memory safety · GGUF · open source
Published 2026-04-25 15:13 · Recent activity 2026-04-25 15:19 · Estimated read 6 min

Section 01

OxiLLaMa: A Pure Rust LLM Inference Engine (Memory-Safe Alternative to llama.cpp)

OxiLLaMa is an LLM inference engine fully rewritten in Rust, with zero dependencies on C/C++/Fortran. It supports 20 model architectures and 25 quantization formats, provides OpenAI-compatible API services, and aims to build cross-platform, auditable, memory-safe AI inference infrastructure. It is a core component of the COOLJAPAN pure Rust tech stack.


Section 02

Background: Why Do We Need a Pure Rust LLM Inference Engine?

llama.cpp is the de facto standard for LLM inference, but its C/C++ code carries memory-safety risks (buffer overflows, dangling pointers, and the like), which makes production deployment riskier. OxiLLaMa was created to reimplement llama.cpp's functionality in Rust, yielding a pure Rust inference engine with zero FFI and zero system-library dependencies.


Section 03

Project Architecture and Dependencies

OxiLLaMa builds on the COOLJAPAN pure Rust tech stack and depends on underlying libraries such as SciRS2 (tensor primitives), OxiBLAS (matrix operations), and OxiFFT (fast Fourier transforms). The project comprises 11 crates and roughly 107,000 lines of Rust code with 2,020 passing tests; its modular architecture allows components to be used independently (illustrated below) and eases community contribution.
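
As an illustration of what "independent component usage" can look like in a layered pure Rust stack, the sketch below imagines a downstream crate using only a BLAS-style matmul building block. The oxiblas_like module and its API are invented for this example and are not the real OxiBLAS interface.

```rust
// Hypothetical illustration of using one layer of a modular stack on its own.
// `oxiblas_like` and its API are invented for this sketch; they are NOT the
// real OxiBLAS interface.
mod oxiblas_like {
    /// Naive row-major matmul: C (m x n) = A (m x k) * B (k x n).
    pub fn matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0;
                for p in 0..k {
                    acc += a[i * k + p] * b[p * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }
}

fn main() {
    // 2x2 example: a downstream project can depend on the math layer alone,
    // without pulling in the full inference engine.
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut c = [0.0_f32; 4];
    oxiblas_like::matmul(&a, &b, &mut c, 2, 2, 2);
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
}
```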


Section 04

Model and Quantization Format Support

  • Model architectures: 20 mainstream models are supported, including the LLaMA series, Mixtral, Qwen3, DeepSeek-V2/V3, Yi, InternLM3, MiniCPM, Mistral, Gemma 2/3, Phi-3/4, Command-R, Falcon, DBRX, Grok-1, Mamba-2, Jamba, and LLaVA. New models are added through a trait-based plugin system.
  • Quantization formats: 25 formats are supported, including traditional quantization (Q4_0/Q4_1, etc.), K-Quants (Q2_K through Q6_K), I-Quants (IQ1_S/IQ2_XXS, etc.), 1-bit quantization (Q1_0_G128), and floating-point formats (FP16/BF16/FP32). All quantization kernels are SIMD-optimized, reaching over 80% of llama.cpp's speed on x86-64 (AVX2) and ARM64 (a scalar dequantization sketch follows this list).
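
To make the block formats concrete, here is a minimal scalar sketch of Q4_0 dequantization as defined by the GGUF/llama.cpp block layout (32 weights per block: one f16 scale plus 16 bytes of packed 4-bit quants, each offset by 8). This is an illustration of the format, not OxiLLaMa's code; the project's real kernels are SIMD-optimized.

```rust
// Scalar reference for Q4_0 dequantization (GGUF/llama.cpp block layout).
// Not OxiLLaMa's SIMD kernel; a readable illustration of the format only.
const QK4_0: usize = 32; // weights per block

// Minimal f16 -> f32 conversion for normal numbers and zero
// (subnormals/NaN/inf omitted for brevity).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) as u32) << 31;
    let exp = ((bits >> 10) & 0x1F) as u32;
    let frac = (bits & 0x3FF) as u32;
    if exp == 0 && frac == 0 {
        return f32::from_bits(sign); // +/- 0.0
    }
    f32::from_bits(sign | ((exp + 112) << 23) | (frac << 13))
}

/// Dequantize one 18-byte Q4_0 block (2-byte f16 scale + 16 packed bytes)
/// into 32 f32 weights: w = d * (q - 8).
fn dequantize_q4_0(block: &[u8; 18]) -> [f32; QK4_0] {
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let qs = &block[2..];
    let mut out = [0.0_f32; QK4_0];
    for j in 0..QK4_0 / 2 {
        let lo = (qs[j] & 0x0F) as i32 - 8; // low nibble -> first half
        let hi = (qs[j] >> 4) as i32 - 8;   // high nibble -> second half
        out[j] = lo as f32 * d;
        out[j + QK4_0 / 2] = hi as f32 * d;
    }
    out
}

fn main() {
    // Scale d = 1.0 (f16 bits 0x3C00); every nibble 0x9 encodes quant 9 -> weight 1.0.
    let mut block = [0u8; 18];
    block[..2].copy_from_slice(&0x3C00_u16.to_le_bytes());
    for b in &mut block[2..] {
        *b = 0x99;
    }
    assert!(dequantize_q4_0(&block).iter().all(|&w| (w - 1.0).abs() < 1e-6));
    println!("ok");
}
```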


Section 05

Multi-Scenario Deployment Modes

  • Command-line tools: oxillama run runs a model, oxillama serve starts an OpenAI-compatible API (see the request sketch after this list), and oxillama chat --tui opens an interactive terminal interface with asynchronous streaming output.
  • Python bindings: Exposes a Python API via PyO3 for easy integration into existing workflows.
  • WebAssembly: oxillama-wasm compiles to WASM, allowing in-browser execution without a backend.
  • GPU acceleration: The optional oxillama-gpu crate provides cross-platform GPU acceleration built on wgpu.
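
Because oxillama serve speaks the standard OpenAI chat-completions protocol, any OpenAI-style client should work against it. The following minimal Rust sketch (using the reqwest and serde_json crates) assumes a hypothetical local port and model name; the actual values depend on how the server is launched.

```rust
// Minimal sketch of calling an OpenAI-compatible endpoint such as the one
// `oxillama serve` exposes. The URL and model name below are assumptions
// for illustration, not documented defaults.
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "llama-3-8b-q4_k_m", // hypothetical model id
        "messages": [{ "role": "user", "content": "Say hello in one sentence." }]
    });
    let resp: Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions") // assumed port
        .json(&body)
        .send()?
        .error_for_status()?
        .json()?;
    // Standard OpenAI response shape: choices[0].message.content holds the reply.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```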

Section 06

Enterprise-Grade Features

  • Observability: Built-in monitoring and logging system.
  • Error recovery: Returns handleable errors on inference failure instead of panicking.
  • Configuration management: Supports complex runtime configurations.
  • Model management: oxillama hub directly pulls models from HuggingFace Hub (no Python required).
  • Conversation persistence: /save and /load store and restore conversation state; the KV cache carries a SHA-256 checksum for integrity verification (a minimal checksum sketch follows this list).
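
The article does not show how the KV-cache checksum works internally, but the general pattern is straightforward: hash the serialized cache bytes on save and compare digests on load. Below is an illustrative sketch using the widely used sha2 crate; the function names and workflow are assumptions, not OxiLLaMa's actual API.

```rust
// Illustrative KV-cache integrity check with SHA-256 (sha2 crate).
// This shows the general pattern only; it is not OxiLLaMa's implementation.
use sha2::{Digest, Sha256};

fn checksum(bytes: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(bytes);
    hasher.finalize().into()
}

fn main() {
    let kv_cache: &[u8] = b"serialized kv-cache bytes"; // stand-in payload
    let stored = checksum(kv_cache);       // computed at /save time
    let ok = checksum(kv_cache) == stored; // re-checked at /load time
    assert!(ok);
    println!("kv cache verified: {ok}");
}
```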

Section 07

Performance Goals and Future Outlook

  • Performance: The goal is more than 80% of llama.cpp's speed on the same hardware. For example, LLaMA-3-8B at Q4_K_M runs at about 30 tokens/sec under llama.cpp, so OxiLLaMa targets ≥25 tokens/sec (roughly 83%); a simple way to measure such a figure is sketched below.
  • Current status and outlook: The project is in Alpha; all 20 architectures and 25 quantization formats are implemented, and development remains active. It reflects the broader migration of AI infrastructure toward memory-safe languages and suits teams that want to drop C++ dependencies and need auditable code.
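
For completeness, here is a minimal sketch of how a tokens/sec figure like the ones above can be measured: time a fixed number of decode steps and divide. generate_one_token is a hypothetical stand-in for an engine's decode call, with a sleep simulating roughly 25 tokens/sec.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for one decode step of an inference engine;
// the sleep simulates ~40 ms per token (about 25 tokens/sec).
fn generate_one_token() {
    std::thread::sleep(Duration::from_millis(40));
}

fn main() {
    let n_tokens = 100;
    let start = Instant::now();
    for _ in 0..n_tokens {
        generate_one_token();
    }
    let secs = start.elapsed().as_secs_f64();
    println!("{:.1} tokens/sec", n_tokens as f64 / secs);
}
```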