# OxiLLaMa: A Pure Rust LLM Inference Engine and Memory-Safe Alternative to llama.cpp

> OxiLLaMa is an LLM inference engine fully rewritten in Rust, with zero dependencies on C/C++/Fortran. It supports 20 model architectures and 25 quantization formats, provides OpenAI-compatible API services, and aims to build cross-platform, auditable, memory-safe AI inference infrastructure.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-25T07:13:46.000Z
- Last activity: 2026-04-25T07:19:58.577Z
- Popularity: 150.9
- Keywords: Rust, LLM, inference engine, llama.cpp, quantization, memory safety, GGUF, open source
- Page URL: https://www.zingnex.cn/en/forum/thread/oxillama-rust-llm-llama-cpp
- Canonical: https://www.zingnex.cn/forum/thread/oxillama-rust-llm-llama-cpp
- Markdown source: floors_fallback

---

OxiLLaMa is a core component of the COOLJAPAN pure Rust tech stack.

## Background: Why Do We Need a Pure Rust LLM Inference Engine?

llama.cpp is the de facto standard for LLM inference, but its C/C++ codebase carries memory-safety risks (buffer overflows, dangling pointers, and the like) that make production deployment risky. OxiLLaMa was created to reimplement llama.cpp's functionality in Rust, yielding a pure Rust inference engine with zero FFI and zero system-library dependencies.

## Project Architecture and Dependencies

OxiLLaMa builds on the COOLJAPAN pure Rust tech stack and depends on underlying libraries such as SciRS2 (tensor primitives), OxiBLAS (matrix operations), and OxiFFT (fast Fourier transforms). The project comprises 11 crates and roughly 107,000 lines of Rust code, passes 2,020 tests, and its modular architecture allows components to be used independently and eases community contribution.

## Model and Quantization Format Support

**Model Architectures**: Supports 20 mainstream architectures, including the LLaMA series, Mixtral, Qwen3, DeepSeek-V2/V3, Yi, InternLM3, MiniCPM, Mistral, Gemma 2/3, Phi-3/4, Command-R, Falcon, DBRX, Grok-1, Mamba-2, Jamba, and LLaVA. New architectures are added through a trait-based plugin system.
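A trait-based plugin system typically means each architecture implements a common interface and registers itself under a name the model file supplies. The following is a minimal std-only sketch of that shape; the trait and method names (`ModelArchitecture`, `forward`, `Registry`) are illustrative assumptions, not OxiLLaMa's actual API:

```rust
use std::collections::HashMap;

// Illustrative trait: every architecture exposes a common interface.
trait ModelArchitecture {
    fn name(&self) -> &'static str;
    // A real engine would take tensors; a token id stands in here.
    fn forward(&self, token: u32) -> u32;
}

struct Llama;
impl ModelArchitecture for Llama {
    fn name(&self) -> &'static str { "llama" }
    fn forward(&self, token: u32) -> u32 { token.wrapping_add(1) }
}

// Registry keyed by architecture name, as a model header might supply it.
struct Registry {
    archs: HashMap<&'static str, Box<dyn ModelArchitecture>>,
}

impl Registry {
    fn new() -> Self { Registry { archs: HashMap::new() } }
    fn register(&mut self, arch: Box<dyn ModelArchitecture>) {
        self.archs.insert(arch.name(), arch);
    }
    fn get(&self, name: &str) -> Option<&dyn ModelArchitecture> {
        self.archs.get(name).map(|b| b.as_ref())
    }
}

fn main() {
    let mut reg = Registry::new();
    reg.register(Box::new(Llama));
    let m = reg.get("llama").expect("architecture not registered");
    println!("{} -> {}", m.name(), m.forward(41));
}
```

Adding a new architecture then only requires a new type implementing the trait plus one `register` call, which is what makes this pattern friendly to community contributions.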
**Quantization Formats**: Supports 25 formats, covering traditional quantization (Q4_0/Q4_1, ...), K-Quants (Q2_K through Q6_K), I-Quants (IQ1_S, IQ2_XXS, ...), 1-bit quantization (Q1_0_G128), and floating-point formats (FP16/BF16/FP32). All quantization kernels are SIMD-optimized, reaching over 80% of llama.cpp's speed on x86-64 (AVX2) and ARM64.
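To make the format list concrete: Q4_0, the simplest of these, groups weights into blocks of 32 that share one scale, with each weight stored as a 4-bit integer. Below is a simplified sketch of Q4_0 quantize/dequantize following the ggml scheme, with two deliberate simplifications for clarity: the scale is kept as `f32` (GGUF stores it as f16), and nibbles are left one per byte instead of packed two per byte:

```rust
const QK4_0: usize = 32; // block size used by Q4_0

struct BlockQ40 {
    d: f32,           // shared scale for the block
    qs: [u8; QK4_0],  // 4-bit quants, unpacked one per byte here
}

fn quantize_q4_0(x: &[f32; QK4_0]) -> BlockQ40 {
    // Find the value with the largest magnitude and map it to -8.
    let mut amax = 0.0f32;
    let mut max = 0.0f32;
    for &v in x {
        if v.abs() > amax {
            amax = v.abs();
            max = v;
        }
    }
    let d = max / -8.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0u8; QK4_0];
    for (i, &v) in x.iter().enumerate() {
        // Shift into [0, 15] so each value fits in an unsigned nibble.
        qs[i] = ((v * id + 8.5) as i32).clamp(0, 15) as u8;
    }
    BlockQ40 { d, qs }
}

fn dequantize_q4_0(b: &BlockQ40) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for i in 0..QK4_0 {
        // Undo the +8 shift, then rescale.
        out[i] = (b.qs[i] as i32 - 8) as f32 * b.d;
    }
    out
}

fn main() {
    let x: [f32; QK4_0] = core::array::from_fn(|i| (i as f32 - 16.0) / 4.0);
    let b = quantize_q4_0(&x);
    let y = dequantize_q4_0(&b);
    let max_err = x.iter().zip(&y).map(|(a, v)| (a - v).abs()).fold(0.0f32, f32::max);
    println!("scale = {}, max reconstruction error = {max_err}", b.d);
}
```

The SIMD-optimized kernels the post mentions vectorize exactly this inner loop (and the corresponding dot products) with AVX2 or NEON intrinsics; the scalar version above is the reference behavior.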

## Multi-Scenario Deployment Modes

- **Command-line tools**: `oxillama run` runs models, `oxillama serve` starts an OpenAI-compatible API server, and `oxillama chat --tui` opens an interactive terminal interface with asynchronous streaming output.
- **Python bindings**: Provides API via PyO3 for easy integration into existing workflows.
- **WebAssembly**: `oxillama-wasm` compiles to WASM, allowing browser execution without a backend.
- **GPU acceleration**: Optional `oxillama-gpu` implements cross-platform GPU acceleration based on wgpu.
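The "asynchronous streaming output" in the TUI mode boils down to a producer/consumer pattern: a generation thread emits tokens as they are decoded while the front end renders each one immediately. A self-contained sketch of that shape using std channels (this is the general pattern, not OxiLLaMa's internal code):

```rust
use std::sync::mpsc;
use std::thread;

// A generation thread streams tokens over a channel; the consumer
// receives them one at a time, as a streaming chat front end would.
fn stream_collect(tokens: Vec<String>) -> String {
    let (tx, rx) = mpsc::channel::<String>();

    let producer = thread::spawn(move || {
        for tok in tokens {
            if tx.send(tok).is_err() {
                break; // receiver dropped: stop generating early
            }
        }
        // `tx` is dropped here, closing the channel and ending the loop below.
    });

    let mut out = String::new();
    for tok in rx {
        // A TUI would render each token immediately; here we just append.
        out.push_str(&tok);
    }
    producer.join().expect("producer thread panicked");
    out
}

fn main() {
    let toks = ["Hello", ", ", "world", "!"].map(String::from).to_vec();
    println!("{}", stream_collect(toks));
}
```

The same structure maps onto the other deployment modes: the API server forwards each received token as a server-sent-event chunk, and the WASM build delivers them to a JavaScript callback.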

## Enterprise-Grade Features

- **Observability**: Built-in monitoring and logging system.
- **Error recovery**: Returns handleable errors on inference failure instead of panicking.
- **Configuration management**: Supports complex runtime configurations.
- **Model management**: `oxillama hub` directly pulls models from HuggingFace Hub (no Python required).
- **Conversation persistence**: `/save`/`/load` to save conversation states; KV cache includes SHA-256 verification.
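"Returns handleable errors instead of panicking" is the idiomatic Rust `Result` pattern: failures surface as values the caller can match on rather than aborting the process. A minimal sketch, assuming a hypothetical `InferError` type whose variant names are invented for illustration:

```rust
use std::fmt;

// Illustrative error type: inference failures become values, not panics.
#[derive(Debug)]
enum InferError {
    ContextOverflow { used: usize, max: usize },
    UnknownArchitecture(String),
}

impl fmt::Display for InferError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InferError::ContextOverflow { used, max } => {
                write!(f, "context overflow: {used} tokens > max {max}")
            }
            InferError::UnknownArchitecture(a) => {
                write!(f, "unknown architecture: {a}")
            }
        }
    }
}

impl std::error::Error for InferError {}

// A fallible step returns Result instead of panicking on bad input.
fn check_context(used: usize, max: usize) -> Result<(), InferError> {
    if used > max {
        return Err(InferError::ContextOverflow { used, max });
    }
    Ok(())
}

fn main() {
    match check_context(9000, 8192) {
        Ok(()) => println!("context fits"),
        Err(e) => println!("recoverable error: {e}"), // handle, don't abort
    }
}
```

Because the error implements `std::error::Error`, callers can propagate it with `?`, log it, or recover (e.g., by truncating the context), which is what makes long-running server deployments robust.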

## Performance Goals and Future Outlook

**Performance**: The goal is to achieve over 80% of llama.cpp's speed on the same hardware. For example, the LLaMA-3-8B Q4_K_M model runs at about 30 tokens/sec on llama.cpp, and OxiLLaMa aims for ≥25 tokens/sec.
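The arithmetic behind the target: 80% of a 30 tok/s llama.cpp baseline is 24 tok/s, so the stated >= 25 tok/s goal clears the "over 80%" bar with margin. A rough sketch of how such throughput is measured, with a dummy closure standing in for the engine's decode step (a real comparison would run both engines on identical hardware and the same model file):

```rust
use std::time::Instant;

// Time a decode loop and report tokens per second.
fn tokens_per_sec(n_tokens: u32, decode: impl Fn(u32) -> u32) -> f64 {
    let start = Instant::now();
    let mut acc = 0u32;
    for t in 0..n_tokens {
        acc = acc.wrapping_add(decode(t)); // stand-in for one decode step
    }
    let elapsed = start.elapsed().as_secs_f64().max(1e-9);
    // Keep `acc` observable so the optimizer cannot delete the loop.
    std::hint::black_box(acc);
    n_tokens as f64 / elapsed
}

fn main() {
    let tps = tokens_per_sec(1_000_000, |t| t.wrapping_mul(2_654_435_761));
    println!("dummy decode throughput: {tps:.0} tok/s");
    println!("80% of a 30 tok/s baseline = {} tok/s", 30.0 * 0.8);
}
```

Measuring steady-state decode throughput (excluding prompt processing and model load time) is the convention llama.cpp's own benchmarks use, so like-for-like numbers require timing the same phase.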
**Current Status and Outlook**: Currently in the Alpha phase, all 20 architectures and 25 quantization formats have been implemented, and active development is ongoing. It represents the trend of AI infrastructure migrating to memory-safe languages, suitable for teams that need to move away from C++ dependencies and pursue code auditability.
