Zing Forum


Air.rs: Implementing 70B+ Large Model Inference on Consumer GPUs with Rust

Air.rs is a Rust-based LLM inference engine built around S.L.I.P. (Slipstream Layer Inference Protocol). Through memory mapping and layer streaming, it can run large language models with over 70B parameters on consumer GPUs with only 24 GB of VRAM.

Tags: Rust · LLM Inference · GPU · Quantization · GGUF · Memory Optimization · Transformer · Open Source
Published 2026-04-10 04:11 · Recent activity 2026-04-10 04:26 · Estimated read: 7 min

Section 01

Introduction / Main Floor


Section 02

Background: Memory Dilemma in Large Model Inference

Large language model (LLM) inference faces a core challenge: model size far exceeds VRAM capacity. Take a 70B-parameter model as an example: at FP16 precision the weights alone require 140 GB of GPU memory; even quantized to Q4 they still need about 35 GB, well beyond the 24 GB VRAM limit of an RTX 4090.

Existing solutions often come with painful trade-offs:

  • CPU offloading: inference speed decreases by 10-50x
  • Model parallelism: requires multiple expensive GPUs
  • Aggressive quantization: significant drop in output quality
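The arithmetic behind these figures is simple: parameters times bits per weight. A quick sanity check (the helper below is illustrative, not part of Air.rs):

```rust
/// Back-of-envelope estimate of weight memory in GB:
/// params (in billions) x bits-per-weight / 8 bits-per-byte.
/// Illustrative only; ignores KV cache, activations, and the per-block
/// scale overhead of real Q4 formats (closer to ~4.5 bits/weight).
fn weight_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

fn main() {
    println!("70B @ FP16: {} GB", weight_gb(70.0, 16.0)); // 140 GB
    println!("70B @ Q4:   {} GB", weight_gb(70.0, 4.0));  // 35 GB
}
```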

Section 03

Air.rs's Solution: The S.L.I.P. Protocol

Air.rs implements S.L.I.P. (Slipstream Layer Inference Protocol). Its core idea: the GGUF file is loaded via memory mapping (mmap), but at any given time only one layer's quantized weights reside in physical RAM. Weights remain compressed in GGUF block format, and QMatMul dequantizes them on the fly during matrix multiplication.
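A minimal sketch of this residency discipline (all names are hypothetical): Air.rs relies on mmap and OS paging, whereas the sketch uses an explicit seek-and-read loop into a single reusable buffer, which shows the same one-layer-resident idea more directly.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// One reusable buffer holds at most one layer's quantized weights;
/// loading layer N overwrites layer N-1. Weights stay in quantized
/// block format: dequantization would happen inside the matmul.
struct LayerStreamer {
    file: File,
    layer_spans: Vec<(u64, usize)>, // (offset, length) of each layer in the file
    buf: Vec<u8>,                   // the single resident layer
}

impl LayerStreamer {
    fn load_layer(&mut self, idx: usize) -> std::io::Result<&[u8]> {
        let (off, len) = self.layer_spans[idx];
        self.buf.resize(len, 0);
        self.file.seek(SeekFrom::Start(off))?;
        self.file.read_exact(&mut self.buf)?;
        Ok(&self.buf)
    }
}

fn main() -> std::io::Result<()> {
    // Stand-in for a GGUF file: two "layers" of 5 bytes each.
    let path = std::env::temp_dir().join("slip_sketch.bin");
    std::fs::write(&path, (0u8..10).collect::<Vec<u8>>())?;
    let mut s = LayerStreamer {
        file: File::open(&path)?,
        layer_spans: vec![(0, 5), (5, 5)],
        buf: Vec::new(),
    };
    for i in 0..2 {
        let layer = s.load_layer(i)?;     // only this layer is resident now
        println!("layer {i}: {layer:?}"); // forward pass for this block would run here
    }
    Ok(())
}
```

Peak weight memory is bounded by the largest single layer rather than the whole model, which is what makes the 24 GB budget workable.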


Section 04

Memory Usage Comparison

Model Size | Traditional Loading | Air.rs Layer Streaming
7B         | ~4 GB               | ~400 MB
70B        | ~40 GB              | ~1.5 GB

This design makes it possible to run 70B+ models on a single consumer GPU.
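The streaming figure is consistent with per-layer arithmetic, assuming roughly 80 transformer blocks for a 70B Llama-class model (an assumption of this sketch, not a number from Air.rs): ~40 GB of Q4 weights over 80 layers leaves about 512 MB resident per layer, with prefetch and pipeline buffers plausibly accounting for the rest of the ~1.5 GB.

```rust
/// Resident weight size if only one of `n_layers` blocks is in RAM.
/// The ~80-block count for a 70B Llama-class model is an assumption here.
fn per_layer_mb(total_gb: f64, n_layers: u32) -> f64 {
    total_gb * 1024.0 / n_layers as f64
}

fn main() {
    println!("{} MB per layer", per_layer_mb(40.0, 80)); // 512 MB
}
```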


Section 05

STRIX Subsystem

STRIX (Streamed Tensor Residence & Intelligent eXchange) is the GPU offloading subsystem that enables running 70B+ models on consumer GPUs. It manages a three-level memory hierarchy (VRAM → RAM → Storage) with an intelligent eviction-scoring mechanism that decides which tensors to demote.

Key components include:

  • Tensor Registry: Tensor registration and lifecycle management
  • RAII VRAM Allocation: Automatic VRAM allocation and recycling
  • CUDA/Vulkan/Metal HAL: Multi-backend GPU computing support
  • VRAM Pressure Manager: Five-level VRAM pressure management
  • Security: SecureAllocator, SharedRwLock, BoundsCheckedPtr
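A sketch of what tiered residence with eviction scoring can look like. The `Tier` enum, the entry fields, and the scoring formula below are illustrative guesses, not STRIX's actual heuristic:

```rust
use std::time::Instant;

/// Three-level residence, mirroring the VRAM -> RAM -> Storage hierarchy.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tier { Vram, Ram, Storage }

/// Per-tensor bookkeeping; fields are hypothetical.
struct TensorEntry {
    bytes: u64,
    last_used: Instant,
    pinned: bool, // e.g. tensors that must stay resident
    tier: Tier,
}

impl TensorEntry {
    /// Higher score = better eviction candidate:
    /// large, cold, unpinned tensors go first. Illustrative formula.
    fn eviction_score(&self, now: Instant) -> f64 {
        if self.pinned {
            return f64::NEG_INFINITY;
        }
        let idle_secs = now.duration_since(self.last_used).as_secs_f64();
        (self.bytes as f64).log2() * (1.0 + idle_secs)
    }
}

/// Pick the best victim to demote one tier under VRAM pressure.
fn pick_victim(entries: &[TensorEntry], now: Instant) -> Option<usize> {
    entries
        .iter()
        .enumerate()
        .filter(|(_, e)| e.tier == Tier::Vram && !e.pinned)
        .max_by(|a, b| {
            a.1.eviction_score(now)
                .partial_cmp(&b.1.eviction_score(now))
                .unwrap()
        })
        .map(|(i, _)| i)
}

fn main() {
    let now = Instant::now();
    let entries = vec![
        TensorEntry { bytes: 1 << 20, last_used: now, pinned: false, tier: Tier::Vram },
        TensorEntry { bytes: 1 << 30, last_used: now, pinned: false, tier: Tier::Vram },
        TensorEntry { bytes: 1 << 40, last_used: now, pinned: true,  tier: Tier::Vram },
    ];
    // Largest unpinned VRAM tensor is demoted first.
    println!("victim: {:?}", pick_victim(&entries, now)); // Some(1)
}
```

The RAII VRAM allocation listed above pairs naturally with this: when an allocation guard is dropped, its VRAM returns to the pool and the registry entry moves down a tier.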

Section 06

Key Features Overview

Category     | Feature
Core         | Layer streaming inference — only one Transformer block is in memory at a time
Quantization | Weights remain in GGUF block format; QMatMul dequantizes during matmul
File Format  | GGUF, SafeTensors, PyTorch (.bin/.pt), ONNX — auto-detected
Memory       | madvise / PrefetchVirtualMemory page control + mmap storage HAL
KV Cache     | Hierarchical KV cache with RAM/VRAM switching and LRU eviction
Pipeline     | Adaptive ring-buffer pipeline — overlaps NVMe reads, PCIe transfers, and GPU compute
API          | OpenAI-compatible /v1/chat/completions (SSE streaming)
Computing    | NVIDIA CUDA + Vulkan (staging transfers) + Apple Metal GPU backends
Decoding     | Speculative decoding (draft-model validation, 2-3x speedup)
Scheduling   | Continuous-batching request scheduler
Sampling     | Temperature, top-p, top-k, repetition penalty, min-p
Tokenizer    | BPE tokenizer built from the GGUF vocabulary
Model Hub    | Model downloads from Hugging Face with SHA-256 verification
Monitoring   | Real-time TUI dashboard + Prometheus-compatible metrics
Templates    | Jinja2-style chat template engine (ChatML, Llama, Mistral, etc.)
Bindings     | Optional PyO3 Python bindings
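To make the sampling row concrete, here is an illustrative temperature → softmax → top-k → top-p filter chain (a from-scratch sketch, not Air.rs's sampler; repetition penalty and min-p are omitted for brevity):

```rust
/// Filter a (token_id, logit) list down to sampling candidates.
/// Illustrative sketch of the listed samplers, not Air.rs's implementation.
fn filter_candidates(
    mut logits: Vec<(u32, f32)>,
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(u32, f32)> {
    // Temperature: >1 flattens the distribution, <1 sharpens it.
    for (_, l) in logits.iter_mut() {
        *l /= temperature;
    }
    // Softmax (max-subtracted for numerical stability).
    let max = logits.iter().map(|&(_, l)| l).fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(u32, f32)> =
        logits.iter().map(|&(t, l)| (t, (l - max).exp())).collect();
    let sum: f32 = probs.iter().map(|&(_, p)| p).sum();
    for (_, p) in probs.iter_mut() {
        *p /= sum;
    }
    // Top-k: keep the k most probable tokens.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);
    // Top-p (nucleus): shortest prefix whose cumulative mass reaches top_p.
    let (mut cum, mut keep) = (0.0f32, 0);
    for &(_, p) in &probs {
        keep += 1;
        cum += p;
        if cum >= top_p {
            break;
        }
    }
    probs.truncate(keep);
    probs // a real sampler would renormalize and draw one token
}

fn main() {
    let logits = vec![(0, 3.0), (1, 2.0), (2, 1.0), (3, 0.0)];
    let kept = filter_candidates(logits, 1.0, 3, 0.8);
    let ids: Vec<u32> = kept.iter().map(|&(t, _)| t).collect();
    println!("{ids:?}"); // [0, 1]
}
```

Order matters in such chains: applying top-k before top-p (as here) means the nucleus is taken from an already truncated set, which is one common convention among inference engines.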

Section 07

Current Status (Alpha)

All subsystems have been implemented and tested (468 tests, 0 warnings, 0 failures). The project compiles on three platforms and has production-grade GPU backends. E2E validation passed with a real Llama 3.2 3B Q8 GGUF model.


Section 08

Completed Core Features

  • ✅ Compilation on Windows/Linux/macOS
  • ✅ Unit + integration tests (468)
  • ✅ Multi-format model support (GGUF, SafeTensors, PyTorch, ONNX)
  • ✅ Serde configuration (JSON/TOML)
  • ✅ S.L.I.P. layer streaming engine
  • ✅ Transformer forward pass (quantized)
  • ✅ Hierarchical KV cache eviction
  • ✅ Speculative decoding
  • ✅ OpenAI-compatible API
  • ✅ STRIX GPU offloading (CUDA/Vulkan/Metal)
  • ✅ Vulkan staging transfers
  • ✅ VRAM safety model
  • ✅ Mmap storage HAL
  • ✅ E2E validation (real model)
  • ✅ Performance benchmarking