Zing Forum


Air.rs: Implementing 70B+ Large Model Inference on Consumer GPUs with Rust

Air.rs is a Rust-based LLM inference engine built around S.L.I.P. (Slipstream Layer Inference Protocol). Through memory mapping and layer streaming, it can run large language models with over 70B parameters on consumer GPUs with only 24 GB of VRAM.

Tags: Rust · LLM Inference · GPU · Quantization · GGUF · Memory Optimization · Transformer · Open Source
Published 2026-04-10 04:11 · Recent activity 2026-04-10 04:26 · Estimated read: 7 min

Section 01

Introduction / Main Floor


Section 02

Background: Memory Dilemma in Large Model Inference

Large language model (LLM) inference faces a core challenge: model size far exceeds VRAM capacity. Take a 70B-parameter model as an example: at FP16 precision the weights alone require 140 GB of GPU memory; even quantized to Q4 they still need about 35 GB, well beyond the 24 GB VRAM limit of an RTX 4090.

Existing solutions often come with painful trade-offs:

  • CPU offloading: inference speed decreases by 10-50x
  • Model parallelism: requires multiple expensive GPUs
  • Aggressive quantization: significant drop in output quality
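The arithmetic behind these figures is simple: parameters times bits per weight. A quick sanity check (the helper below is illustrative, not part of Air.rs):

```rust
/// Back-of-envelope estimate of weight memory in GB:
/// params (in billions) x bits-per-weight / 8 bits-per-byte.
/// Illustrative only; ignores KV cache, activations, and the per-block
/// scale overhead of real Q4 formats (closer to ~4.5 bits/weight).
fn weight_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

fn main() {
    println!("70B @ FP16: {} GB", weight_gb(70.0, 16.0)); // 140 GB
    println!("70B @ Q4:   {} GB", weight_gb(70.0, 4.0));  // 35 GB
}
```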

Section 03

Air.rs's Solution: The S.L.I.P. Protocol

Air.rs implements S.L.I.P. (Slipstream Layer Inference Protocol). Its core idea: the GGUF file is loaded via memory mapping (mmap), but at any given time only one layer's quantized weights reside in physical RAM. Weights remain compressed in GGUF block format, and QMatMul dequantizes them on the fly during matrix multiplication.
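A minimal sketch of this residency discipline (all names are hypothetical): Air.rs relies on mmap and OS paging, whereas the sketch uses an explicit seek-and-read loop into a single reusable buffer, which shows the same one-layer-resident idea more directly.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// One reusable buffer holds at most one layer's quantized weights;
/// loading layer N overwrites layer N-1. Weights stay in quantized
/// block format: dequantization would happen inside the matmul.
struct LayerStreamer {
    file: File,
    layer_spans: Vec<(u64, usize)>, // (offset, length) of each layer in the file
    buf: Vec<u8>,                   // the single resident layer
}

impl LayerStreamer {
    fn load_layer(&mut self, idx: usize) -> std::io::Result<&[u8]> {
        let (off, len) = self.layer_spans[idx];
        self.buf.resize(len, 0);
        self.file.seek(SeekFrom::Start(off))?;
        self.file.read_exact(&mut self.buf)?;
        Ok(&self.buf)
    }
}

fn main() -> std::io::Result<()> {
    // Stand-in for a GGUF file: two "layers" of 5 bytes each.
    let path = std::env::temp_dir().join("slip_sketch.bin");
    std::fs::write(&path, (0u8..10).collect::<Vec<u8>>())?;
    let mut s = LayerStreamer {
        file: File::open(&path)?,
        layer_spans: vec![(0, 5), (5, 5)],
        buf: Vec::new(),
    };
    for i in 0..2 {
        let layer = s.load_layer(i)?;     // only this layer is resident now
        println!("layer {i}: {layer:?}"); // forward pass for this block would run here
    }
    Ok(())
}
```

Peak weight memory is bounded by the largest single layer rather than the whole model, which is what makes the 24 GB budget workable.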


Section 04

Memory Usage Comparison

Model Size | Traditional Loading | Air.rs Layer Streaming
7B         | ~4 GB               | ~400 MB
70B        | ~40 GB              | ~1.5 GB

This design makes it possible to run 70B+ models on a single consumer GPU.
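The streaming figure is consistent with per-layer arithmetic, assuming roughly 80 transformer blocks for a 70B Llama-class model (an assumption of this sketch, not a number from Air.rs): ~40 GB of Q4 weights over 80 layers leaves about 512 MB resident per layer, with prefetch and pipeline buffers plausibly accounting for the rest of the ~1.5 GB.

```rust
/// Resident weight size if only one of `n_layers` blocks is in RAM.
/// The ~80-block count for a 70B Llama-class model is an assumption here.
fn per_layer_mb(total_gb: f64, n_layers: u32) -> f64 {
    total_gb * 1024.0 / n_layers as f64
}

fn main() {
    println!("{} MB per layer", per_layer_mb(40.0, 80)); // 512 MB
}
```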


Section 05

STRIX Subsystem

STRIX (Streamed Tensor Residence & Intelligent eXchange) is the GPU offloading subsystem that enables running 70B+ models on consumer GPUs. It manages a three-level memory hierarchy (VRAM → RAM → Storage) with an intelligent eviction-scoring mechanism that decides which tensors to demote.

Key components include:

  • Tensor Registry: Tensor registration and lifecycle management
  • RAII VRAM Allocation: Automatic VRAM allocation and recycling
  • CUDA/Vulkan/Metal HAL: Multi-backend GPU computing support
  • VRAM Pressure Manager: Five-level VRAM pressure management
  • Security: SecureAllocator, SharedRwLock, BoundsCheckedPtr
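A sketch of what tiered residence with eviction scoring can look like. The `Tier` enum, the entry fields, and the scoring formula below are illustrative guesses, not STRIX's actual heuristic:

```rust
use std::time::Instant;

/// Three-level residence, mirroring the VRAM -> RAM -> Storage hierarchy.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tier { Vram, Ram, Storage }

/// Per-tensor bookkeeping; fields are hypothetical.
struct TensorEntry {
    bytes: u64,
    last_used: Instant,
    pinned: bool, // e.g. tensors that must stay resident
    tier: Tier,
}

impl TensorEntry {
    /// Higher score = better eviction candidate:
    /// large, cold, unpinned tensors go first. Illustrative formula.
    fn eviction_score(&self, now: Instant) -> f64 {
        if self.pinned {
            return f64::NEG_INFINITY;
        }
        let idle_secs = now.duration_since(self.last_used).as_secs_f64();
        (self.bytes as f64).log2() * (1.0 + idle_secs)
    }
}

/// Pick the best victim to demote one tier under VRAM pressure.
fn pick_victim(entries: &[TensorEntry], now: Instant) -> Option<usize> {
    entries
        .iter()
        .enumerate()
        .filter(|(_, e)| e.tier == Tier::Vram && !e.pinned)
        .max_by(|a, b| {
            a.1.eviction_score(now)
                .partial_cmp(&b.1.eviction_score(now))
                .unwrap()
        })
        .map(|(i, _)| i)
}

fn main() {
    let now = Instant::now();
    let entries = vec![
        TensorEntry { bytes: 1 << 20, last_used: now, pinned: false, tier: Tier::Vram },
        TensorEntry { bytes: 1 << 30, last_used: now, pinned: false, tier: Tier::Vram },
        TensorEntry { bytes: 1 << 40, last_used: now, pinned: true,  tier: Tier::Vram },
    ];
    // Largest unpinned VRAM tensor is demoted first.
    println!("victim: {:?}", pick_victim(&entries, now)); // Some(1)
}
```

The RAII VRAM allocation listed above pairs naturally with this: when an allocation guard is dropped, its VRAM returns to the pool and the registry entry moves down a tier.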

Section 06

Key Features Overview

Category     | Feature
Core         | Layer streaming inference — only one Transformer block is in memory at a time
Quantization | Weights remain in GGUF block format; QMatMul dequantizes during matmul
File Format  | GGUF, SafeTensors, PyTorch (.bin/.pt), ONNX — auto-detected
Memory       | madvise / PrefetchVirtualMemory page control + mmap storage HAL
KV Cache     | Hierarchical KV cache with RAM/VRAM switching and LRU eviction
Pipeline     | Adaptive ring-buffer pipeline — overlaps NVMe reads, PCIe transfers, and GPU compute
API          | OpenAI-compatible /v1/chat/completions (SSE streaming)
Computing    | NVIDIA CUDA + Vulkan (staging transfers) + Apple Metal GPU backends
Decoding     | Speculative decoding (draft-model validation, 2-3x speedup)
Scheduling   | Continuous-batching request scheduler
Sampling     | Temperature, top-p, top-k, repetition penalty, min-p
Tokenizer    | BPE tokenizer built from the GGUF vocabulary
Model Hub    | Model downloads from Hugging Face with SHA-256 verification
Monitoring   | Real-time TUI dashboard + Prometheus-compatible metrics
Templates    | Jinja2-style chat template engine (ChatML, Llama, Mistral, etc.)
Bindings     | Optional PyO3 Python bindings
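To make the sampling row concrete, here is an illustrative temperature → softmax → top-k → top-p filter chain (a from-scratch sketch, not Air.rs's sampler; repetition penalty and min-p are omitted for brevity):

```rust
/// Filter a (token_id, logit) list down to sampling candidates.
/// Illustrative sketch of the listed samplers, not Air.rs's implementation.
fn filter_candidates(
    mut logits: Vec<(u32, f32)>,
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(u32, f32)> {
    // Temperature: >1 flattens the distribution, <1 sharpens it.
    for (_, l) in logits.iter_mut() {
        *l /= temperature;
    }
    // Softmax (max-subtracted for numerical stability).
    let max = logits.iter().map(|&(_, l)| l).fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(u32, f32)> =
        logits.iter().map(|&(t, l)| (t, (l - max).exp())).collect();
    let sum: f32 = probs.iter().map(|&(_, p)| p).sum();
    for (_, p) in probs.iter_mut() {
        *p /= sum;
    }
    // Top-k: keep the k most probable tokens.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(top_k);
    // Top-p (nucleus): shortest prefix whose cumulative mass reaches top_p.
    let (mut cum, mut keep) = (0.0f32, 0);
    for &(_, p) in &probs {
        keep += 1;
        cum += p;
        if cum >= top_p {
            break;
        }
    }
    probs.truncate(keep);
    probs // a real sampler would renormalize and draw one token
}

fn main() {
    let logits = vec![(0, 3.0), (1, 2.0), (2, 1.0), (3, 0.0)];
    let kept = filter_candidates(logits, 1.0, 3, 0.8);
    let ids: Vec<u32> = kept.iter().map(|&(t, _)| t).collect();
    println!("{ids:?}"); // [0, 1]
}
```

Order matters in such chains: applying top-k before top-p (as here) means the nucleus is taken from an already truncated set, which is one common convention among inference engines.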

Section 07

Current Status (Alpha)

All subsystems have been implemented and tested (468 tests, 0 warnings, 0 failures). The project compiles on three platforms and has production-grade GPU backends. E2E validation passed with a real Llama 3.2 3B Q8 GGUF model.


Section 08

Completed Core Features

  • ✅ Compilation on Windows/Linux/macOS
  • ✅ Unit + integration tests (468)
  • ✅ Multi-format model support (GGUF, SafeTensors, PyTorch, ONNX)
  • ✅ Serde configuration (JSON/TOML)
  • ✅ S.L.I.P. layer streaming engine
  • ✅ Transformer forward pass (quantized)
  • ✅ Hierarchical KV cache eviction
  • ✅ Speculative decoding
  • ✅ OpenAI-compatible API
  • ✅ STRIX GPU offloading (CUDA/Vulkan/Metal)
  • ✅ Vulkan staging transfers
  • ✅ VRAM safety model
  • ✅ Mmap storage HAL
  • ✅ E2E validation (real model)
  • ✅ Performance benchmarking