Zing Forum

TIBET-Store MMU: Transparent Memory Virtualization with 7 Microsecond Latency, Software-Defined NVLink for LLM Inference

Tags: memory virtualization · userfaultfd · LLM inference optimization · transparent memory expansion · AES-256-GCM encryption · zstd compression · page fault handling · software-defined NVLink · TIBET ecosystem · AI infrastructure
Published 2026-04-15 17:41 · Recent activity 2026-04-15 17:48 · Estimated read 10 min

Section 01

Introduction

TIBET-Store MMU is an open-source project based on the Linux userfaultfd mechanism that achieves transparent memory virtualization with 7-microsecond page fault latency. Through innovative MMU illusion technology, this project provides a software-defined memory expansion solution for large model inference, supporting encrypted and compressed storage as well as on-demand loading. It is a cutting-edge exploration in the field of AI infrastructure.

Section 02

Project Background and Core Challenges

In large language model (LLM) inference, the capacity limits of GPU memory and host memory have long been the core constraint on model deployment scale. As parameter counts grow from billions to hundreds of billions, efficiently loading and running these models within limited physical memory has become a major challenge for infrastructure engineers. Traditional approaches such as model parallelism, pipeline parallelism, and offloading often carry significant communication overhead or performance loss. TIBET-Store MMU instead takes a transparent memory virtualization approach: it leverages the Linux kernel's userfaultfd mechanism so that applications can transparently address virtual spaces far larger than physical memory, with page data loaded from storage on demand.

Section 03

Technical Architecture: Implementation Principles of MMU Illusion

1. Virtual Memory Mapping (mmap)

Allocate a huge virtual memory region via mmap, initially with no physical memory backing it. The MAP_ANONYMOUS and MAP_PRIVATE flags ensure the mapping is private and filled on demand, reserving address space up front without committing physical resources.
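On Linux, the reservation step can be sketched with Python's stdlib mmap module; the 1 GiB size here is purely illustrative, not the project's configuration:

```python
import mmap

SIZE = 1 << 30  # reserve 1 GiB of virtual address space (illustrative size)

# MAP_PRIVATE | MAP_ANONYMOUS: no backing file, copy-on-write, and no
# physical pages are committed until a page is first touched.
region = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

# Touching a single page faults in just that one page, not the whole gigabyte.
region[0:4] = b"ping"
print(len(region), region[0:4])  # → 1073741824 b'ping'
```

The mapping is released when the process exits; a long-lived service would hold it for its lifetime and let the fault handler populate pages lazily.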

2. userfaultfd Page Fault Interception

Use the Linux userfaultfd feature to let a user-space program take over page fault handling. When the application first touches a virtual page and triggers a fault, the event is forwarded to the user-space Archivaris thread, which executes the following steps: compute the page index from the faulting address → retrieve the compressed data (.tza format) from storage → decompress → inject the page → wake the application thread.
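The resolver steps above can be simulated in plain Python (real userfaultfd needs kernel support and C-level ioctls); zlib stands in for the project's .tza compressed format, and the `PageStore`/`resolve_fault` names are illustrative, not from the project:

```python
import zlib

PAGE_SIZE = 4096

class PageStore:
    """Holds pre-compressed pages keyed by page index (stand-in for .tza storage)."""
    def __init__(self):
        self._blobs = {}

    def put(self, index: int, page: bytes) -> None:
        assert len(page) == PAGE_SIZE
        self._blobs[index] = zlib.compress(page)

    def fetch(self, index: int) -> bytes:
        return self._blobs[index]

def resolve_fault(store: PageStore, fault_addr: int, base: int) -> bytes:
    """Mirrors the Archivaris steps: address -> index -> fetch -> decompress -> inject."""
    index = (fault_addr - base) // PAGE_SIZE   # 1. page index from fault address
    blob = store.fetch(index)                  # 2. read compressed data from storage
    page = zlib.decompress(blob)               # 3. decompress
    return page                                # 4. caller injects page, 5. wakes thread

store = PageStore()
store.put(3, b"w" * PAGE_SIZE)                 # pretend page 3 holds model weights
page = resolve_fault(store, base=0x7f0000000000,
                     fault_addr=0x7f0000000000 + 3 * PAGE_SIZE)
print(len(page), page[:1])  # → 4096 b'w'
```

In the real system the injection step would be a UFFDIO_COPY ioctl that both installs the page and wakes the faulting thread; here the return value plays that role.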

3. Multi-mode Data Filling Strategy

Provide modes such as ZeroFill (inject zero values), StaticData (fixed data copy), CompressedRestore (compressed recovery), EncryptedRestore (encrypted recovery), and CompressedEncryptedRestore (compressed and encrypted recovery) to adapt to different scenario requirements.
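The five mode names come from the article; the dispatch table below is one illustrative way to wire them up, with a toy XOR cipher standing in for the real AES-256-GCM path so the example stays self-contained:

```python
import zlib

PAGE_SIZE = 4096

def xor_cipher(data: bytes, key: int = 0x5A) -> bytes:
    # Involutive toy cipher: stand-in for AES-256-GCM, NOT real cryptography.
    return bytes(b ^ key for b in data)

# One fill function per mode; each takes the stored blob and returns a full page.
FILL_MODES = {
    "ZeroFill":                   lambda blob: b"\x00" * PAGE_SIZE,
    "StaticData":                 lambda blob: blob,
    "CompressedRestore":          lambda blob: zlib.decompress(blob),
    "EncryptedRestore":           lambda blob: xor_cipher(blob),
    "CompressedEncryptedRestore": lambda blob: zlib.decompress(xor_cipher(blob)),
}

def fill_page(mode: str, blob: bytes) -> bytes:
    return FILL_MODES[mode](blob)

page = b"\x42" * PAGE_SIZE
stored = xor_cipher(zlib.compress(page))  # compress first, then "encrypt"
print(fill_page("CompressedEncryptedRestore", stored) == page)  # → True
```

Note the unwrap order in CompressedEncryptedRestore: decrypt first, then decompress, mirroring the compress-then-encrypt order used at store time.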

Section 04

Performance: Breakthrough Significance of 7-Microsecond Latency

The project reports a page fault latency of 7 microseconds, a breakthrough for memory virtualization. Traditional storage access sits at millisecond latencies, and even NVMe SSDs need tens to hundreds of microseconds; by storing pages pre-compressed and decompressing them on demand, the project cuts handling overhead to single-digit microseconds, approaching raw memory access. This ultra-low latency turns transparent memory virtualization from theory into practice: LLM inference can keep model weights in encrypted, compressed containers and load them on demand without significantly affecting inference latency.

Section 05

Security Architecture and Collaborative Optimization of Compression and Encryption

Security Architecture: Airlock Bifurcation Encryption System

Integrate the Airlock Bifurcation encryption subsystem, which encrypts each page independently with AES-256-GCM. Access is gated by identity-based JIS claims carrying the requester's identity, permissions, roles, and department, realizing "identity as memory": without the correct credentials, zero-value pages are returned, ensuring multi-tenant data isolation.
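The zero-page fallback can be sketched as follows. The claim fields match the article; the key registry, tenant names, and the XOR placeholder (standing in for AES-256-GCM per-page decryption) are all invented for the example:

```python
PAGE_SIZE = 4096

def toy_decrypt(blob: bytes, key: int) -> bytes:
    # Placeholder for AES-256-GCM per-page decryption; NOT real cryptography.
    return bytes(b ^ key for b in blob)

# Key registry keyed by (identity, role); contents are hypothetical.
KEYS = {("tenant-a", "inference"): 0x3C}

def load_page(blob: bytes, claim: dict) -> bytes:
    """JIS-claim check: wrong or missing credentials yield a zero page, not an error."""
    key = KEYS.get((claim.get("identity"), claim.get("role")))
    if key is None:
        return b"\x00" * PAGE_SIZE            # deny by returning zeros
    return toy_decrypt(blob, key)

secret = bytes((i * 7) % 256 for i in range(PAGE_SIZE))
blob = toy_decrypt(secret, 0x3C)              # "encrypt" with the same toy cipher

ok  = load_page(blob, {"identity": "tenant-a", "role": "inference"})
bad = load_page(blob, {"identity": "tenant-b", "role": "inference"})
print(ok == secret, bad == b"\x00" * PAGE_SIZE)  # → True True
```

Returning zeros instead of raising an error keeps the faulting application alive and leaks nothing about whether the page exists, which is what makes the scheme safe for multi-tenant isolation.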

Collaborative Optimization of Compression and Encryption

In CompressedEncryptedRestore mode, data is first compressed with zstd and then encrypted with AES-256-GCM: compression reduces storage and I/O bandwidth requirements, and encrypting less data lowers CPU overhead. Tests show the combined scheme can even beat plaintext schemes on compressible data, because the saved I/O outweighs the compression and decompression cost.
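Why the order matters can be demonstrated with the stdlib alone, using zlib in place of zstd and a toy SHA-256 keystream in place of AES-256-GCM: ciphertext looks random and no longer compresses, so compress-then-encrypt is the only order that actually shrinks the data.

```python
import hashlib
import zlib

def keystream_encrypt(data: bytes, key: bytes) -> bytes:
    # Toy counter-mode keystream built from SHA-256; stand-in for AES-256-GCM.
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out += bytes(a ^ b for a, b in zip(chunk, pad))
    return bytes(out)

page = b"layer.0.attn.weight\x00" * 200        # highly compressible "weights" page
key = b"k" * 32

compress_then_encrypt = keystream_encrypt(zlib.compress(page), key)
encrypt_then_compress = zlib.compress(keystream_encrypt(page, key))

print(len(page), len(compress_then_encrypt), len(encrypt_then_compress))
# compress-then-encrypt is far smaller; ciphertext barely compresses at all.
```

The same asymmetry drives the I/O savings the article describes: only the compress-first blob is small enough to cut storage reads and the bytes fed to the cipher.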

Section 06

Ecosystem and Application Scenarios

TIBET Ecosystem and Software-Defined NVLink Vision

TIBET-Store MMU is part of the TIBET ecosystem (Transparent Intelligent Backend for Efficient Transformers), which aims to build transparent, intelligent, and efficient Transformer inference infrastructure. The project frames its vision as "Software-Defined NVLink for LLM Inference": flexible, NVLink-like memory scheduling realized purely in software. Built on standard hardware and open-source software, it supports x86, ARM, and other architectures at low cost and with full openness.

Application Scenarios and Practical Value

  • Edge AI deployment: Run larger models on devices with limited memory;
  • Cloud-native AI platforms: Efficient multi-tenant model loading and switching, with encryption ensuring data isolation;
  • Large model fine-tuning: On-demand loading of base models during LoRA fine-tuning, reducing startup time and memory usage;
  • Elastic scaling of inference services: Smooth model loading process for new instances.

Section 07

Technical Limitations and Future Outlook

Current limitations:
  • As a PoC, userfaultfd requires root privileges or the CAP_SYS_PTRACE capability, which may be restricted in production environments;
  • It mainly targets single nodes; multi-node expansion and distributed memory pooling remain to be explored;
  • Compression and encryption add CPU overhead, requiring performance trade-offs;
  • Documentation and examples are sparse, and the community ecosystem still needs to mature.

Future outlook: With the development of new memory interconnect technologies such as CXL, hardware-software collaborative memory virtualization may become mainstream, and this project provides a reference for open-source practices.

Section 08

Conclusion: Innovative Exploration in AI Infrastructure

TIBET-Store MMU breaks through hardware memory capacity limits with OS-level innovation. Its 7-microsecond latency, transparent virtualization, and built-in encryption make it a powerful tool for LLM inference optimization. For engineers working on AI systems, memory virtualization, or large-model deployment efficiency, the project rewards in-depth study, and its architectural view of memory as a software-defined elastic resource is a valuable reference.