Zing Forum

Project Chronos: A Zero-Lag MoE Inference System Based on Predictive Preloading and Asynchronous DMA

Project Chronos addresses the IO bottleneck of MoE models on consumer-grade hardware through expert prediction during the prefill phase, asynchronous DMA prefetching, and a dual-stream transmission architecture, enabling zero-lag inference.

Tags: MoE · Mixture of Experts · Inference Optimization · Asynchronous Prefetching · Expert Prediction · SSD Optimization · MLX · Consumer-Grade Hardware · Zero-Lag
Published 2026-04-23 21:13 · Recent activity 2026-04-23 21:23 · Estimated read: 3 min

Section 01

Background: IO Bottleneck of MoE Models on Consumer-Grade Hardware

Mixture of Experts (MoE) models such as Mixtral and DeepSeek-MoE balance capability and cost by activating only a subset of experts per token, but they hit an IO bottleneck when deployed on consumer-grade hardware. Traditional decoding checks token by token whether the routed experts are resident in VRAM and blocks on a synchronous load whenever one is missing (latency >40 ms); existing offloading runtimes only patch storage pressure after the fact, so the same IO cost is paid repeatedly and the interactive experience suffers badly.
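To make the bottleneck concrete, here is a minimal pure-Python sketch of the blocking decode path the post critiques. All names, the expert-ID space, and the cache contents are illustrative assumptions; only the >40 ms per-miss latency comes from the post.

```python
# Sketch of the traditional blocking scheme: each token's routed experts
# are checked against VRAM, and a cache miss stalls decoding for the full
# SSD load latency. (Hypothetical names; latency figure from the post.)

VRAM_CACHE = {0, 1}           # expert IDs currently resident (assumption)
SSD_LOAD_LATENCY_S = 0.040    # >40 ms per missing expert, per the post

def decode_token_blocking(routed_experts):
    """Return the per-token stall time under the blocking scheme."""
    stall = 0.0
    for expert_id in routed_experts:
        if expert_id not in VRAM_CACHE:
            stall += SSD_LOAD_LATENCY_S   # synchronous SSD -> VRAM load
            VRAM_CACHE.add(expert_id)
    return stall

# A token routed to one cached and one missing expert stalls ~40 ms.
print(decode_token_blocking([1, 3]))   # -> 0.04
```

Every cold expert costs a full synchronous load on the decoding critical path, which is exactly the cost Project Chronos moves off that path.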


Section 02

Core Architectural Innovations

  1. Prefill-Phase Loading: shift IO into the prefill phase by proactively predicting the expert set and prefetching it asynchronously, turning decoding into an event-driven pipeline.
  2. Three-Tier Storage Architecture: VRAM permanently hosts shared and hot experts; pinned host-memory buffers hold prefetched experts mapped via mmap; on NVMe SSD, experts are grouped with Louvain clustering to improve read efficiency.
  3. Two-Layer Routing System: an IntentClassifier (prefill, 10-15M parameters) predicts the expert set for the entire request; a Lookahead Router (decoding, 2M parameters) predicts the experts for the next 2 tokens and is trained with a supervised loss.
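The two-layer routing idea above can be sketched as follows. This is a hedged toy model: the predictor bodies, the 8-expert ID space, and the hash-based heuristics are stand-in assumptions, not the project's actual models; only the structure (coarse prefill prediction plus a 2-token lookahead feeding a prefetch queue) mirrors the post.

```python
from collections import deque

def intent_classifier(prompt_tokens):
    """Prefill-time predictor: coarse expert set for the whole request.
    (Stand-in heuristic: hash token IDs into an 8-expert space.)"""
    return {t % 8 for t in prompt_tokens}

def lookahead_router(hidden_state, horizon=2):
    """Decode-time predictor: expert sets for the next `horizon` tokens.
    (Stand-in heuristic in place of the real 2M-parameter router.)"""
    return [{(hidden_state + step) % 8} for step in range(1, horizon + 1)]

prefetch_queue = deque()

def schedule_prefetch(expert_ids, resident):
    """Enqueue only experts that are neither resident nor already queued."""
    for e in expert_ids:
        if e not in resident and e not in prefetch_queue:
            prefetch_queue.append(e)

# Prefill: coarse set becomes resident; decode: lookahead feeds the queue.
resident = intent_classifier([3, 11, 42])
for step_experts in lookahead_router(hidden_state=5):
    schedule_prefetch(step_experts, resident)
print(sorted(resident), list(prefetch_queue))
```

The design point is that the cheap lookahead router only has to be right two tokens ahead, while the larger intent classifier amortizes its cost over the whole request.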

Section 03

Key Technical Implementations

  • Dual-Stream Transmission and Event Synchronization: a dedicated H2D stream handles asynchronous data transfers while the compute stream executes in parallel; expert-level event synchronization avoids global blocking and maintains a pipeline slack of over 35 ms even under a simulated 30 ms SSD latency.
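The expert-level synchronization can be illustrated with a pure-Python threading analogue (no GPU required): one thread plays the H2D stream and signals a per-expert event as each transfer lands, while the compute thread waits only on the specific expert it needs next instead of a global barrier. Thread names, latencies, and the 4-expert schedule are illustrative assumptions, not the project's implementation.

```python
import threading
import time

# One readiness event per expert, mirroring expert-level event sync.
ready = {e: threading.Event() for e in range(4)}

def h2d_stream(load_order, ssd_latency_s=0.005):
    """Async copy thread: signal each expert's event as it lands in VRAM."""
    for e in load_order:
        time.sleep(ssd_latency_s)   # simulated SSD -> VRAM transfer
        ready[e].set()

def compute_stream(schedule):
    """Compute thread: block only on the next needed expert's event."""
    executed = []
    for e in schedule:
        ready[e].wait()             # per-expert wait, no global blocking
        executed.append(e)
    return executed

loader = threading.Thread(target=h2d_stream, args=([2, 0, 3, 1],))
loader.start()
print(compute_stream([0, 2, 1, 3]))   # overlaps with loading
loader.join()
```

Because each wait is scoped to a single expert, an expert that arrives late delays only the layer that needs it, which is how the pipeline preserves slack rather than stalling globally.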