Section 01
Background: IO Bottleneck of MoE Models on Consumer-Grade Hardware
Mixture-of-Experts (MoE) models such as Mixtral and DeepSeek-MoE balance capability and cost by routing each token to a small subset of experts, but they hit an IO bottleneck when deployed on consumer-grade hardware. Traditional decoding checks token by token whether the routed experts are resident in VRAM, and blocks on a load whenever one is missing (a stall typically exceeding 40 ms). Existing offloading runtimes only relieve memory pressure after the fact, so the same experts are fetched from host memory or disk repeatedly, and the accumulated IO overhead severely degrades the interactive experience.
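The baseline pattern described above can be sketched as a per-token residency check with blocking loads. This is a minimal illustration, not the implementation of any specific runtime: the class name, the LRU policy, and the `load_fn` hook are all assumptions introduced for clarity.

```python
from collections import OrderedDict

class NaiveExpertCache:
    """Token-by-token expert residency check with blocking loads,
    illustrating the IO-bound decoding pattern described above.
    All names here are hypothetical, not from a real runtime."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity       # max experts resident in "VRAM"
        self.load_fn = load_fn         # blocking fetch from host RAM / disk
        self.resident = OrderedDict()  # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            # hit: reuse resident weights
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        # miss: decoding stalls here until the expert finishes loading
        weights = self.load_fn(expert_id)
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least-recently used
        self.resident[expert_id] = weights
        return weights

def slow_load(expert_id):
    # stand-in for a >40 ms PCIe/NVMe transfer
    return f"weights-{expert_id}"

cache = NaiveExpertCache(capacity=2, load_fn=slow_load)
for tok_experts in [(0, 1), (2, 1), (0, 3)]:  # routed experts per token
    for e in tok_experts:
        cache.get(e)                           # every miss blocks decoding
print(list(cache.resident))                    # → [0, 3]
```

With a small cache relative to the number of experts, nearly every token triggers a miss, which is exactly the repeated-IO behavior the section attributes to reactive offloading.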