Section 01
[Introduction] UMo: Core Analysis of Unified Sparse Motion Modeling for Real-Time Speech-Driven Digital Humans
This article introduces UMo, a unified sparse motion modeling architecture for real-time speech-driven digital humans. UMo combines a spatially sparse mixture-of-experts (MoE) framework with a temporally sparse, keyframe-centric design, and processes text, audio, and motion tokens in a single framework. It generates high-fidelity facial and gesture animation in real time with low latency, addressing the quality-latency trade-off that is a key bottleneck of existing approaches.
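To make the idea of spatial sparsity concrete, here is a minimal sketch of generic top-k MoE routing in plain Python. It is not UMo's actual routing scheme, which this introduction does not specify. The function names (`top_k_route`, `sparse_moe`, `make_scaler`), the toy experts, and the hand-written gate logits are all hypothetical; in a real model the gate logits would come from a learned router.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalise their weights."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def sparse_moe(token: Sequence[float], gate_logits: Sequence[float],
               experts: Sequence[Expert], k: int = 2) -> Vector:
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(gate_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


def make_scaler(factor: float) -> Expert:
    """Hypothetical toy expert: scales every feature by a constant."""
    return lambda x: [factor * v for v in x]


# Assumption: four toy experts and hand-picked gate logits for one token.
experts = [make_scaler(f) for f in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
gate_logits = [0.0, 5.0, 5.0, -3.0]

routes = top_k_route(gate_logits, k=2)
mixed = sparse_moe(token, gate_logits, experts, k=2)
print(routes)
print(mixed)
```

Only two of the four experts run for this token, which is where the compute saving of a sparse MoE layer comes from: capacity grows with the number of experts while per-token cost grows only with `k`.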