Zing Forum

Reading

UMo: Unified Sparse Motion Modeling for Real-Time Speech-Driven Digital Humans

This article introduces UMo, a unified sparse motion modeling architecture. By leveraging a spatially sparse mixture-of-experts (MoE) framework and a temporally sparse keyframe-centric design, it processes text, audio, and motion tokens within a unified framework, enabling high-fidelity real-time speech-driven facial and gesture animation generation under low-latency conditions.

Digital Humans · Speech-Driven Animation · Sparse Modeling · Mixture-of-Experts · Real-Time Inference · Multimodal Learning · Facial Animation · Gesture Generation
Published 2026-05-14 19:56 · Recent activity 2026-05-15 12:22 · Estimated read: 7 min

Section 01

[Introduction] UMo: Core Analysis of Unified Sparse Motion Modeling for Real-Time Speech-Driven Digital Humans

This article introduces UMo—a unified sparse motion modeling architecture for real-time speech-driven digital humans. Using a spatially sparse mixture-of-experts (MoE) framework and a temporally sparse keyframe-centric design, it processes text, audio, and motion tokens in a unified framework, achieving high-fidelity real-time facial and gesture animation generation with low latency, and addressing the key bottleneck of the 'quality-latency' trade-off in existing technologies.


Section 02

Background: Real-Time Challenges in Digital Human Technology

In the fields of gaming, virtual production, and interactive media, speech-driven gesture and facial animation are core capabilities for building expressive digital humans. Existing technologies face a dilemma: unimodal methods are efficient but cannot fully exploit the potential of multimodal data; multimodal models can integrate more information but are limited by representation capacity and computational throughput, making it difficult to achieve both high-quality motion generation and real-time performance. This 'quality-latency' trade-off restricts the practical application of digital human technology.


Section 03

Methodology: UMo's Unified Sparse Architecture and Training Scheme

Core of UMo Architecture

  1. Unified Multimodal Token Representation: Text, audio, and motion are all represented as unified token sequences, simplifying the architecture, enhancing interaction, and enabling flexible expansion.
  2. Spatial Sparsity: Mixture-of-Experts (MoE) Framework: Dynamically selects a subset of expert networks to process inputs, decoupling parameter count from computational load, enabling specialized learning, and improving scalability.
  3. Temporal Sparsity: Keyframe-Centric Design: First generates keyframes that capture major changes, then reconstructs dense sequences via interpolation, reducing the number of generated frames while ensuring temporal coherence.
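The two sparsity mechanisms above can be illustrated with a minimal numpy sketch. This is not the paper's implementation, only an assumed toy version: `top_k_moe` routes each token to its top-k experts so compute scales with k rather than the total expert count, and `interpolate_keyframes` rebuilds a dense motion sequence from sparse keyframes by linear interpolation (the paper's interpolation scheme may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_moe(tokens, gate_w, expert_ws, k=2):
    """Spatial sparsity: route each token to its top-k experts.

    tokens:    (T, d) unified token sequence (text/audio/motion)
    gate_w:    (d, E) gating projection
    expert_ws: list of E (d, d) expert weight matrices
    Only k of E experts run per token, decoupling parameter
    count from per-token compute.
    """
    logits = tokens @ gate_w                        # (T, E) gating scores
    top = np.argsort(logits, axis=-1)[:, -k:]       # top-k expert indices
    sel = np.take_along_axis(logits, top, axis=-1)  # softmax over selected only
    w = np.exp(sel - sel.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for j in range(k):
            e = top[t, j]
            out[t] += w[t, j] * (tokens[t] @ expert_ws[e])
    return out

def interpolate_keyframes(keyframes, key_times, dense_times):
    """Temporal sparsity: reconstruct a dense sequence from sparse
    keyframes by per-dimension linear interpolation."""
    keyframes = np.asarray(keyframes, dtype=float)
    return np.stack(
        [np.interp(dense_times, key_times, keyframes[:, d])
         for d in range(keyframes.shape[1])],
        axis=-1,
    )

# Toy example: 6 tokens of width 8, 4 experts, 2 active per token.
T, d, E = 6, 8, 4
tokens = rng.normal(size=(T, d))
gate_w = rng.normal(size=(d, E))
experts = [rng.normal(size=(d, d)) for _ in range(E)]
mixed = top_k_moe(tokens, gate_w, experts, k=2)

# 3 keyframes expanded to 21 dense frames.
keys = rng.normal(size=(3, d))
dense = interpolate_keyframes(keys, [0, 10, 20], np.arange(21))
```

Note how the generator only has to emit 3 keyframes while the renderer still receives 21 frames; that gap is where the latency savings come from.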

Training Strategy

  • Multi-Stage Progressive Training: Pre-training (basic motion representation) → Multimodal alignment (speech-action pairing) → Fine-tuning (high-quality small-scale data).
  • Targeted Audio Enhancement: Acoustic diversity enhancement (speed variation, pitch adjustment, noise addition) + semantic consistency preservation to improve model robustness.
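The acoustic-diversity step can be sketched as follows. This is a simplified stand-in for what the article describes, not the paper's pipeline: speed variation is approximated by resampling the time axis, and noise is added at a target level relative to signal power; a real pipeline would use a dedicated audio library for pitch shifting and resampling.

```python
import numpy as np

def augment_audio(wave, rng, speed=1.1, noise_db=-30.0):
    """Acoustic-diversity augmentation (simplified illustration).

    wave:     1-D float array of audio samples
    speed:    >1 shortens the clip, <1 lengthens it (naive resampling)
    noise_db: additive Gaussian noise level relative to signal power
    """
    # Speed variation: resample onto a shorter/longer time axis.
    n = int(len(wave) / speed)
    resampled = np.interp(np.linspace(0, len(wave) - 1, n),
                          np.arange(len(wave)), wave)
    # Additive noise at noise_db below signal power.
    sig_pow = np.mean(resampled ** 2)
    noise_pow = sig_pow * 10 ** (noise_db / 10)
    return resampled + rng.normal(scale=np.sqrt(noise_pow),
                                  size=resampled.shape)

# Usage: one second of a 440 Hz tone at 16 kHz, sped up by 10%.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
wave = np.sin(2 * np.pi * 440 * t)
out = augment_audio(wave, np.random.default_rng(1), speed=1.1)
```

Because the spoken content is unchanged, the speech-to-motion pairing stays valid, which is the semantic-consistency requirement the article mentions.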

Section 04

Experimental Validation: UMo's Dual Breakthrough in Quality and Efficiency

Evaluation Metrics

Covers motion quality (naturalness, diversity, speech matching), facial animation quality (expression richness, lip synchronization), temporal coherence, and latency performance.

Core Results

  1. Low Latency with High Quality: Breaks the long-standing 'quality-latency' trade-off;
  2. Real-Time Performance: Achieves real-time inference on standard hardware;
  3. Fine-Grained Alignment: Captures subtle synergies between speech and actions (e.g., synchronization of stress with emphasis gestures);
  4. Facial and Gesture Coordination: The unified architecture avoids incoherence between the two.

Section 05

Conclusion: Summary of UMo's Technical Innovations and Value

UMo's contributions include:

  • Architecture Level: First simultaneous application of spatial sparsity (MoE) and temporal sparsity (keyframe) mechanisms in speech-driven motion generation;
  • Training Level: Combination of multi-stage training and audio enhancement provides a reusable methodology;
  • Application Level: Proves the feasibility of achieving high-fidelity real-time digital humans on consumer-grade hardware, lowering the threshold for deployment.

Section 06

Application Scenarios: Industrial Deployment Potential of UMo Technology

UMo brings new possibilities to multiple industries:

  • Gaming and Virtual Worlds: Improves NPC animation naturalness and Vtuber real-time performance;
  • Film and Television Production: Accelerates virtual production workflows and reduces iteration costs;
  • Remote Meeting Collaboration: Enhances presence in VR/AR meetings;
  • Education and Training: Improves the expressiveness of virtual teachers and optimizes learning experiences.

Section 07

Future Directions: Optimization Space and Exploration Paths for UMo

UMo still needs to explore:

  1. Style Control: Strengthen control over specific styles (cultural gestures, personalized expressions);
  2. Multi-Speaker Interaction: Extend to multi-person dialogue scenarios;
  3. Full-Body Motion: Coordinated generation of full-body movements (lower limbs, walking, etc.);
  4. Emotional Expression: Adjust facial expressions and postures based on speech emotion.