Section 01
Panorama of Dynamic Inference Acceleration: A Review of Efficiency Optimization Techniques for AIGC, MLLM, and VLA Models (Introduction)
Core Insights
This article systematically reviews inference acceleration techniques from 2025 to 2026 for video generation models, multimodal large language models (MLLMs), and vision-language-action (VLA) models, covering cache reuse, dynamic token pruning, sparse attention, and distillation.
Three Forms of Redundant Computation
- Video diffusion models: Features are highly similar across adjacent denoising timesteps, DiT blocks, or layers
- LLM/MLLM: Contexts are long, and importance is unevenly distributed across visual/video tokens
- VLA models: Redundancy across consecutive observation frames, structure in action sequences, and unequal importance across steps
To exploit these redundancies, researchers have proposed targeted acceleration strategies such as cache reuse, dynamic token pruning, and early exit of visual tokens, which aim to reduce computation while preserving output quality and task success rate.
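Two of the strategies named above, cache reuse and dynamic token pruning, can be illustrated with a minimal sketch. It is not taken from any specific method in this review: the names `relative_change`, `CachedBlock`, and `prune_tokens`, the 0.05 threshold, and the keep ratio are all hypothetical, and the importance scores are assumed to be supplied by the caller (for example, from attention weights).

```python
import math


def relative_change(a, b):
    """Relative L1 change between two equal-length feature vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0  # avoid division by zero
    return num / den


class CachedBlock:
    """Wrap an expensive block and reuse its last output when the
    input has barely changed since the last real evaluation.
    Illustrative only; the threshold is an assumed hyperparameter."""

    def __init__(self, block, threshold=0.05):
        self.block = block
        self.threshold = threshold
        self.last_input = None
        self.last_output = None
        self.calls = 0  # number of real (non-cached) evaluations

    def __call__(self, x):
        if (self.last_input is not None
                and relative_change(x, self.last_input) < self.threshold):
            return self.last_output  # cache hit: skip the block
        self.calls += 1
        self.last_input = list(x)
        self.last_output = self.block(x)
        return self.last_output


def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order."""
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:k])]


# Toy "denoising" inputs: the middle two steps are near-duplicates of the first.
block = CachedBlock(lambda x: [2.0 * v for v in x], threshold=0.05)
steps = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.0], [2.0, 2.0]]
outputs = [block(x) for x in steps]
print(block.calls)  # real evaluations out of 4 steps

# Toy visual tokens with caller-supplied importance scores.
print(prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], keep_ratio=0.5))
```

In this sketch the cache compares each input with the input of the last real evaluation rather than with the immediately preceding step, so small changes cannot accumulate unnoticed. Practical methods differ in how they measure similarity and where they place the cache.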