# Panorama of Dynamic Inference Acceleration: A Review of Efficiency Optimization Techniques for AIGC, MLLM, and VLA Models

> This article systematically reviews the cutting-edge inference acceleration technologies for video generation, multimodal large models (MLLM), and vision-language-action (VLA) models from 2025 to 2026, covering core technical directions such as cache reuse, dynamic token pruning, sparse attention, and distillation.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-19T09:48:06.000Z
- Last activity: 2026-05-19T09:55:35.084Z
- Popularity: 154.9
- Keywords: inference acceleration, video generation, multimodal large models, VLA, dynamic pruning, cache reuse, sparse attention, diffusion models, embodied intelligence, edge-side deployment
- Page link: https://www.zingnex.cn/en/forum/thread/aigcmllmvla
- Canonical: https://www.zingnex.cn/forum/thread/aigcmllmvla
- Markdown source: floors_fallback

---

## Introduction

### Core Insights
This article systematically reviews the cutting-edge inference acceleration technologies for video generation, multimodal large models (MLLM), and vision-language-action (VLA) models from 2025 to 2026, covering directions like cache reuse, dynamic token pruning, sparse attention, and distillation.

### Three Forms of Redundant Computation
- **Video diffusion models**: High similarity in features between adjacent denoising time steps, DiT blocks, or layers
- **LLM/MLLM**: Long contexts in which visual and video tokens differ widely in importance
- **VLA models**: Consecutive observation frames that overlap heavily, structured action sequences, and control steps of unequal importance

Based on this, researchers have proposed targeted acceleration strategies such as cache reuse, dynamic token pruning, and early exit of visual tokens, aiming to improve computational efficiency while maintaining quality and success rate.

## Background: Urgent Need for Inference Efficiency of Large Models

With the rapid development of large language models, multimodal models, and embodied intelligence models, inference cost has become a key bottleneck for deploying AI applications. Whether for video diffusion models in AIGC or VLA models in robotics, large parameter counts and complex computation graphs make real-time inference extremely challenging.

Traditional compression and quantization techniques often come at the cost of accuracy, while recent studies have found that there is a large amount of redundant computation in the inference process that can be reused, pruned, or exited early, opening up new directions for dynamic inference acceleration. This review systematically sorts out the latest progress in this field from 2025 to 2026, providing a reference for researchers and engineers.

## Core Technical Strategies: Six Directions of Dynamic Inference Acceleration

The review categorizes existing work into six technical categories:
1. **Cache Reuse**: TeaCache, AdaCache, etc., reuse intermediate results by identifying feature similarity, which has high engineering implementation value.
2. **Dynamic Token Pruning**: SDTP, SlimInfer, etc., estimate token importance layer by layer and prune the less important tokens. The core challenges are scoring importance reliably and handling the information lost with pruned tokens.
3. **Early Exit and Layer Skipping**: DyVTE, DySL-VLA, etc., allow the model to exit early after reaching a confidence level, saving computational overhead.
4. **Sparse Attention**: PASA and others dynamically allocate attention budgets to alleviate video flickering issues.
5. **Fewer-Step Distillation**: RMD, DisCa, etc., reduce the number of diffusion sampling steps and optimize cross-resolution distribution matching.
6. **Action Representation Optimization**: FAST, OpenVLA-OFT, etc., convert continuous actions into compact tokens or action chunks to reduce latency.
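To make the early-exit idea from item 3 concrete, here is a minimal sketch of confidence-based exiting. It is a generic illustration, not the DyVTE or DySL-VLA algorithm: the `layers`, `exit_head`, and `threshold` names are assumptions introduced for this example, and the toy layers are stand-ins for real transformer blocks.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    exit_head: Callable[[Vector], Vector],
    threshold: float = 0.9,
) -> Tuple[Vector, int]:
    """Run layers in order and stop once the top-class probability
    reported by the exit head reaches `threshold`.

    Returns (class probabilities, number of layers actually executed).
    """
    probs = softmax(exit_head(hidden))
    used = 0
    for layer in layers:
        hidden = layer(hidden)
        used += 1
        probs = softmax(exit_head(hidden))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs, used


# Hypothetical toy stand-ins: each "layer" doubles the first logit,
# and the exit head simply reads the hidden state as logits.
toy_layers = [lambda h: [h[0] * 2.0, h[1]] for _ in range(6)]
probs, used = early_exit_forward([0.5, 0.0], toy_layers, exit_head=lambda h: h)
```

In this toy run the loop stops after three of the six layers. In a real model the exit head adds its own cost, so the threshold trades saved layers against both that overhead and the risk of exiting on a wrong prediction.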

## AIGC Video Generation Acceleration: Multi-Level Cache Reuse Practices

Acceleration of video diffusion models requires multi-level optimization:
- **Timestep Dimension**: TeaCache uses timestep embeddings to estimate how much the output changes between adjacent denoising steps and reuses the cache during stable phases.
- **DiT Block Level**: BWCache finds that feature changes follow a U-shaped distribution across depth (large changes in shallow and deep layers, stable in the middle layers) and applies block-level dynamic caching.
- **Adaptive Mechanism**: AdaCache schedules caching dynamically according to how hard each video is to generate, which improves the speedup; EasyCache proposes a training-free, runtime-adaptive scheme that is practical to engineer.
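The timestep-level reuse described above can be sketched as follows. This is a simplified illustration in the spirit of TeaCache, not its actual implementation: the `StepCache` class, the `signal` argument (a cheap per-step indicator such as a timestep-modulated input), and the threshold value are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def rel_l1(curr: Sequence[float], prev: Sequence[float]) -> float:
    """Relative L1 distance between two cheap per-step signals."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) or 1e-8
    return num / den


class StepCache:
    """Reuse the last computed residual while the accumulated change
    in the signal stays below `threshold` (hypothetical sketch)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.accumulated = 0.0
        self.prev_signal: Optional[Vector] = None
        self.cached_residual: Optional[Vector] = None
        self.skipped = 0

    def step(self, x: Vector, signal: Vector,
             blocks: Callable[[Vector], Vector]) -> Vector:
        if self.prev_signal is not None:
            self.accumulated += rel_l1(signal, self.prev_signal)
        self.prev_signal = list(signal)

        if self.cached_residual is not None and self.accumulated < self.threshold:
            # Stable phase: skip the expensive blocks, reapply the residual.
            self.skipped += 1
            return [xi + ri for xi, ri in zip(x, self.cached_residual)]

        out = blocks(x)  # full computation
        self.cached_residual = [o - xi for o, xi in zip(out, x)]
        self.accumulated = 0.0  # start accumulating from this fresh result
        return out


calls: List[int] = []


def expensive_blocks(x: Vector) -> Vector:
    """Stand-in for the DiT blocks; records each full evaluation."""
    calls.append(1)
    return [2.0 * v for v in x]


cache = StepCache(threshold=0.05)
latent = [1.0, -1.0]
for sig in [[1.0], [1.01], [1.02], [2.0], [2.01]]:
    latent_out = cache.step(latent, sig, expensive_blocks)
```

Over the five toy steps the blocks run twice and are skipped three times. Accumulating the change, rather than comparing only adjacent steps, is what keeps many small drifts from going unnoticed.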

## MLLM and VLA Acceleration: Visual Token Management and Edge-Side Optimization

#### MLLM Acceleration: Fine-Grained Management of Visual Tokens
- DyVTE: Dynamic early exit of visual tokens; once the text tokens have gathered sufficient visual information, the visual tokens are dropped from subsequent computation.
- ATP-LLaVA: Instance-level and layer-level adaptive visual token pruning (limitation: pruned tokens cannot be recovered).
- DTP: For VLA scenarios, prunes interfering visual tokens irrelevant to the task.
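As a rough illustration of score-based visual token pruning, the sketch below keeps the highest-scoring fraction of tokens and preserves their original order. It is a generic top-k scheme, not the ATP-LLaVA or DTP method; the function name, the `keep_ratio` parameter, and the example scores are assumptions.

```python
import math
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def prune_visual_tokens(
    tokens: Sequence[T], scores: Sequence[float], keep_ratio: float
) -> Tuple[List[T], List[int]]:
    """Keep the top `keep_ratio` fraction of tokens by importance score.

    Returns the surviving tokens in their original order, plus their
    indices so that positional information can be preserved downstream.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:k])  # restore original token order
    return [tokens[i] for i in kept_idx], kept_idx


# Hypothetical scores, e.g. text-to-image attention averaged over heads.
kept, kept_idx = prune_visual_tokens(
    ["sky", "cup", "wall", "hand"], [0.10, 0.70, 0.05, 0.15], keep_ratio=0.5
)
```

Returning the kept indices alongside the tokens matters in practice, because position embeddings and any later recovery step need to know where each surviving token came from.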

#### VLA Acceleration: End-to-End Optimization
- OpenVLA-OFT: Parallel decoding + action chunking + continuous action representation, improving speed while maintaining success rate.
- VLA-Cache: Reuses stable visual token KV cache in continuous observation frames, reducing CUDA latency by about 1.7x.
- SmolVLA: Small architecture + asynchronous inference stack, suitable for low-cost edge-side deployment.
- Stable-FAST: Focuses on the stability of autoregressive VLA inference, jointly optimizing speed and control smoothness.
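The cross-frame reuse idea behind VLA-Cache can be sketched as below: patches that barely change keep their cached key/value entry, and only changed patches are re-projected. This is a simplified illustration and not the VLA-Cache implementation; `PatchKVCache`, `project_kv`, and the tolerance `tol` are hypothetical names, and the toy projection stands in for real K/V projections.

```python
from typing import Callable, List, Sequence, Tuple

Patch = Sequence[float]
KV = Tuple[float, float]


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Average absolute per-element difference between two patches."""
    return sum(abs(x - y) for x, y in zip(a, b)) / max(len(a), 1)


class PatchKVCache:
    """Recompute K/V only for patches that moved by more than `tol`."""

    def __init__(self, project_kv: Callable[[Patch], KV], tol: float = 0.02) -> None:
        self.project_kv = project_kv
        self.tol = tol
        self.ref_patches: List[List[float]] = []  # patch each KV was computed from
        self.kv: List[KV] = []

    def update(self, patches: Sequence[Patch]) -> int:
        """Refresh the cache for a new frame; return how many patches
        had to be recomputed."""
        reset = len(self.ref_patches) != len(patches)
        if reset:
            self.ref_patches = [list(p) for p in patches]
            self.kv = [self.project_kv(p) for p in patches]
            return len(patches)
        recomputed = 0
        for i, patch in enumerate(patches):
            # Compare against the reference patch, not the previous frame,
            # so slow drift cannot leave a stale entry in the cache.
            if mean_abs_diff(patch, self.ref_patches[i]) > self.tol:
                self.kv[i] = self.project_kv(patch)
                self.ref_patches[i] = list(patch)
                recomputed += 1
        return recomputed


def toy_project(patch: Patch) -> KV:
    """Hypothetical stand-in for the key/value projections."""
    return (sum(patch), max(patch))


frame0 = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]]
frame1 = [[0.1, 0.1], [0.8, 0.2], [0.9, 0.9]]  # only the middle patch changes
cache = PatchKVCache(toy_project)
first = cache.update(frame0)
second = cache.update(frame1)
```

Here the first frame computes all three entries and the second frame recomputes only one. A real system would also need to decide which tokens are too task-relevant to reuse, which this sketch does not attempt.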

## Research Gaps and Future Directions

There are six major gaps in current research:
1. Real-time inference is not fully solved; current speedups still fall well short of what real-time video generation requires.
2. Speed-quality trade-off: Cache, pruning, and distillation introduce error accumulation; explicit modeling of quality loss and propagation is needed.
3. Unreliable dynamic scoring mechanisms: The effectiveness of attention scoring is questioned in the MLLM field.
4. Removed information is hard to recover: Research on recoverable pruning, soft cache, or uncertainty-triggered recomputation is needed.
5. FLOPs reduction ≠ latency reduction: Batch processing, KV cache, etc., affect final latency in real systems.
6. VLA acceleration needs to balance control stability: Speed and robot performance need to be jointly optimized.
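Gap 5 can be illustrated with a simple roofline-style estimate, in which a step takes as long as the slower of its compute time and its memory-traffic time. This model is an assumption brought in for illustration, not something the review prescribes, and every hardware and workload number below is hypothetical.

```python
def step_latency_ms(
    flops: float,
    bytes_moved: float,
    peak_flops: float,
    mem_bandwidth: float,
    overhead_ms: float = 0.0,
) -> float:
    """Roofline-style estimate: the step is bound by whichever of
    compute or memory traffic takes longer, plus a fixed overhead."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / mem_bandwidth
    return max(compute_s, memory_s) * 1e3 + overhead_ms


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BW = 1e12
WEIGHT_BYTES = 14e9  # e.g. weights that must be read on every step

baseline_ms = step_latency_ms(2e12, WEIGHT_BYTES, PEAK_FLOPS, MEM_BW)
# Pruning halves the FLOPs, but the weights are still read in full.
pruned_ms = step_latency_ms(1e12, WEIGHT_BYTES, PEAK_FLOPS, MEM_BW)

flops_saved = 1.0 - 1e12 / 2e12
latency_saved = 1.0 - pruned_ms / baseline_ms
```

Under these made-up numbers, halving the FLOPs moves the step from compute-bound to memory-bound, so latency falls by roughly 30% rather than 50%. Real systems add batching, kernel-launch, and KV-cache effects on top, which is why measured latency should be reported alongside FLOPs.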

Future work should target these gaps to make dynamic inference techniques more robust.

## Research Recommendations and Reading Roadmap

#### Recommended Research Entry Points
1. **Recoverable Dynamic Computation**: Add recoverable mechanisms in pruning/early exit to avoid losses from early misjudgments.
2. **Task-Sensitive Error Control**: Introduce task feedback constraints such as action stability and robot success rate.
3. **Hardware-Friendly Scheduling**: Jointly design dynamic strategies with KV cache, GPU parallelism, and edge-side deployment.
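One possible reading of the first entry point, recoverable dynamic computation, is to keep the full token set available and fall back to it when the pruned pass looks uncertain. The sketch below is a hypothetical illustration only: the `model` callable returning an `(answer, confidence)` pair, the `min_confidence` threshold, and the toy model are all assumptions, not an existing method.

```python
import math
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[int]], Tuple[int, float]]


def top_k_tokens(tokens: Sequence[int], scores: Sequence[float],
                 keep_ratio: float) -> List[int]:
    """Keep the highest-scoring fraction of tokens, in original order."""
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:k])]


def infer_with_fallback(
    tokens: Sequence[int],
    scores: Sequence[float],
    model: Model,
    keep_ratio: float = 0.5,
    min_confidence: float = 0.5,
) -> Tuple[int, bool]:
    """Try the cheap pruned pass first; if the model is not confident,
    recompute on the full token set. Returns (answer, recomputed)."""
    answer, confidence = model(top_k_tokens(tokens, scores, keep_ratio))
    if confidence >= min_confidence:
        return answer, False
    answer, _ = model(tokens)  # uncertainty-triggered recomputation
    return answer, True


def toy_model(tokens: Sequence[int]) -> Tuple[int, float]:
    """Hypothetical model: confident only if the key token 9 is visible."""
    return max(tokens), (1.0 if 9 in tokens else 0.3)


tokens = [1, 9, 3, 4]
# Scores that wrongly rank the key token low: pruning drops it.
bad = infer_with_fallback(tokens, [0.9, 0.1, 0.8, 0.2], toy_model)
# Scores that rank the key token high: the pruned pass suffices.
good = infer_with_fallback(tokens, [0.1, 0.9, 0.2, 0.3], toy_model)
```

The fallback only helps when the confidence signal is itself trustworthy, which ties this entry point back to the gap on unreliable scoring mechanisms. It also makes worst-case latency higher than the unpruned baseline, a cost that matters for control loops.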

#### Reading Roadmap
- **Cache Reuse**: SmoothCache → TeaCache → AdaCache → EasyCache → BWCache
- **Dynamic Token Methods**: SDTP → ATP-LLaVA → DyVTE → DyCoke
- **VLA Acceleration**: FAST → OpenVLA-OFT → VLA-Cache → EfficientVLA → Stable-FAST

Organizing the literature around the six gaps above (real-time performance, quality loss, scoring reliability, recoverability, real-system latency, and control stability) gives a systematic view of the field.
