Next Forcing: Multi-Chunk Prediction Framework Accelerates Training and Inference of World Models

Inspired by multi-token prediction in large language models, Next Forcing proposes a multi-chunk prediction framework. By predicting multiple future video chunks simultaneously, it achieves faster convergence, higher accuracy, and a 2x inference speedup, and attains state-of-the-art (SOTA) performance on the RoboTwin benchmark.

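Tags: World Models, Video Generation, Multi-Chunk Prediction, Autoregressive Models, Robot Learning, Physics Simulation, Training Acceleration, Inference Optimization
Published 2026-06-10 01:59 | Recent activity 2026-06-10 11:53 | Estimated read 6 min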

Section 01

Introduction: Core Highlights of the Next Forcing Multi-Chunk Prediction Framework

Next Forcing is a multi-chunk prediction framework inspired by multi-token prediction in large language models. By predicting multiple future video chunks simultaneously, it achieves faster convergence, higher accuracy, and a 2x inference speedup, and attains SOTA performance on the RoboTwin benchmark.

Source Information: Original Author/Maintainer: arXiv authors; Source Platform: arXiv; Original Title: Next Forcing: Causal World Modeling with Multi-Chunk Prediction; Original Link: http://arxiv.org/abs/2606.11187v1; Publication Time: 2026-06-09T17:59:22Z

Section 02

Background: Training Dilemmas of World Action Models

Autoregressive video generation is the mainstream paradigm for building World Action Models (WAMs), but it faces two major challenges. First, training converges slowly and accuracy is limited, especially in high-frame-rate scenarios. Second, inference is slow because of iterative denoising. The root cause of the low training efficiency is the design of the supervision signal: only the current chunk is supervised, so the model receives no explicit guidance from future dynamics. This makes long-range dependencies hard to capture and limits how deeply the model understands causal relationships in the physical world.

Section 03

Method: Design of Next Forcing's Multi-Chunk Prediction Framework

Inspired by multi-token prediction in LLMs, Next Forcing proposes a Multi-Chunk Prediction (MCP) framework: during training, the model simultaneously predicts multiple future video chunks at different time scales, forming a prediction chain from the near future to the far future.

Implementation details: a lightweight auxiliary MCP module is added to the main model. It uses a chain structure (next¹→next²→next³) and reuses the main model's intermediate features to balance efficiency and capability.

Advantages: near-future predictions guide far-future ones, so gradients flow along the chain; multi-scale temporal supervision increases the density and diversity of the training signal.
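To make the supervision scheme concrete, here is a minimal Python sketch of how multi-chunk targets could be assembled and how per-horizon losses could be combined. It is an illustration under stated assumptions, not the paper's implementation: the function names `build_mcp_targets` and `combine_horizon_losses`, the choice to skip positions near the end of the sequence, and the geometric `decay` weighting are all hypothetical.

```python
from typing import List, Sequence, Tuple


def build_mcp_targets(
    chunks: Sequence[str], num_horizons: int
) -> List[Tuple[str, List[str]]]:
    """Pair each context chunk with its next^1 .. next^K target chunks.

    Assumption: positions that do not have all K future chunks are skipped.
    A real pipeline might pad or mask them instead.
    """
    pairs = []
    for t in range(len(chunks) - num_horizons):
        targets = [chunks[t + k] for k in range(1, num_horizons + 1)]
        pairs.append((chunks[t], targets))
    return pairs


def combine_horizon_losses(losses: Sequence[float], decay: float = 0.5) -> float:
    """Weighted sum of per-horizon losses.

    Assumption: the loss for horizon k (1-based) is weighted by decay**(k-1),
    so far-future predictions contribute less than near-future ones.
    """
    return sum(loss * decay**k for k, loss in enumerate(losses))


# Five placeholder chunks, three prediction horizons (next^1, next^2, next^3).
pairs = build_mcp_targets(["c0", "c1", "c2", "c3", "c4"], num_horizons=3)
total = combine_horizon_losses([1.0, 2.0, 4.0], decay=0.5)
```

With standard next-chunk training, each position would contribute only its first target; here every position contributes three, which is what the article means by denser supervision.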

Section 04

Evidence: Experimental Results on Training Acceleration and Accuracy Improvement

Experiments support the method's effectiveness. At 50 frames per second and after 5,000 training steps, performance improves by 93.1% relative to LingBot-VA, and convergence is 2.3x faster. On the RoboTwin benchmark, Next Forcing reaches 94.1% in the Clean setting and 93.5% in the Random setting, both SOTA. It also shows significant improvements on the physical world video generation (PhyWorld) benchmark. In general video pre-training, FVD (Fréchet Video Distance) falls by more than 50%, with improved generation quality and diversity.
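For readers unfamiliar with the metric, FVD is the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated videos (the features commonly come from a pretrained video network such as I3D). The symbols below are standard notation, not taken from the paper: $\mu_r, \Sigma_r$ are the mean and covariance of the real-video features, and $\mu_g, \Sigma_g$ those of the generated videos. Lower is better.

```latex
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```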

Section 05

Evidence: Implementation of Inference Acceleration and Deployment Value

The MCP module is retained at inference time to achieve a 2x speedup: traditional autoregressive methods require frame-by-frame iterative denoising, whereas Next Forcing predicts the current and next chunks in parallel. This matters for latency-sensitive scenarios such as real-time robot control and autonomous driving decision-making, because it reduces latency without sacrificing quality and removes an obstacle to deploying WAMs.
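The source of the 2x figure can be seen with a simple counting model, sketched below in Python. This is an idealized illustration, not a description of the paper's measurement: it assumes every denoising pass costs the same, that the MCP module adds negligible overhead, and that the hypothetical helper `sequential_passes` captures all sequential work.

```python
import math


def sequential_passes(
    num_chunks: int, denoise_steps: int, chunks_per_pass: int = 1
) -> int:
    """Count sequential denoising passes needed to generate num_chunks.

    Assumption: each round of denoise_steps passes produces chunks_per_pass
    chunks, and rounds must run one after another.
    """
    rounds = math.ceil(num_chunks / chunks_per_pass)
    return rounds * denoise_steps


# Hypothetical workload: 8 chunks, 10 denoising steps per round.
baseline = sequential_passes(8, 10)                  # one chunk per round
mcp = sequential_passes(8, 10, chunks_per_pass=2)    # current + next chunk per round
speedup = baseline / mcp
```

Under these assumptions the speedup is exactly 2 for an even number of chunks and slightly lower for an odd number, since the last round then produces only one chunk.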

Section 06

Conclusion and Recommendations: Technical Insights and Future Directions

Technical Insights: Multi-token prediction, an idea from LLMs, has been successfully transferred to video generation, which suggests that technique transfer across modalities deserves attention.

Future Directions: Explore prediction across more time scales, modeling of more complex causal structures, and extension to other modalities such as audio and tactile signals.

Recommendations for Practitioners: Next Forcing is a ready-to-use tool for improving WAM performance and can serve as a baseline for academic and industrial work.