
Next Forcing: Multi-Chunk Prediction Framework Accelerates Training and Inference of World Models

Inspired by multi-token prediction in large language models, Next Forcing proposes a multi-chunk prediction framework. By simultaneously predicting multiple future video chunks, it achieves faster convergence, higher accuracy, and a 2x inference speedup, and it attains SOTA performance on the RoboTwin benchmark.

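World Models, Video Generation, Multi-Chunk Prediction, Autoregressive Models, Robot Learning, Physics Simulation, Training Acceleration, Inference Optimization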

Published 2026-06-10 01:59 | Recent activity 2026-06-10 11:53 | Estimated read 6 min


Section 01

Introduction: Core Highlights of the Next Forcing Multi-Chunk Prediction Framework

Next Forcing carries the idea of multi-token prediction from large language models over to world models: it predicts multiple future video chunks simultaneously, which is reported to yield faster convergence, higher accuracy, a 2x inference speedup, and SOTA performance on the RoboTwin benchmark.

Source Information
Original Author/Maintainer: arXiv authors
Source Platform: arXiv
Original Title: Next Forcing: Causal World Modeling with Multi-Chunk Prediction
Original Link: http://arxiv.org/abs/2606.11187v1
Publication Time: 2026-06-09T17:59:22Z

Section 02

Background: Training Dilemmas of World Action Models

Autoregressive video generation is the mainstream paradigm for building World Action Models (WAMs), but it faces two major challenges. First, training converges slowly and accuracy is limited, especially in high-frame-rate scenarios. Second, inference is slow because of iterative denoising. The low training efficiency stems from how the supervision signal is designed: only the current chunk is supervised, so the model receives no explicit guidance from future dynamics. This makes long-range dependencies hard to capture and limits how deeply the model can learn causal relationships in the physical world.

Section 03

Method: Design of Next Forcing's Multi-Chunk Prediction Framework

Inspired by multi-token prediction in LLMs, Next Forcing proposes a Multi-Chunk Prediction (MCP) framework. During training, the model simultaneously predicts multiple future video chunks at different time scales, forming a prediction chain from the near future to the far future. In terms of implementation, a lightweight auxiliary MCP module is added to the main model. The module uses a chain structure (next¹→next²→next³) and reuses the main model's intermediate features to balance efficiency and capability. The design has two advantages: near-future predictions guide far-future ones, so gradients flow along the chain, and the multi-scale temporal supervision makes the training signal denser and more diverse.
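The chain structure described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `chain_predictions` and `mcp_loss`, the use of plain float lists as stand-ins for video chunks, the mean-squared-error loss, and the decaying loss weights are all assumptions made for illustration. The point is only to show how each head in the chain consumes the previous head's output and how every future chunk contributes its own supervision term.

```python
from typing import Callable, List, Sequence

Chunk = List[float]  # toy stand-in for a video chunk (or its latent); an assumption
Head = Callable[[Chunk], Chunk]  # hypothetical lightweight prediction head


def mse(pred: Chunk, target: Chunk) -> float:
    """Mean squared error between two equally sized toy chunks."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def chain_predictions(features: Chunk, heads: Sequence[Head]) -> List[Chunk]:
    """Run the heads as a chain: head k consumes the output of head k-1."""
    predictions: List[Chunk] = []
    state = features  # stands in for the main model's reused intermediate features
    for head in heads:
        state = head(state)  # next^1, then next^2, then next^3, ...
        predictions.append(state)
    return predictions


def mcp_loss(
    features: Chunk,
    future_chunks: Sequence[Chunk],
    heads: Sequence[Head],
    weights: Sequence[float],
) -> float:
    """Weighted sum of per-chunk losses, one term per future chunk."""
    if not (len(heads) == len(future_chunks) == len(weights)):
        raise ValueError("heads, future_chunks and weights must align")
    predictions = chain_predictions(features, heads)
    return sum(
        w * mse(p, t) for w, p, t in zip(weights, predictions, future_chunks)
    )


def double(chunk: Chunk) -> Chunk:
    """Toy head: the assumed dynamics simply double every value."""
    return [2.0 * x for x in chunk]


features = [1.0, 2.0]
# Ground-truth future chunks; the third deliberately deviates from the toy dynamics.
future_chunks = [[2.0, 4.0], [4.0, 8.0], [8.0, 18.0]]
loss = mcp_loss(
    features,
    future_chunks,
    heads=[double, double, double],
    weights=[1.0, 0.5, 0.25],  # assumed decay for farther chunks, not from the paper
)
print(loss)
```

In this toy setup only the third chunk is mispredicted, so the whole loss comes from the far-future term. With supervision on the current chunk alone, that error would produce no training signal at all.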

Section 04

Evidence: Experimental Results on Training Acceleration and Accuracy Improvement

Experiments support the method's effectiveness. At 50 frames per second and after 5000 training steps, performance improves by 93.1% relative to LingBot-VA, and convergence is 2.3x faster. On the RoboTwin benchmark, Next Forcing reaches 94.1% in the Clean setting and 93.5% in the Random setting, both SOTA. It also shows significant improvements on the physical world video generation benchmark (PhyWorld). In general video pre-training, FVD (Fréchet Video Distance) drops by over 50%, with improved generation quality and diversity.

Section 05

Evidence: Implementation of Inference Acceleration and Deployment Value

The MCP module is retained at inference time to achieve a 2x speedup: traditional autoregressive methods require frame-by-frame iterative denoising, whereas Next Forcing predicts the current and next chunks in parallel. This matters for latency-sensitive scenarios such as real-time robot control and autonomous driving decision-making, because it reduces latency without sacrificing quality and removes an obstacle to deploying WAMs.
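A minimal sketch of why predicting two chunks per pass halves the number of sequential passes is given below. The `rollout` function, the `Predictor` signature, and the toy dynamics are hypothetical and are not taken from the paper. The sketch counts passes only, so it matches a 2x wall-clock speedup only under the assumption that a two-chunk pass costs about the same as a one-chunk pass.

```python
from typing import Callable, List, Tuple

Chunk = List[float]  # toy stand-in for a video chunk; an assumption
# Hypothetical predictor: given the history and k, return the next k chunks in one pass.
Predictor = Callable[[List[Chunk], int], List[Chunk]]


def rollout(
    predict: Predictor, first_chunk: Chunk, num_chunks: int, chunks_per_pass: int
) -> Tuple[List[Chunk], int]:
    """Generate num_chunks future chunks; return them and the number of passes used."""
    if chunks_per_pass < 1:
        raise ValueError("chunks_per_pass must be at least 1")
    history = [first_chunk]
    passes = 0
    while len(history) - 1 < num_chunks:
        remaining = num_chunks - (len(history) - 1)
        k = min(chunks_per_pass, remaining)  # do not overshoot the target length
        history.extend(predict(history, k))
        passes += 1
    return history[1:], passes


def toy_predict(history: List[Chunk], k: int) -> List[Chunk]:
    """Toy dynamics: each new chunk adds 1.0 to every value of the previous one."""
    out: List[Chunk] = []
    last = history[-1]
    for _ in range(k):
        last = [x + 1.0 for x in last]
        out.append(last)
    return out


# One chunk per pass (plain autoregressive rollout) versus two chunks per pass.
baseline_chunks, baseline_passes = rollout(toy_predict, [0.0], 8, chunks_per_pass=1)
mcp_chunks, mcp_passes = rollout(toy_predict, [0.0], 8, chunks_per_pass=2)
print(baseline_passes, mcp_passes)
```

Both rollouts produce the same eight chunks in this toy setting, but the two-chunk variant needs half as many sequential passes, which is where the latency saving would come from.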

Section 06

Conclusion and Recommendations: Technical Insights and Future Directions

Technical Insights: Multi-token prediction, an idea from LLMs, has been successfully transferred to video generation, which suggests that this kind of cross-domain technology transfer deserves attention. Future Directions: Explore prediction across more time scales, modeling of complex causal structures, and extension to modalities such as audio and tactile sensing. Recommendations for Practitioners: Next Forcing can be used to improve WAM performance and can serve as a baseline for academic and industrial applications.
