Zing Forum

LSE-MTP: Multi-Token Prediction with Latent Semantic Enhancement for Building Consistent World Models

The study proposes the LSE-MTP method, which addresses the structural hallucination problem in standard multi-token prediction by anchoring predictions to real hidden state trajectories, effectively bridging the gap between discrete tokens and continuous state representations.

Tags: World Models · Multi-Token Prediction · Latent Semantic Enhancement · Structural Hallucination · Representation Learning · Gradient Inductive Bias · LLM
Published 2026-04-08 01:54 · Recent activity 2026-04-08 11:20 · Estimated read 6 min

Section 01

[Introduction] LSE-MTP: Addressing MTP Structural Hallucination to Build Consistent World Models

The consistency of internal world models in Large Language Models (LLMs) is a core debate in the AI field. Traditional Multi-Token Prediction (MTP) can learn structured representations, but it suffers from structural hallucination: discrete token supervision encourages shortcuts in the latent space that violate environmental constraints. This study proposes Multi-Token Prediction with Latent Semantic Enhancement (LSE-MTP), which bridges the gap between discrete tokens and continuous state representations by anchoring predictions to real hidden-state trajectories, mitigating structural hallucinations and improving the consistency and robustness of learned world models.


Section 02

Background: The Debate on LLM World Models and Evolution of Prediction Paradigms

The Debate on LLM World Models

The academic community is divided on whether LLMs possess true world models: one side argues they are statistical pattern matchers that only learn word correlations; the other believes they form internal models that can reason about world states. The core of the debate is whether internal representations capture the world's structure or merely memorize surface patterns.

From NTP to MTP: Evolution of Prediction Paradigms

Traditional Next-Token Prediction (NTP) focuses on single-step accuracy and struggles to capture long-range structures; Multi-Token Prediction (MTP) predicts multiple future tokens simultaneously, encouraging the learning of structured representations, inducing representation contractivity via gradient coupling, and promoting the convergence of internal beliefs.
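To make the paradigm shift concrete, here is a minimal NumPy sketch. The sizes and the separate-linear-heads layout are illustrative assumptions, not the paper's architecture: several prediction heads read the same hidden state, so the MTP loss is simply a sum of per-step next-token losses.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, HORIZON = 8, 16, 3  # toy sizes, assumed for illustration


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def ntp_loss(h, W, target):
    """Next-token prediction: one linear head, one step ahead."""
    return -np.log(softmax(h @ W)[target])


def mtp_loss(h, heads, targets):
    """Multi-token prediction: HORIZON heads read the *same* hidden state h,
    so every future target's gradient flows back through h."""
    return sum(ntp_loss(h, W, t) for W, t in zip(heads, targets))


h = rng.normal(size=HIDDEN)                          # shared hidden state
heads = [rng.normal(size=(HIDDEN, VOCAB)) for _ in range(HORIZON)]
targets = [1, 4, 2]                                  # the next 3 tokens

total = mtp_loss(h, heads, targets)
```

Because every term in the sum depends on the same `h`, supervision from all future tokens shapes one representation, which is the source of the structured-learning pressure described above.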


Section 03

Advantages of MTP and Concerns About Structural Hallucinations

The gradient inductive bias of MTP brings representation contractivity, mapping similar inputs to similar latent representations, which is beneficial for structured learning. However, standard MTP has structural hallucinations: discrete token supervision encourages shortcuts in the latent space, violating real-world constraints (such as physical laws), leading to vulnerability under out-of-distribution data.
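The gradient coupling behind this inductive bias can be made explicit: because all heads read the same hidden state, the gradient arriving at that state is the sum of the per-horizon gradients. A small sketch under the same toy assumptions (illustrative softmax heads, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, HIDDEN, HORIZON = 6, 8, 3  # toy sizes, assumed for illustration


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def ce_grad_h(h, W, t):
    """Analytic gradient of -log softmax(h @ W)[t] with respect to h."""
    p = softmax(h @ W)
    p[t] -= 1.0          # dL/dlogits = p - onehot(t)
    return W @ p         # chain rule through logits = h @ W


h = rng.normal(size=HIDDEN)
heads = [rng.normal(size=(HIDDEN, VOCAB)) for _ in range(HORIZON)]
targets = [0, 3, 5]

# Gradient coupling: the gradient reaching the shared hidden state is the
# sum of the per-horizon gradients, so all future targets pull on h at once.
coupled = sum(ce_grad_h(h, W, t) for W, t in zip(heads, targets))
```

Inputs whose futures agree receive similar summed pulls, which is one intuition for why MTP maps similar inputs to similar latent representations.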


Section 04

The LSE-MTP Method: A Solution Anchored to Real States

The core of LSE-MTP is to anchor predictions to real hidden state trajectories, using dual supervision: it not only predicts future tokens but also predicts corresponding real-world states (such as physical position and velocity). This mechanism prevents latent representations that violate constraints, bridges the gap between discrete tokens and continuous states, and provides additional training signals to enhance robustness.
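The dual-supervision objective can be sketched as a token cross-entropy plus a state-regression term that anchors the latent representation to the real trajectory. The head layout, the weighting `lam`, and all sizes are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, STATE, HORIZON = 8, 16, 4, 3  # toy sizes, assumed


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def lse_mtp_loss(h, token_heads, state_head, targets, true_states, lam=1.0):
    """Dual supervision: predict the next HORIZON tokens and, in parallel,
    regress the corresponding real-world states (e.g. position, velocity),
    anchoring the latent representation to the true trajectory."""
    token_loss = sum(-np.log(softmax(h @ W)[t])
                     for W, t in zip(token_heads, targets))
    pred = (h @ state_head).reshape(HORIZON, STATE)   # predicted trajectory
    state_loss = np.mean((pred - true_states) ** 2)   # anchor to real states
    return token_loss + lam * state_loss


h = rng.normal(size=HIDDEN)
token_heads = [rng.normal(size=(HIDDEN, VOCAB)) for _ in range(HORIZON)]
state_head = rng.normal(size=(HIDDEN, HORIZON * STATE))
targets = [1, 4, 2]
true_states = rng.normal(size=(HORIZON, STATE))       # e.g. positions/velocities

loss = lse_mtp_loss(h, token_heads, state_head, targets, true_states)
```

The state term only vanishes when the latent trajectory matches the real one, which is how the extra signal penalizes constraint-violating shortcuts while the token term preserves task performance.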


Section 05

Experimental Validation: Effectiveness on Synthetic and Real Tasks

The study validates LSE-MTP on two types of tasks:

  1. Synthetic Graph Traversal: Reduces structural hallucinations, and latent representations better reflect the real topology of the graph;
  2. Manhattan Taxi Trajectory Prediction: Improves prediction accuracy, and its robustness to noise perturbations is significantly better than standard MTP.

Section 06

Core Benefits: Representation Alignment and Robustness Improvement

LSE-MTP achieves representation alignment: latent representations are more consistent with the semantic structure of the real world, which improves interpretability and generalization. It also improves robustness, performing more stably on out-of-distribution data and under perturbations, addressing the vulnerability of standard MTP.


Section 07

Future Implications and Conclusion

Future Research Directions

  • Extend to complex modalities such as vision and audio;
  • Explore efficient acquisition of supervision signals (simulation environments, human feedback);
  • Combine with reinforcement learning and imitation learning;
  • Quantify the quality of world models.

Conclusion

LSE-MTP is an important step toward building trustworthy world models. It emphasizes that supervision signals need to balance task performance and real structure, providing new ideas for training AI that truly understands the world.