Zing Forum

HERMES++: A Unified Driving World Model Integrating 3D Scene Understanding and Prediction

HERMES++ is presented as the first driving world model to combine 3D scene understanding and future geometric prediction in a single framework. It rests on four designs: a BEV representation, LLM-enhanced world queries, a current-future link, and joint geometric optimization. It is reported to outperform specialized methods on multiple benchmarks.

Autonomous driving, World model, 3D scene understanding, Point cloud prediction, Large language model, BEV representation
Published 2026-05-01 01:59 | Recent activity 2026-05-01 11:22 | Estimated read 6 min
1

Section 01

[Introduction] HERMES++: A Unified Driving World Model Integrating 3D Scene Understanding and Prediction

Autonomous driving systems usually treat 3D semantic scene understanding and future geometric prediction as separate problems, and existing world models tend to favor one over the other. HERMES++ is presented as the first framework to handle both together. It does so through four designs: a BEV representation, LLM-enhanced world queries, a current-future link, and joint geometric optimization. On multiple benchmarks it is reported to outperform specialized methods, giving a driving system both capabilities in one model.

2

Section 02

Background: The Semantic and Physical Gap in Autonomous Driving World Models

World models support path planning and risk prediction in autonomous driving, but existing approaches are lopsided. Most world models focus on generating future scenes and neglect semantic understanding of the current one. LLMs, for their part, reason well but have no built-in sense of how scene geometry evolves over time. This gap between semantic understanding and physical simulation limits overall system performance, because a driving system has to both understand the current scene and anticipate how it will change.

3

Section 03

Method 1: BEV Representation Unifies Spatial Information

HERMES++ builds on a Bird's-Eye View (BEV) representation that fuses spatial information from multiple cameras into a structure an LLM can consume. The representation preserves the geometric relationships in the scene while remaining easy for a language model to process. It also avoids the inconsistent viewpoints and redundant information that affect traditional multi-view fusion, and it provides the shared input for the understanding and prediction tasks that follow.
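
To make "LLM-compatible structure" more concrete, here is a minimal sketch of one way a BEV feature grid could be turned into a token sequence. The summary does not say how HERMES++ does this, so the patch-wise average pooling, the function name `bev_to_tokens`, and the toy grid are all assumptions for illustration only.

```python
def bev_to_tokens(bev, patch):
    """Average-pool a BEV grid (H x W x C nested lists) into patch tokens.

    Returns (H // patch) * (W // patch) tokens, each a list of C floats,
    in row-major order. Pooling choice is an assumption, not the paper's method.
    """
    height, width = len(bev), len(bev[0])
    channels = len(bev[0][0])
    if height % patch or width % patch:
        raise ValueError("grid size must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [0.0] * channels
            # Sum every cell in the patch, channel by channel.
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    for ch in range(channels):
                        token[ch] += bev[row][col][ch]
            tokens.append([value / (patch * patch) for value in token])
    return tokens


# Toy 4x4 grid with 2 channels: channel 0 holds the row index, channel 1 the column index.
bev_grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
tokens = bev_to_tokens(bev_grid, patch=2)
```

With a 4x4 grid and `patch=2`, the result is 4 tokens. Each token is the mean feature of one 2x2 region, so the sequence is short enough for a language model while neighbouring tokens still correspond to neighbouring regions on the ground plane.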

4

Section 04

Method 2: LLM-Enhanced World Query Mechanism

The system uses the LLM's semantic understanding to analyze the current scene: it identifies object categories and spatial relationships and infers intentions. The results are encoded into world queries and injected into the prediction module. This lets the two tasks learn collaboratively, so that geometric prediction rests on an understanding of the scene rather than on blind extrapolation.
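
As a rough picture of how queries could pick up semantic information, the sketch below lets each query attend over a set of LLM hidden states with scaled dot-product attention and adds the result back to the query. Whether HERMES++ uses exactly this mechanism is not stated in the summary; `enrich_world_queries` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enrich_world_queries(queries, llm_states):
    """Each query attends over the LLM states; output = query + attended context.

    Assumption: single-head scaled dot-product attention with a residual add.
    """
    dim = len(queries[0])
    enriched = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, state)) / math.sqrt(dim)
            for state in llm_states
        ]
        weights = softmax(scores)
        context = [
            sum(w * state[i] for w, state in zip(weights, llm_states))
            for i in range(dim)
        ]
        enriched.append([q + c for q, c in zip(query, context)])
    return enriched


llm_states = [[1.0, 0.0], [0.0, 1.0]]      # toy stand-ins for LLM hidden states
world_queries = [[0.0, 0.0], [4.0, 0.0]]   # toy stand-ins for learnable queries
enriched = enrich_world_queries(world_queries, llm_states)
```

The zero query weights both states equally, while the second query pulls mostly from the state it aligns with. The enriched queries would then be handed to the prediction module.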

5

Section 05

Method 3: Current-Future Link Explicitly Models the Temporal Dimension

A current-future link conditions geometric evolution on semantic context, so that predictions are physically plausible and consistent with the scene understanding. For example, the predicted point cloud of a truck that is slowing down should move in a way that matches a deceleration pattern. This conditioning is reported to improve the stability and credibility of the predictions.
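
The decelerating truck can be turned into a toy illustration of what conditioning geometry on semantics means. The sketch below replaces the learned link with closed-form kinematics, so it is not the HERMES++ component; the `SEMANTIC_TO_ACCEL` table, its values, and `predict_future_points` are assumptions made up for this example.

```python
# Hypothetical mapping from a semantic label to an acceleration in m/s^2.
SEMANTIC_TO_ACCEL = {"cruising": 0.0, "decelerating": -2.0}


def predict_future_points(points, speed, semantic_state, horizons):
    """Shift object points along x for each horizon (seconds).

    The semantic label selects the acceleration, so the same current
    geometry yields different futures under different scene readings.
    """
    accel = SEMANTIC_TO_ACCEL[semantic_state]
    futures = []
    for t in horizons:
        # When braking, stop integrating once the object has come to rest.
        t_eff = min(t, speed / -accel) if accel < 0 else t
        shift = speed * t_eff + 0.5 * accel * t_eff ** 2
        futures.append([(x + shift, y, z) for x, y, z in points])
    return futures


truck = [(10.0, 0.0, 1.0), (12.0, 0.0, 1.0)]
cruising = predict_future_points(truck, 10.0, "cruising", [1.0, 2.0, 3.0])
braking = predict_future_points(truck, 10.0, "decelerating", [1.0, 2.0, 3.0])
```

After 3 seconds the cruising truck's first point sits at x = 40.0, while the braking truck's sits at x = 31.0. A predictor that ignores the semantic label would produce the same future in both cases.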

6

Section 06

Method 4: Joint Geometric Optimization Enhances Consistency

A joint geometric optimization strategy combines explicit geometric constraints (coplanarity, parallelism, and so on) with implicit regularization of the latent space (smoothness). Together these align the model's internal representations with geometric priors, so that the generated future scenes obey physical laws and remain visually coherent.
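
A joint objective of this kind can be sketched as a weighted sum of one explicit term and one latent term. The summary does not give the actual loss terms or weights, so the plane-distance coplanarity penalty, the consecutive-difference smoothness penalty, and the weights `w_geo` and `w_latent` below are illustrative assumptions.

```python
def coplanarity_loss(points, normal, offset):
    """Mean squared distance of points to the plane normal . p + offset = 0.

    The normal is assumed to have unit length.
    """
    total = 0.0
    for point in points:
        distance = sum(n * p for n, p in zip(normal, point)) + offset
        total += distance ** 2
    return total / len(points)


def latent_smoothness_loss(latents):
    """Mean squared difference between consecutive latent vectors."""
    diffs = [
        sum((b - a) ** 2 for a, b in zip(prev, nxt))
        for prev, nxt in zip(latents, latents[1:])
    ]
    return sum(diffs) / len(diffs)


def joint_loss(points, normal, offset, latents, w_geo=1.0, w_latent=0.1):
    """Weighted sum of the explicit and implicit terms (weights are assumptions)."""
    geo = coplanarity_loss(points, normal, offset)
    smooth = latent_smoothness_loss(latents)
    return w_geo * geo + w_latent * smooth


ground_points = [(0.0, 0.0, 0.1), (1.0, 0.0, -0.1), (0.0, 1.0, 0.0)]
latents = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
loss = joint_loss(ground_points, (0.0, 0.0, 1.0), 0.0, latents)
```

Here the explicit term penalizes predicted ground points that drift off the plane z = 0, and the latent term penalizes abrupt jumps between consecutive latent states. Minimizing both at once is what makes the optimization joint.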

7

Section 07

Experimental Verification: Performance Exceeding Specialized Methods

According to the reported experiments, HERMES++ outperforms specialized methods on future point cloud prediction and also exceeds understanding-focused methods on 3D scene understanding tasks. It additionally shows strong cross-task transfer and generalization. The results suggest that the unified framework does not sacrifice understanding ability, and that the prediction task actually helps improve it.
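
The summary does not name the metric behind the point cloud results. Chamfer distance is a common choice for comparing a predicted point cloud with a ground-truth one, so the sketch below assumes it. Several variants exist; this one sums the two directional means of nearest-neighbour Euclidean distances, and the function name and toy clouds are hypothetical.

```python
import math


def chamfer_distance(predicted, target):
    """Symmetric Chamfer distance between two lists of 3D points.

    Variant assumed here: mean nearest-neighbour distance from predicted
    to target, plus the same mean from target to predicted.
    """
    def one_way(source, reference):
        return sum(
            min(math.dist(p, q) for q in reference) for p in source
        ) / len(source)

    return one_way(predicted, target) + one_way(target, predicted)


predicted_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
score = chamfer_distance(predicted_cloud, target_cloud)
```

Lower is better: identical clouds score 0, and the score grows as predicted points drift away from the nearest ground-truth points. This brute-force version is quadratic in the number of points, which is fine for a toy example but not for full LiDAR sweeps.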

8

Section 08

Conclusion and Outlook: Technical Significance and Industry Impact of the Unified World Model

HERMES++ marks a new stage for driving world models by showing that semantic understanding and geometric prediction can reinforce each other. For industry, a unified model of this kind could mean more efficient systems with lower deployment and maintenance costs and better robustness in complex scenarios. The methodology could also extend to robotic manipulation, VR/AR, and other fields. The team has open-sourced the model code to support further community work on autonomous driving.