HALO: A Multimodal Embodied Intelligence Model That Teaches Robots to "Think Before Acting"

HALO is a unified Vision-Language-Action (VLA) model that performs embodied multimodal chain-of-thought (EM-CoT) reasoning along a "Think-Imagine-Execute" cognitive pathway. It adopts a Mixture-of-Transformers (MoT) architecture and clearly outperforms existing baselines on the RoboTwin 2.0 benchmark.

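Embodied Intelligence, Vision-Language-Action Model, Chain-of-Thought Reasoning, Robot Learning, Multimodal Learning, Mixture-of-Transformers, ICML 2026
Published 2026-05-08 06:14 | Recent activity 2026-05-08 10:11 | Estimated read 6 min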

Section 01

Introduction: HALO, a Multimodal Embodied Intelligence Model That Teaches Robots to "Think Before Acting"

HALO is a unified Vision-Language-Action (VLA) model that performs embodied multimodal chain-of-thought (EM-CoT) reasoning along a "Think-Imagine-Execute" cognitive pathway. It is built on a Mixture-of-Transformers (MoT) architecture and, on the RoboTwin 2.0 benchmark, reaches 80.5% success on Easy tasks and 26.4% on Hard tasks, against 46.4% and 16.3% for the π₀ baseline. Keywords: Embodied Intelligence, Vision-Language-Action Model, Chain-of-Thought Reasoning, Robot Learning, Multimodal Learning, Mixture-of-Transformers, ICML 2026.

Section 02

Background: Reasoning Gap and Challenges in Embodied Intelligence

Current Vision-Language-Action (VLA) models have made significant progress on robot control tasks, but most of them map perceptual inputs directly to motion commands and lack human-like deliberate reasoning. On complex multi-step tasks this makes them prone to error accumulation and poor generalization. Humans, by contrast, follow a "Think-Plan-Execute" cognitive pathway when carrying out complex tasks; giving robots a comparable capability remains an important challenge in embodied intelligence.

Section 03

Methodology: HALO's Multimodal Chain-of-Thought Framework and Technical Innovations

HALO follows a three-stage "Think-Imagine-Execute" cognitive pathway, which together forms its embodied multimodal chain-of-thought (EM-CoT):

1. Think: generates a textual reasoning trace and a subtask plan.
2. Imagine: predicts a visual subgoal image.
3. Execute: generates an action sequence conditioned on the EM-CoT context produced by the first two stages.

The core architectural innovation is the Mixture-of-Transformers (MoT) design. It contains three expert modules (multimodal understanding, visual generation, and action prediction) that share a self-attention stack but keep independent feed-forward networks.

The training strategy has two parts: an automatic EM-CoT data synthesis pipeline (action primitive extraction, VLM annotation, subgoal selection) and two-stage training (general pre-training followed by EM-CoT-enhanced fine-tuning).
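The "shared self-attention, independent feed-forward networks" idea can be illustrated with a minimal NumPy sketch. This is not HALO's implementation: the class name `MoTLayer`, the expert labels, the single attention head, the absence of masking and normalization, and all sizes are assumptions chosen to keep the example short. It only shows how every token attends over the full mixed sequence while each token's feed-forward network is selected by the expert it belongs to.

```python
import numpy as np

# Hypothetical labels for the three experts named in the text.
UNDERSTAND, VISUAL_GEN, ACTION = 0, 1, 2


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MoTLayer:
    """One simplified layer: shared attention, one feed-forward net per expert."""

    def __init__(self, d_model, d_ff, n_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        # Attention projections are shared by all tokens (assumed single head).
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(0.0, s, (d_model, d_model)) for _ in range(4)
        )
        # Each expert owns an independent two-layer feed-forward network.
        self.ffn = [
            (rng.normal(0.0, s, (d_model, d_ff)),
             rng.normal(0.0, 1.0 / np.sqrt(d_ff), (d_ff, d_model)))
            for _ in range(n_experts)
        ]

    def __call__(self, x, expert_ids):
        # x: (seq_len, d_model); expert_ids: (seq_len,) integers in [0, n_experts)
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        scores = q @ k.T / np.sqrt(x.shape[-1])
        h = x + softmax(scores) @ v @ self.wo  # shared attention + residual
        out = h.copy()
        for e, (w1, w2) in enumerate(self.ffn):
            sel = expert_ids == e
            if sel.any():
                # Route only this expert's tokens through its own FFN (ReLU).
                out[sel] = h[sel] + np.maximum(h[sel] @ w1, 0.0) @ w2
        return out


layer = MoTLayer(d_model=8, d_ff=16)
tokens = np.random.default_rng(1).normal(size=(5, 8))
ids = np.array([UNDERSTAND, UNDERSTAND, VISUAL_GEN, VISUAL_GEN, ACTION])
hidden = layer(tokens, ids)  # shape (5, 8)
```

Because attention runs over the whole sequence, action tokens can read the text reasoning and subgoal tokens that precede them, while the per-expert feed-forward weights keep the three capabilities from competing for the same parameters.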

Section 04

Experimental Evidence: Excellent Performance on the RoboTwin 2.0 Benchmark

HALO achieved the highest success rates among the compared methods on the RoboTwin 2.0 benchmark (50 manipulation tasks, 100 evaluations per task):

Method                   Easy Success Rate   Hard Success Rate
Diffusion Policy         28.0%               0.6%
RDT-1B                   34.5%               13.7%
π₀                       46.4%               16.3%
HALO (without EM-CoT)    75.3%               21.2%
HALO (full EM-CoT)       80.5%               26.4%
Key findings: Compared to the π₀ baseline, HALO raises the success rate by 34.1 percentage points on Easy tasks (46.4% to 80.5%) and by 10.1 points on Hard tasks (16.3% to 26.4%). The variant without EM-CoT still beats π₀, the strongest baseline, by 28.9 points on Easy tasks, and adding EM-CoT contributes a further 5.2 points on both Easy and Hard tasks. Ablation studies showed that text and visual reasoning provide independent and additive gains, and that all pre-training sources are valuable.
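The reported gains are differences in success rate, so they are percentage points rather than relative improvements. The short check below recomputes them from the table; the dictionary layout and the helper name `gain` are only illustrative.

```python
# Success rates (Easy, Hard) in percent, copied from the results table.
results = {
    "Diffusion Policy": (28.0, 0.6),
    "RDT-1B": (34.5, 13.7),
    "pi0": (46.4, 16.3),
    "HALO (without EM-CoT)": (75.3, 21.2),
    "HALO (full EM-CoT)": (80.5, 26.4),
}


def gain(method, baseline):
    """Return the (Easy, Hard) gain of `method` over `baseline` in percentage points."""
    easy, hard = results[method]
    base_easy, base_hard = results[baseline]
    return round(easy - base_easy, 1), round(hard - base_hard, 1)


full_vs_pi0 = gain("HALO (full EM-CoT)", "pi0")                     # (34.1, 10.1)
no_cot_vs_pi0 = gain("HALO (without EM-CoT)", "pi0")                # (28.9, 4.9)
cot_effect = gain("HALO (full EM-CoT)", "HALO (without EM-CoT)")    # (5.2, 5.2)
```

The second result also shows that, without EM-CoT, the lead over π₀ on Hard tasks is 4.9 points, so roughly half of the Hard-task gain comes from the EM-CoT reasoning itself.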

Section 05

Open-Source Resources: Accessibility of the HALO Project

The HALO project is fully open-source. It provides pre-trained weights (EMA checkpoints on HuggingFace), fine-tuned weights (the full EM-CoT model), datasets (pre-training data on ModelScope and unannotated RoboTwin data), code (training, inference, and evaluation), and the paper (accepted by ICML 2026, with an arXiv preprint available). The project is released under the Apache-2.0 license, is built on Python and PyTorch, and supports FSDP distributed training and saving of exponential moving average (EMA) checkpoints.
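For readers unfamiliar with EMA checkpoints: an exponential moving average keeps a smoothed copy of the model parameters alongside the ones being trained, and that smoothed copy is what gets saved. The sketch below shows the generic update rule on plain Python floats; the function name `ema_update` and the decay value are assumptions for illustration and are not taken from the HALO codebase.

```python
def ema_update(ema_params, params, decay=0.999):
    """Blend the current parameters into the running average, in place."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy example: one scalar "weight" that changes at every training step.
params = {"w": 0.0}
ema = dict(params)  # the EMA copy starts from the initial weights
for step in range(1, 4):
    params["w"] = float(step)  # stand-in for an optimizer update
    ema_update(ema, params, decay=0.9)
```

After three steps the raw weight is 3.0 while the averaged copy lags behind it, which is the smoothing effect that makes EMA weights a common choice for released checkpoints.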

Section 06

Conclusion and Outlook: The Significance of HALO for the Field of Embodied Intelligence

HALO demonstrates the effectiveness of integrating human-like cognitive pathways (Think-Imagine-Execute) into VLA models. The MoT architecture provides new ideas for integrating multimodal heterogeneous capabilities, and the automatic EM-CoT data synthesis pipeline offers a feasible solution for large-scale training. For researchers, HALO is a powerful baseline and scalable framework; future directions can explore more complex reasoning patterns, richer modal integration, and broader robot application scenarios.