# HALO: A Multimodal Embodied Intelligence Model That Teaches Robots to "Think Before Acting"

> HALO is a unified Vision-Language-Action (VLA) model that performs embodied multimodal chain-of-thought (EM-CoT) reasoning along a "Think-Imagine-Execute" cognitive pathway. It adopts a Mixture-of-Transformers (MoT) architecture and, on the RoboTwin 2.0 benchmark, reaches an 80.5% success rate on Easy and 26.4% on Hard, ahead of every baseline in the results table.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-07T22:14:58.000Z
- Last activity: 2026-05-08T02:11:55.190Z
- Popularity: 136.1
- Keywords: Embodied Intelligence, Vision-Language-Action Model, Chain-of-Thought Reasoning, Robot Learning, Multimodal Learning, Mixture-of-Transformers (MoT), ICML 2026
- Page link: https://www.zingnex.cn/en/forum/thread/halo
- Canonical: https://www.zingnex.cn/forum/thread/halo

---

## Background: Reasoning Gap and Challenges in Embodied Intelligence

Vision-Language-Action (VLA) models have made significant progress on robot control tasks, but most map perceptual inputs directly to motion commands and lack human-like deliberate reasoning. On complex multi-step tasks they are prone to error accumulation and generalize poorly. Humans, by contrast, follow a "Think-Plan-Execute" pathway when carrying out complex tasks; giving robots a similar capability is an important open challenge in embodied intelligence.

## Methodology: HALO's Multimodal Chain-of-Thought Framework and Technical Innovations

HALO follows a three-stage "Think-Imagine-Execute" cognitive pathway, which together forms its embodied multimodal chain-of-thought (EM-CoT):

1. Think: generate a textual reasoning trace and a subtask plan.
2. Imagine: predict a visual subgoal image.
3. Execute: generate an action sequence conditioned on the EM-CoT context.

The core architectural innovation is a Mixture-of-Transformers (MoT) design with three expert modules: multimodal understanding, visual generation, and action prediction. The experts share a self-attention stack, but each keeps its own feed-forward network.

The training strategy combines an automatic EM-CoT data synthesis pipeline (action primitive extraction, VLM annotation, and subgoal selection) with two-stage training: general pre-training followed by EM-CoT-enhanced fine-tuning.
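To make the "shared self-attention, independent feed-forward networks" idea concrete, here is a minimal sketch in plain Python. It is not HALO's implementation: the function names (`shared_self_attention`, `make_ffn`, `mot_block`), the tiny dimensions, the identity Q/K/V projections, the random weights, and the omission of layer normalization and multi-head attention are all simplifying assumptions. Only the three expert names and the sharing pattern come from the description above.

```python
import math
import random

# Expert names follow the text; everything else in this sketch is hypothetical.
EXPERTS = ("understanding", "visual_generation", "action")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def shared_self_attention(tokens):
    """Single-head scaled dot-product attention over ALL tokens.

    Every token attends to every other token regardless of which expert
    owns it. Q, K and V projections are the identity to keep this short.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in tokens:
        weights = softmax(
            [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        )
        mixed.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return mixed


def make_ffn(dim, hidden, rng):
    """Build one expert's private two-layer ReLU feed-forward network."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def ffn(x):
        activated = [max(0.0, v) for v in matvec(w_in, x)]
        return matvec(w_out, activated)

    return ffn


def mot_block(tokens, expert_ids, ffns):
    """One block: shared attention, then a per-token expert-specific FFN."""
    attended = shared_self_attention(tokens)
    # Residual connection around the shared attention.
    hidden = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    outputs = []
    for vec, expert in zip(hidden, expert_ids):
        update = ffns[expert](vec)  # route to the FFN of the token's expert
        outputs.append([h + u for h, u in zip(vec, update)])
    return outputs


rng = random.Random(0)
DIM = 4
ffns = {name: make_ffn(DIM, 8, rng) for name in EXPERTS}
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(5)]
expert_ids = ["understanding", "understanding", "visual_generation", "action", "action"]
outputs = mot_block(tokens, expert_ids, ffns)
```

In this sketch, the attention step lets text, image, and action tokens exchange information, while the expert assignment only decides which feed-forward weights transform each token afterwards. Reassigning one token to a different expert therefore leaves the outputs of all other tokens unchanged.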

## Experimental Evidence: Excellent Performance on the RoboTwin 2.0 Benchmark

On the RoboTwin 2.0 benchmark (50 manipulation tasks, 100 evaluations per task), HALO outperforms all listed baselines:

| Method | Easy Success Rate | Hard Success Rate |
|---|---|---|
| Diffusion Policy | 28.0% | 0.6% |
| RDT-1B | 34.5% | 13.7% |
| π₀ | 46.4% | 16.3% |
| HALO (without EM-CoT) | 75.3% | 21.2% |
| HALO (full EM-CoT) | 80.5% | 26.4% |

Key findings (all gains are absolute, in percentage points):

- Compared with π₀, the strongest baseline, full HALO improves the success rate by 34.1 points on Easy and 10.1 points on Hard.
- Even without EM-CoT, HALO beats π₀ by 28.9 points on Easy and 4.9 points on Hard.
- EM-CoT adds a further 5.2 points on both Easy and Hard.
- Ablation studies show that textual and visual reasoning provide independent, additive gains, and that every pre-training data source contributes.
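The reported gains are differences between success rates, so they are percentage points rather than relative percentages. The short script below recomputes them from the table values. The dictionary layout and the helper names `gain_points` and `relative_gain` are illustrative choices, and `pi0` is used as an ASCII key for π₀.

```python
# (Easy, Hard) success rates in percent, copied from the results table.
results = {
    "Diffusion Policy": (28.0, 0.6),
    "RDT-1B": (34.5, 13.7),
    "pi0": (46.4, 16.3),
    "HALO (without EM-CoT)": (75.3, 21.2),
    "HALO (full EM-CoT)": (80.5, 26.4),
}


def gain_points(method, reference):
    """Absolute gain over `reference` in percentage points, as (easy, hard)."""
    return tuple(
        round(m - r, 1) for m, r in zip(results[method], results[reference])
    )


def relative_gain(method, reference):
    """Relative gain over `reference` in percent, as (easy, hard)."""
    return tuple(
        round(100.0 * (m - r) / r, 1)
        for m, r in zip(results[method], results[reference])
    )


print(gain_points("HALO (full EM-CoT)", "pi0"))                    # (34.1, 10.1)
print(gain_points("HALO (without EM-CoT)", "pi0"))                 # (28.9, 4.9)
print(gain_points("HALO (full EM-CoT)", "HALO (without EM-CoT)"))  # (5.2, 5.2)
```

Calling `relative_gain("HALO (full EM-CoT)", "pi0")` shows why the distinction matters: a 34.1-point absolute gain on Easy is a much larger number when expressed relative to the 46.4% baseline.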

## Open-Source Resources: Accessibility of the HALO Project

The HALO project is fully open-source and provides:

- Pre-trained weights: EMA checkpoints on HuggingFace.
- Fine-tuned weights: the full EM-CoT model.
- Datasets: pre-training data on ModelScope and unannotated RoboTwin data.
- Code: training, inference, and evaluation implementations.
- Paper: accepted by ICML 2026, with an arXiv preprint available.

The project is released under the Apache-2.0 license, is built on Python and PyTorch, and supports FSDP distributed training and saving of EMA checkpoints.
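For readers unfamiliar with EMA checkpoints: an exponential moving average (EMA) of the weights is a smoothed copy of the model that is updated after each training step and saved alongside, or instead of, the raw weights. The sketch below shows the standard update rule on plain Python floats. It does not reflect HALO's actual training code: the function name `ema_update`, the dictionary-of-floats representation, and the decay values are assumptions for illustration.

```python
def ema_update(ema_params, params, decay=0.999):
    """Update `ema_params` in place: ema = decay * ema + (1 - decay) * param.

    Both arguments map parameter names to floats; in a real training loop
    these would be tensors. The default decay is an arbitrary example value.
    """
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy training loop: the raw weight jumps by 1.0 each step,
# while the EMA copy trails behind it smoothly.
weights = {"w": 0.0}
ema = dict(weights)  # the EMA copy starts from the initial weights
for step in range(3):
    weights["w"] += 1.0  # stand-in for an optimizer step
    ema_update(ema, weights, decay=0.9)
```

A larger decay makes the averaged weights change more slowly, which is why EMA checkpoints are often preferred for evaluation over the last raw training step.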

## Conclusion and Outlook: The Significance of HALO for the Field of Embodied Intelligence

HALO demonstrates that integrating a human-like cognitive pathway (Think-Imagine-Execute) into a VLA model is effective. The MoT architecture offers a way to combine heterogeneous multimodal capabilities in one model, and the automatic EM-CoT data synthesis pipeline makes large-scale training feasible. For researchers, HALO serves as a strong baseline and an extensible framework; future work could explore more complex reasoning patterns, richer modality integration, and broader robot application scenarios.
