Section 01
[Introduction] HALO: A Multimodal Embodied Intelligence Model That Teaches Robots to "Think Before Acting"
HALO is a unified Vision-Language-Action (VLA) model that performs embodied multimodal chain-of-thought reasoning along a "Think-Imagine-Execute" cognitive pathway. It adopts a hybrid Transformer architecture and significantly outperforms existing baselines on the RoboTwin 2.0 benchmark.

Keywords: Embodied Intelligence, Vision-Language-Action Model, Chain-of-Thought Reasoning, Robot Learning, Multimodal Learning, Hybrid Transformer, ICML 2026.