MolmoAct2: An Open Visual-Language-Action Reasoning Model for Real-World Deployment

MolmoAct2 surpasses strong baselines like Pi-05 on 7 simulation and real-world benchmarks through its dedicated VLM backbone MolmoER, open-source action tokenizer OpenFAST, flow-matching action experts, and adaptive deep reasoning MolmoThink.

Tags: VLA model · robotics · vision-language-action · embodied AI · open source · flow matching · bimanual manipulation · embodied intelligence · robot learning
Published 2026-05-05 01:51 · Recent activity 2026-05-05 11:50 · Estimated read: 6 min

Section 01

Introduction: MolmoAct2—A Breakthrough in Real-World Deployment of Open VLA Models

MolmoAct2 is a fully open-source Visual-Language-Action (VLA) model developed by the Allen AI team, designed specifically for real-world deployment. Through five core innovations (MolmoER backbone network, three new datasets, OpenFAST action tokenizer, flow-matching continuous action expert architecture, and MolmoThink adaptive reasoning), it outperforms strong baselines like Pi-05 on 7 simulation and real-world benchmarks, providing an open and scalable research platform for the robotics field.


Section 02

Background: The Dilemma of VLA Models from Lab to Real World

Current VLA systems face four major challenges:

  • Closedness: Most cutting-edge models are closed-source, making customization and optimization impossible
  • Hardware dependency: Open-source solutions are tied to expensive specialized hardware
  • Latency issues: Enhancing reasoning capabilities sacrifices real-time performance
  • Success rate bottleneck: even after fine-tuning, models still struggle to meet the reliability threshold for deployment

These pain points restrict the adoption of VLA technology in industrial and service robotics.

Section 03

Core Methods: Five Key Innovations of MolmoAct2

  1. MolmoER Backbone: Optimized for spatial and embodied reasoning, using the "Specialize-then-Rehearse" training strategy, outperforming closed-source models like GPT-5
  2. Three New Datasets: Covering bimanual operations (MolmoAct2-BimanualYAM), high-signal-to-noise Franka subsets, and low-cost platform SO100/101 subsets
  3. OpenFAST Tokenizer: An open-source action discretization tool that breaks the limitations of closed-source/platform binding
  4. Flow-Matching Architecture: Integrates discrete token VLM with continuous action experts to achieve fine-grained control
  5. MolmoThink Reasoning: adaptively re-reasons only over regions of the scene that have changed, reducing inference latency
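The summary does not spell out how OpenFAST discretizes actions. As a point of reference, frequency-space action tokenizers in the FAST lineage compress an action chunk with a DCT, then quantize the coefficients into discrete tokens. The sketch below illustrates that general idea only; the function names, the `scale` parameter, and the omission of the BPE stage are assumptions, not the actual OpenFAST API:

```python
import numpy as np
from scipy.fft import dct, idct

def tokenize_chunk(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Discretize a (T, D) action chunk: per-dimension DCT along time,
    then round scaled coefficients to integers (the discrete 'tokens').
    Real tokenizers typically add a BPE pass over these integers."""
    coeffs = dct(actions, axis=0, norm="ortho")      # frequency-space compression
    return np.rint(coeffs * scale).astype(np.int64)  # quantize to integer tokens

def detokenize_chunk(tokens: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Invert the quantization and the DCT to recover continuous actions."""
    return idct(tokens.astype(np.float64) / scale, axis=0, norm="ortho")

rng = np.random.default_rng(0)
chunk = rng.normal(size=(16, 7))       # 16 timesteps of a 7-DoF action
tokens = tokenize_chunk(chunk)
recovered = detokenize_chunk(tokens)
err = np.max(np.abs(recovered - chunk))  # small, bounded by quantization step
```

The appeal of a frequency-space scheme is that smooth trajectories concentrate energy in a few low-frequency coefficients, so short token sequences can represent long action chunks.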

Section 04

Experimental Evidence: Comprehensive Evaluation and Open-Source Commitment

  • Evaluation covers 7 simulation/real-world benchmarks, outperforming the strong baseline Pi-05
  • MolmoER outperforms GPT-5 and Gemini Robotics ER-1.5 on 13 embodied reasoning benchmarks
  • Cross-platform generalization: Adapts from Franka to low-cost SO100/101
  • Fully open-source: Model weights, training code, and data are all publicly available

Section 05

Technical Details: Training Strategy and Architecture Integration

  • Training Strategy: Specialize-then-Rehearse (first specialize in training robot tasks, then rehearse general data to avoid overfitting)
  • Flow-Matching Integration: Seamlessly connects continuous action experts with discrete token VLM through KV cache conditioning
  • Adaptive Reasoning: Scene change detection + sparse updates, maintaining accuracy while reducing latency
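The summary gives no equations for the flow-matching objective. For orientation, the standard conditional flow-matching recipe samples a noise point x0, an action target x1, and a time t, builds the linear interpolation x_t, and regresses the constant velocity x1 − x0. The sketch below shows only this generic target construction under those assumptions; the real action expert would be a network conditioned on the VLM's KV cache, which is stubbed out here:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(actions: np.ndarray, rng):
    """Build one flow-matching training example per row: sample noise x0
    and time t, form the linear path x_t = (1 - t) * x0 + t * x1, and
    return the regression target, the constant velocity x1 - x0."""
    x1 = actions
    x0 = rng.normal(size=x1.shape)          # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))  # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1            # point on the probability path
    v_target = x1 - x0                      # velocity the expert must predict
    return xt, t, v_target

actions = rng.normal(size=(32, 7))          # batch of 7-DoF action targets
xt, t, v_target = flow_matching_targets(actions, rng)

# The action expert v_theta(xt, t, context) would regress v_target; with
# KV-cache conditioning, 'context' is the VLM's cached keys/values. A zero
# predictor stands in here, so the loss is the mean squared target norm.
loss = np.mean((np.zeros_like(v_target) - v_target) ** 2)
```

At inference, actions are generated by integrating the learned velocity field from noise to t = 1 (e.g. a few Euler steps), which is what allows continuous, fine-grained control on top of a discrete-token VLM.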

Section 06

Industry Significance: Lowering Barriers and Promoting Standardization

  • Lower research barriers: No need to rely on closed-source APIs or specialized hardware
  • Promote standardization: Tools like OpenFAST establish community reuse standards
  • Accelerate industrial adoption: supports platforms across cost tiers, from Franka arms to low-cost SO100/101, paving the way for commercial deployment

Section 07

Limitations and Future: Unresolved Challenges and Directions

Limitations:

  • The scale of bimanual datasets is still smaller than industrial-grade
  • Simulation-to-real transfer is not fully resolved
  • Performance on long-range complex tasks needs verification

Future Directions:

  • Expand data scale and diversity
  • Strengthen simulation-to-real transfer technology
  • Improve long-range task planning capabilities

Section 08

Conclusion: A Milestone for Open-Source VLA Models

MolmoAct2 marks an important milestone in the development of open-source VLA models: it not only outperforms strong baselines but also provides a fully open research platform. Its openness should accelerate progress in robot learning and embodied intelligence and supply key infrastructure for industrial deployment.