# ActionJEPA: A Vision-Language-Action Robot Learning System Based on the JEPA World Model

> ActionJEPA is a master's thesis project in Artificial Intelligence and Robotics at the University of Rome, combining the JEPA world model with the Vision-Language-Action (VLA) framework for learning and reasoning in robotic manipulation tasks.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-21T12:11:10.000Z
- Last activity: 2026-05-21T12:20:45.748Z
- Popularity: 148.8
- Keywords: JEPA, VLA, robot learning, world model, imitation learning, vision-language-action, Meta AI
- Page link: https://www.zingnex.cn/en/forum/thread/actionjepa-jepa
- Canonical: https://www.zingnex.cn/forum/thread/actionjepa-jepa
- Markdown source: floors_fallback

---

## Introduction to the ActionJEPA Project

ActionJEPA is a master's thesis project in Artificial Intelligence and Robotics at the University of Rome. It combines the JEPA (Joint Embedding Predictive Architecture) world model proposed by Meta with the Vision-Language-Action (VLA) framework. The goal is to make robots learn manipulation tasks more efficiently and generalize better, addressing two weaknesses of traditional imitation learning: its high data demand and its difficulty with out-of-distribution task changes.

## Project Background and Research Motivation

Robot learning has long faced a core challenge: how to let robots learn complex manipulation skills efficiently from limited demonstration data and then generalize to new scenarios. Traditional imitation learning needs large amounts of data and struggles when a task falls outside the training distribution. Combining world models with VLA frameworks is a recent approach to this problem. ActionJEPA is one example of work in this direction, developed by a master's student in the AI and Robotics program at the University of Rome as the core result of their thesis.

## Integration of the JEPA World Model and VLA Framework

JEPA is a world model architecture proposed by Yann LeCun's team. It makes predictions in a learned representation space rather than in pixel space, so the model does not have to reconstruct every visual detail. The advantages cited for this design are computational efficiency, generalization, and robustness. The VLA framework unifies visual perception, language understanding, and action execution in one end-to-end model, so that a robot can carry out physical operations from language instructions. ActionJEPA's contribution is to integrate the two: JEPA learns a model of the environment's dynamics, and the VLA framework generates language-conditioned actions, which lets the system predict future states and plan action sequences.
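As a rough illustration of what "predicting in representation space" means, an action-conditioned JEPA-style objective is often written along the following lines. The symbols are assumptions introduced here for illustration and are not taken from the ActionJEPA code: $f_\theta$ is a context encoder, $f_{\bar\theta}$ a target encoder (commonly a slowly updated copy of $f_\theta$), $g_\phi$ a predictor, $o_t$ an observation, $a_t$ an action, and $\operatorname{sg}$ a stop-gradient.

$$
\mathcal{L}(\theta, \phi) = \left\| \, g_\phi\big(f_\theta(o_t),\, a_t\big) - \operatorname{sg}\big(f_{\bar\theta}(o_{t+1})\big) \right\|_2^2
$$

The error is measured between embeddings, never between images. Some JEPA variants use an L1 distance instead of the squared L2 norm shown here; which one ActionJEPA uses is not stated in the source.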

## Technical Implementation and System Architecture

ActionJEPA is built on the LIBERO benchmark suite, an evaluation framework for robotic manipulation learning. The core technology stack consists of Meta's open-source JEPA world model, the LIBERO benchmark, and Hugging Face Transformers. The project also resolves a weight-loading incompatibility with PyTorch 2.6 and later: starting with that release, `torch.load` defaults to `weights_only=True`, which rejects checkpoints that contain arbitrary pickled Python objects. The fix is to pass `weights_only=False` explicitly in the affected `torch.load` calls, and the project provides a fix script that applies this change.
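The project's actual fix script is not reproduced in this post. The following is a minimal sketch of how such a patch could work, assuming the affected code calls `torch.load(...)` directly by that name. The function name `add_weights_only_false` is hypothetical, and the approach (plain text rewriting with parenthesis matching) is an illustrative choice rather than the author's method.

```python
import re

# Finds the start of each torch.load call; the argument list is located
# by counting brackets, because arguments may contain nested calls.
_CALL = re.compile(r"\btorch\.load\(")


def add_weights_only_false(source: str) -> str:
    """Add weights_only=False to torch.load calls that do not set it.

    Hypothetical helper for illustration. Limitation: brackets inside
    string literals are not handled, so review the output before use.
    """
    out = []
    pos = 0  # start of the text not yet copied to `out`
    for match in _CALL.finditer(source):
        start = match.end()  # index just past the opening parenthesis
        if start < pos:
            continue  # nested inside a call that was already handled
        depth, i = 1, start
        while i < len(source) and depth:
            if source[i] in "([{":
                depth += 1
            elif source[i] in ")]}":
                depth -= 1
            i += 1
        if depth:
            break  # unbalanced brackets: leave the rest untouched
        close = i - 1  # index of the matching closing parenthesis
        args = source[start:close]
        out.append(source[pos:close])
        if "weights_only" not in args:
            stripped = args.rstrip()
            if not stripped:
                out.append("weights_only=False")
            elif stripped.endswith(","):
                out.append(" weights_only=False")
            else:
                out.append(", weights_only=False")
        pos = close
    out.append(source[pos:])
    return "".join(out)


if __name__ == "__main__":
    before = 'ckpt = torch.load(path, map_location="cpu")\n'
    print(add_weights_only_false(before), end="")
```

Calls that already set `weights_only` are left unchanged, so an explicit choice is never overridden. Because `weights_only=False` allows arbitrary objects to be unpickled, it is only appropriate for checkpoints from a trusted source.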

## Dataset and Training Process

ActionJEPA is trained on the LIBERO dataset, which is split into several subsets (about 100GB in total):

- libero_10: 13.7GB, 10 basic tasks
- libero_90: 66.7GB, 90 diverse tasks
- libero_goal: 6.37GB
- libero_object: 7.44GB
- libero_spatial: 6.24GB

Two download methods are supported: a download script and the Hugging Face Hub.

Training loads pre-trained visual and language encoders and fine-tunes them on LIBERO data, while JEPA is trained to learn the environment's dynamics. At inference time, the system receives a language instruction and visual observations; the VLA component generates actions, and JEPA predicts future states to support planning.
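The subset sizes quoted above can be checked with a few lines of Python, and they sum to about 100.45GB. The sketch below also shows one way a single subset might be fetched from the Hugging Face Hub using `huggingface_hub.snapshot_download`. The repository id is deliberately left as a parameter because the post does not name it, and the one-folder-per-subset layout implied by `allow_patterns` is an assumption. The helper names `total_size_gb` and `download_subset` are hypothetical.

```python
# Subset sizes in GB, as listed in the text above.
LIBERO_SUBSETS_GB = {
    "libero_10": 13.7,
    "libero_90": 66.7,
    "libero_goal": 6.37,
    "libero_object": 7.44,
    "libero_spatial": 6.24,
}


def total_size_gb(subsets=None):
    """Sum the listed sizes for the chosen subsets (all of them by default)."""
    names = LIBERO_SUBSETS_GB if subsets is None else subsets
    return sum(LIBERO_SUBSETS_GB[name] for name in names)


def download_subset(repo_id, subset, local_dir):
    """Fetch one subset from a Hugging Face dataset repository.

    Assumptions: `repo_id` is supplied by the caller (the source does not
    name it), and the repository stores each subset in its own folder.
    """
    if subset not in LIBERO_SUBSETS_GB:
        raise ValueError(f"unknown subset: {subset}")
    # Imported lazily so the size helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=[f"{subset}/*"],
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(f"all subsets: {total_size_gb():.2f} GB")
```

Summing before downloading is a cheap way to confirm there is enough disk space, since libero_90 alone accounts for roughly two thirds of the total.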

## Academic Contributions and Engineering Value

Academic value: the project explores a new paradigm that combines world models with VLA frameworks. This combination is expected to reduce the data demand of traditional VLA methods and to improve generalization to new tasks and environments. Engineering value: the project provides a complete, reproducible codebase (installation guides, dataset scripts, and training configurations), open-sourced under the MIT license to support follow-up research and applications.

## Future Development Directions

Potential development directions for ActionJEPA include: extending it to real robot platforms; adopting stronger pre-trained vision-language models; supporting online learning and adaptation after deployment; and fusing additional sensing modalities such as touch and force.
