# Awesome Multimodal GUI Agents: Panoramic Map of Multimodal GUI Agent Research

> A carefully curated resource list for multimodal GUI agents, covering papers, datasets, benchmarks, models, and open-source projects across four domains: web agents, mobile agents, desktop agents, and computer-use agents.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-31T19:15:02.000Z
- Last activity: 2026-05-31T19:20:14.297Z
- Popularity: 149.9
- Keywords: GUI Agent, Multimodal Agent, Computer Use, Vision-Language Model, Web Agent, Mobile Agent, Desktop Agent, GUI Grounding, Screen Understanding, Action Prediction, Long-Horizon Automation, VLM
- Page link: https://www.zingnex.cn/en/forum/thread/awesome-multimodal-gui-agents-gui
- Canonical: https://www.zingnex.cn/forum/thread/awesome-multimodal-gui-agents-gui
- Markdown source: floors_fallback

---

## Core Introduction to the Awesome Multimodal GUI Agents Project

This article introduces Awesome Multimodal GUI Agents, a GitHub resource list maintained by DeLunnLi (original link: https://github.com/DeLunnLi/Awesome-Multimodal-GUI-Agents, updated on 2026-05-31). The list systematically compiles papers, datasets, benchmarks, models, and open-source projects on multimodal GUI agents across four domains: web, mobile, desktop, and computer-use agents. Because it brings these platforms together in one place, it helps researchers spot techniques shared across platforms, and it serves as both an entry point to the field and a way to track current trends.

## Project Positioning and Cross-Platform Coverage Features

The project focuses on research into vision-language-driven GUI agents, whose core capabilities are perceiving visual interfaces, understanding user instructions, reasoning about GUI states, and executing operations. Its scope includes both core method papers and supporting resources such as benchmarks and datasets. Unlike single-platform resource lists, it covers web, mobile, desktop, and computer-use agents together, reflecting the field's shift toward general, cross-platform agents and helping researchers explore how techniques transfer from one platform to another.
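The four capabilities named above (perceive, understand, reason, act) are often arranged as a step-by-step loop. The following is a minimal sketch of such a loop, not the method of any project in the list: `Observation`, `Action`, `run_episode`, and `scripted_policy` are hypothetical names introduced for illustration, and the scripted policy stands in for what would normally be a vision-language model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """What the agent perceives at one step (hypothetical structure)."""
    screenshot: bytes   # raw pixels of the current screen
    instruction: str    # the user's natural-language goal


@dataclass
class Action:
    """One GUI operation (hypothetical schema, not a real agent API)."""
    kind: str           # e.g. "click", "type", or "done"
    target: str = ""    # description of the UI element to act on
    text: str = ""      # text to enter, used by "type" actions


# A policy maps the current observation and the action history to the next action.
Policy = Callable[[Observation, List[Action]], Action]


def run_episode(
    instruction: str,
    capture: Callable[[], bytes],
    policy: Policy,
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Run the perceive -> reason -> act loop until "done" or max_steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        # Perceive: grab the screen and pair it with the instruction.
        obs = Observation(screenshot=capture(), instruction=instruction)
        # Understand and reason: the policy picks the next action.
        action = policy(obs, history)
        history.append(action)
        if action.kind == "done":
            break
        # Act: apply the chosen operation to the GUI.
        execute(action)
    return history


def scripted_policy(obs: Observation, history: List[Action]) -> Action:
    """Stand-in for a model: replays a fixed three-step script."""
    script = [
        Action("click", target="search box"),
        Action("type", target="search box", text=obs.instruction),
        Action("done"),
    ]
    return script[min(len(history), len(script) - 1)]
```

Passing `capture`, `policy`, and `execute` as plain callables keeps the loop independent of any platform, which mirrors the cross-platform framing: in principle, only the screen capture and action execution would differ between web, mobile, and desktop settings.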

## Core Research Directions and Technical Context

The works collected in the project point to three main technical directions in the field:

1. Training and data synthesis: Video2GUI/WildGUI extract large-scale GUI trajectories from unlabeled videos for pre-training; RoTS improves error recovery through robustness-driven trajectory synthesis.
2. Model architecture: UI-TARS-2 advances agent capabilities with multi-turn reinforcement learning; UI-Venus-1.5 provides a complete technical report and accompanying resources; EvoCUA explores scalable synthetic experience for evolutionary agents.
3. Evaluation and benchmarks: MobileGym offers a simulation platform for mobile agents; OpenComputer builds a desktop application environment; PhoneWorld expands the scale of mobile agent research.

## Latest Developments in the Field (2024-2026)

Recent progress tracked by the project includes:

- May 2026: RoTS introduced the GUI-RobustEval benchmark and robust trajectory synthesis; MobileGym launched a verifiable mobile simulation platform; OpenComputer, Video2GUI/WildGUI, and PhoneWorld brought related advances.
- Early 2026: UI-Venus-1.5 released its complete resources; EvoCUA explored the scaling of synthetic experience.
- 2025: UI-TARS-2 (multi-turn reinforcement learning), OpenCUA (a foundation for computer-use agents), UI-R1, ScreenLLM, and V-Droid filled gaps in the field.
- 2024: BrowserGym and AutoDroid-V2 enriched web and mobile research.

## Systematic Resource Classification System

The project organizes its entries into a structured set of categories for easy retrieval:

- Survey papers
- GUI grounding and screen understanding
- General GUI agents
- Web, mobile, desktop, and computer-use agents
- Evaluation and benchmarks
- Training, data synthesis, and reinforcement learning
- Safety, robustness, and security
- Open-source projects and tools

These categories span the range from low-level visual understanding to high-level task planning, which reduces the search effort for researchers new to the field.
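A category scheme like this can be treated as a small, filterable index. The sketch below is only an illustration of that idea and does not reflect how the repository stores its data: `Entry`, `ENTRIES`, and `find` are hypothetical names, and the sample entries use only the groupings and platforms stated earlier in this article, with the platform left as `None` where the article does not state one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """One item in the resource list (hypothetical record layout)."""
    name: str
    category: str
    platform: Optional[str] = None  # None when this article states no platform


# Sample entries, grouped as this article describes them.
ENTRIES = [
    Entry("MobileGym", "evaluation and benchmarks", "mobile"),
    Entry("OpenComputer", "evaluation and benchmarks", "desktop"),
    Entry("PhoneWorld", "evaluation and benchmarks", "mobile"),
    Entry("RoTS", "training and data synthesis"),
    Entry("Video2GUI/WildGUI", "training and data synthesis"),
]


def find(
    entries: List[Entry],
    category: Optional[str] = None,
    platform: Optional[str] = None,
) -> List[str]:
    """Return names of entries matching every filter that is given."""
    return [
        e.name
        for e in entries
        if (category is None or e.category == category)
        and (platform is None or e.platform == platform)
    ]


# Example query: benchmark resources aimed at mobile agents.
mobile_benchmarks = find(
    ENTRIES, category="evaluation and benchmarks", platform="mobile"
)
```

Filtering on category and platform separately reflects the list's two axes: what kind of resource an entry is, and which platform it targets.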

## Core Concepts and Usage Recommendations

Core keywords in the field include GUI Agent, Multimodal Agent, Computer Use, VLM, Web/Mobile/Desktop Agent, GUI Grounding, Screen Understanding, Action Prediction, and Long-Horizon Automation.

Usage recommendations:

1. Browse the latest updates to keep up with current trends.
2. Use the listed benchmarks and datasets for evaluation.
3. Explore the open-source projects to gain hands-on code experience.

The project welcomes community contributions via pull requests and helps researchers position their own work within the field.