Section 01
Herculis-CUA-GUI-Actioner-4B: Core Overview of a Multi-modal GUI Interaction Model
Herculis-CUA-GUI-Actioner-4B is a multi-modal large language model built for graphical user interface (GUI) interaction. Its core capabilities are UI element localization, visual grounding, and action execution. As a Computer Use Agent (CUA), it interprets screenshots, identifies interface elements, and performs operations such as clicks and text input to automate tasks across web, desktop, and mobile platforms. Traditional automation tools rely on predefined scripts or DOM parsing; the model instead works from what is visible on screen, following a human-like interaction paradigm that addresses the limitations of those tools.
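The screenshot-in, action-out cycle described above can be pictured as a simple loop. The following is a minimal sketch, not the model's actual interface: the `Action` type, the action names ("click", "type", "done"), and the `capture`, `predict`, and `execute` callables are all hypothetical placeholders, and the model call is replaced by a scripted stand-in so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One GUI action. The action set here is hypothetical, for illustration only."""
    kind: str                      # "click", "type", or "done"
    x: Optional[int] = None        # pixel coordinates for a click
    y: Optional[int] = None
    text: Optional[str] = None     # text to enter for a "type" action


def run_agent(
    task: str,
    capture: Callable[[], bytes],
    predict: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Observe the screen, ask the policy for the next action, execute it, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                        # observe the current screen
        action = predict(task, screenshot, history)   # a real agent would call the model here
        if action.kind == "done":                     # the policy signals task completion
            break
        execute(action)                               # perform the click or text input
        history.append(action)
    return history


def scripted_policy(task: str, screenshot: bytes, history: List[Action]) -> Action:
    """Stand-in for the model: replays a fixed script, one action per step."""
    script = [
        Action("click", x=120, y=48),
        Action("type", text="hello"),
        Action("done"),
    ]
    return script[len(history)]


# Demo with stubs: the "screenshot" is empty bytes and actions are only recorded.
executed: List[Action] = []
history = run_agent(
    "Search for hello",
    capture=lambda: b"",
    predict=scripted_policy,
    execute=executed.append,
)
```

In a real deployment, `capture` would grab the screen, `predict` would send the screenshot and task to the model and parse its reply, and `execute` would drive the mouse and keyboard. Keeping the three as injected callables means the loop itself does not depend on any particular platform or automation library.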