Zing Forum

PAGER: Bridging the Semantic-Execution Gap in Precise Control of Geometric GUI

PAGER is a topology-aware agent architecture designed for precise point control in geometric construction GUI tasks. By combining structured dependency planning with pixel-level execution, PAGER raises task success rate from under 6% to 24.6% and single-step success rate from under 9% to over 62%, setting a new standard for point-precise GUI control.

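PAGER, GUI agent, geometric construction, point-precise control, vision-language model, reinforcement learning, topology-aware, PAGE, benchmark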
Published 2026-05-15 21:55 | Recent activity 2026-05-18 16:18 | Estimated read 5 min

Section 01

PAGER: Bridging the Semantic-Execution Gap in Precise Control of Geometric GUI [Introduction]

PAGER is a topology-aware agent architecture designed for precise point control in geometric construction GUI tasks. By combining structured planning with pixel-level execution, it raises task success rate from under 6% to 24.6% and single-step success rate from under 9% to over 62%. This article provides a detailed analysis of the research.

Section 02

Research Background and Problem Definition

Large vision-language models (VLMs) handle ordinary GUI interaction well because such tasks follow a forgiving "region tolerance" paradigm: a click anywhere inside a button or menu item counts as correct. Geometric construction tasks break this assumption, since they demand pixel-level precision and each step depends on the geometry built before it. The study defines "precision-sensitive GUI tasks" by three characteristics: point-level precision requirements, geometry-aware verification, and dependency-driven robustness against error propagation.
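The contrast between region tolerance and point precision can be sketched as two hit tests. This is only an illustration: the `Box` type, the function names, and the 2-pixel tolerance are assumptions made for the example, not values taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a clickable widget, in pixels (hypothetical)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def region_hit(click, box):
    """Region tolerance: any click inside the widget's box is accepted."""
    return box.contains(*click)


def point_hit(click, target, tol_px=2.0):
    """Point precision: the click must land within tol_px of the exact target.

    tol_px is an assumed threshold, not a figure from the paper.
    """
    return math.dist(click, target) <= tol_px


button = Box(100, 100, 180, 130)
target_vertex = (140.0, 115.0)
click = (152.0, 121.0)

print(region_hit(click, button))        # True: inside the button's box
print(point_hit(click, target_vertex))  # False: about 13.4 px from the vertex
```

The same click passes the region test and fails the point test, which is the situation a geometric construction tool creates whenever a point must coincide with an existing vertex or intersection.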

Section 03

Introduction to the PAGE Benchmark Dataset

The research team built PAGE (Point-precise Agent GEometry), a benchmark of 4,906 problems with over 224,000 pixel-level action annotations. It provides process-level supervision, spans geometric construction scenarios from basic to complex, and reports agent performance stratified by complexity.
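Stratified reporting of this kind might look like the sketch below. The `EpisodeResult` record, the tier labels, and the toy results are all hypothetical; the summary does not describe the actual PAGE schema or tier names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One evaluated problem (hypothetical record, not the PAGE schema)."""
    problem_id: str
    tier: str      # assumed complexity label, e.g. "basic" or "complex"
    solved: bool


def success_by_tier(results):
    """Return the task success rate for each complexity tier."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for r in results:
        totals[r.tier] += 1
        solved[r.tier] += int(r.solved)
    return {tier: solved[tier] / totals[tier] for tier in totals}


# Made-up results, for illustration only.
results = [
    EpisodeResult("p1", "basic", True),
    EpisodeResult("p2", "basic", False),
    EpisodeResult("p3", "complex", False),
    EpisodeResult("p4", "complex", False),
]

print(success_by_tier(results))  # {'basic': 0.5, 'complex': 0.0}
```

Reporting per tier rather than a single average makes it visible when an agent solves only the simplest constructions.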

Section 04

Core Design of the PAGER Architecture

The PAGER architecture consists of two phases:

1. Structured Dependency Planning: analyzes the dependency graph of the geometric construction to determine the construction order, constraint propagation, and key nodes.
2. Pixel-level Execution: achieves precise operations through pixel-anchored supervised fine-tuning, which teaches the action syntax for precise coordinates, and precision-aligned reinforcement learning, which uses state-conditional geometric feedback to correct deviations in real time.
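A minimal sketch of the two ideas, under stated assumptions: the planning phase is approximated by a topological sort over a hand-written dependency graph (a perpendicular-bisector construction chosen for the example), and the geometric feedback is replaced by a simple distance-based reward. Neither the graph format nor the reward shape comes from the paper; `plan_order`, `precision_reward`, and `scale_px` are hypothetical names.

```python
import math
from graphlib import TopologicalSorter

# Hypothetical construction: perpendicular bisector of segment AB.
# Each key maps to the set of objects that must exist before it.
dependencies = {
    "A": set(),
    "B": set(),
    "segment_AB": {"A", "B"},
    "circle_A": {"A", "B"},   # centered at A, passing through B
    "circle_B": {"A", "B"},   # centered at B, passing through A
    "P": {"circle_A", "circle_B"},   # first intersection point
    "Q": {"circle_A", "circle_B"},   # second intersection point
    "bisector_PQ": {"P", "Q"},
}


def plan_order(deps):
    """Return one valid construction order (prerequisites always come first)."""
    return list(TopologicalSorter(deps).static_order())


def precision_reward(predicted, target, scale_px=5.0):
    """Stand-in reward: 1.0 on an exact hit, decaying with pixel distance.

    scale_px is an assumed decay constant, not a value from the paper.
    """
    return math.exp(-math.dist(predicted, target) / scale_px)


order = plan_order(dependencies)
print(order)
print(precision_reward((200.0, 150.0), (200.0, 150.0)))  # 1.0
print(precision_reward((203.0, 154.0), (200.0, 150.0)))  # lower: 5 px off
```

Ordering by dependencies matters because an imprecise early point shifts every object built on it; a reward that grows as the click approaches the target gives the policy a signal finer than a pass/fail region check.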

Section 05

Experimental Results and Key Findings

Experiments reveal a "semantic-execution gap" in general multimodal models: they choose the correct action type more than 88% of the time, yet complete fewer than 6% of tasks. PAGER narrows this gap substantially. Task success rate rises from under 6% to 24.6% (4.1x), and single-step success rate rises from under 9% to over 62% (6.9x). Its advantages lie in dependency awareness, error control, and long-range planning.
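How the three metrics can diverge is easiest to see on a toy example. The definitions below are assumptions for illustration (a step counts as successful when the action type matches and the coordinate is within an assumed pixel tolerance; a task succeeds only when every step does); the summary does not give the paper's exact definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Step:
    """One predicted action paired with its annotation (hypothetical format)."""
    pred_type: str
    gold_type: str
    pred_xy: tuple
    gold_xy: tuple


def type_correct(step):
    return step.pred_type == step.gold_type


def step_correct(step, tol_px=2.0):
    # tol_px is an assumed threshold, not a figure from the paper.
    return type_correct(step) and math.dist(step.pred_xy, step.gold_xy) <= tol_px


def score(episodes, tol_px=2.0):
    steps = [s for episode in episodes for s in episode]
    return {
        "action_type_acc": sum(type_correct(s) for s in steps) / len(steps),
        "step_success": sum(step_correct(s, tol_px) for s in steps) / len(steps),
        "task_success": sum(
            all(step_correct(s, tol_px) for s in episode) for episode in episodes
        ) / len(episodes),
    }


# Made-up episodes: every action type is right, but one click is 10 px off.
episodes = [
    [Step("click", "click", (100, 100), (100, 100)),
     Step("click", "click", (210, 150), (200, 150))],
    [Step("click", "click", (50, 60), (50, 60)),
     Step("drag", "drag", (80, 90), (80, 91))],
]

print(score(episodes))
# {'action_type_acc': 1.0, 'step_success': 0.75, 'task_success': 0.5}
```

Action-type accuracy stays perfect while a single imprecise coordinate fails its step and the whole task, which mirrors the pattern of high type accuracy and low task success described above.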

Section 06

Technical Contributions and Application Prospects

Theoretical contributions: the work proposes a new direction for GUI automation, moving from region tolerance to point-precise control. This shift requires explicit modeling of spatial precision and topological constraints, tight coupling of planning and execution, and supervision signals refined to the pixel level. Application prospects include CAD, scientific visualization, educational software, and graphic design. The team plans to open-source the PAGE benchmark and the PAGER model.

Section 07

Limitations and Future Directions

Current limitations: generalization needs improvement, computational efficiency is low, and the method relies on offline learning. Future directions: extending to 3D geometric construction, integrating large language model reasoning capabilities, and developing human-machine collaboration frameworks.