# PAGER: Bridging the Semantic-Execution Gap in Precise Control of Geometric GUI

> PAGER is a topology-aware agent architecture designed for the point-precise control problem in geometric construction GUI tasks. By combining structured dependency planning with pixel-level execution, PAGER raises task success rate from under 6% to 24.6% and single-step success rate from under 9% to over 62%.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-15T13:55:05.000Z
- Last activity: 2026-05-18T08:18:39.811Z
- Popularity: 84.6
- Keywords: PAGER, GUI agents, geometric construction, point-precise control, vision-language models, reinforcement learning, topology-aware, PAGE benchmark
- Page link: https://www.zingnex.cn/en/forum/thread/pager
- Canonical: https://www.zingnex.cn/forum/thread/pager
- Markdown source: floors_fallback

---

## Introduction

PAGER is a topology-aware agent architecture designed for the point-precise control problem in geometric construction GUI tasks. By combining structured planning with pixel-level execution, it raises task success rate from under 6% to 24.6% and single-step success rate from under 9% to over 62%. This article summarizes the research.

## Research Background and Problem Definition

Large vision-language models (VLMs) handle ordinary GUI interaction well because it follows a forgiving "region tolerance" paradigm: a click anywhere inside the target element succeeds. They fail on geometric construction tasks, which demand pixel-level precision and respect for the geometric dependencies between objects. The study defines "precision-sensitive GUI tasks" by three characteristics: point-level precision requirements, geometry-aware verification, and the need for dependency-driven robustness against error propagation, since an early mistake can corrupt the later steps that build on it.
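The contrast between region tolerance and point precision can be made concrete with a small sketch. Everything here is illustrative: the function names, the 2-pixel tolerance, and the midpoint check are assumptions chosen for the example, not details taken from the PAGE benchmark.

```python
import math

def region_hit(click, box):
    """Region-tolerant check: any click inside the bounding box counts."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def point_hit(click, target, tol_px=2.0):
    """Point-precise check: the click must land within tol_px of the target."""
    return math.dist(click, target) <= tol_px

def is_midpoint(a, b, m, tol_px=2.0):
    """Geometry-aware check: is m (approximately) the midpoint of segment ab?"""
    expected = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return math.dist(m, expected) <= tol_px

# The same click, judged against a 40x40 px button and against a single point.
button_box = (80, 80, 120, 120)      # hypothetical widget bounds
target_point = (100.0, 100.0)        # hypothetical geometric point
click = (108.0, 95.0)

print(region_hit(click, button_box))    # True: inside the button
print(point_hit(click, target_point))   # False: about 9.4 px from the point
print(is_midpoint((0, 0), (10, 0), (5, 0.5)))  # True: 0.5 px from the midpoint
```

A click that is perfectly acceptable for a button is a failure for a geometric point, which is why an agent tuned on ordinary GUI data can pick the right action and still miss.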

## Introduction to the PAGE Benchmark Dataset

The research team built the PAGE (Point-precise Agent GEometry) benchmark dataset, which contains 4,906 problems and over 224,000 pixel-level action annotations. It uses process-level supervision, covers geometric construction scenarios from basic to complex, and stratifies the evaluation of agent performance by complexity.

## Core Design of the PAGER Architecture

The PAGER architecture consists of two phases:

1. Structured dependency planning: analyze the dependency graph of the geometric construction to determine the construction order, constraint propagation, and key nodes.
2. Pixel-level execution: achieve precise operations through pixel-anchored supervised fine-tuning, which teaches the syntax of actions with precise coordinates, and precision-aligned reinforcement learning, which uses state-conditional geometric feedback to correct deviations in real time.
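The two phases can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PAGER's implementation: the example construction (a perpendicular bisector), the use of a plain topological sort for planning, and the exponential distance-based reward with a 3-pixel scale are all hypothetical choices. The source does not specify the planner or the reward form.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical construction: perpendicular bisector of segment AB.
# Each object maps to the set of objects it depends on.
dependencies = {
    "A": set(),
    "B": set(),
    "circle_A": {"A", "B"},   # circle centred at A through B
    "circle_B": {"A", "B"},   # circle centred at B through A
    "P": {"circle_A", "circle_B"},  # first intersection
    "Q": {"circle_A", "circle_B"},  # second intersection
    "line_PQ": {"P", "Q"},
}

def plan_order(deps):
    """Phase 1 (sketch): order the objects so prerequisites come first."""
    return list(TopologicalSorter(deps).static_order())

def precision_reward(pred_xy, target_xy, scale_px=3.0):
    """Phase 2 (sketch): reward that decays with pixel distance to the target.

    Assumed form: exp(-distance / scale_px), equal to 1.0 for an exact hit.
    """
    return math.exp(-math.dist(pred_xy, target_xy) / scale_px)

order = plan_order(dependencies)
print(order)  # one valid construction order; ties may be broken differently
print(round(precision_reward((100, 100), (100, 100)), 3))  # 1.0
print(round(precision_reward((103, 104), (100, 100)), 3))  # 0.189 (5 px off)
```

A graded reward of this kind gives the policy a signal even when it misses, whereas a binary hit-or-miss reward would be zero for almost every early attempt at pixel precision.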

## Experimental Results and Key Findings

Experiments reveal a "semantic-execution gap" in general multimodal models: they choose the correct action type more than 88% of the time, yet complete fewer than 6% of tasks. PAGER narrows this gap substantially. Task success rate rises from under 6% to 24.6% (4.1x), and single-step success rate rises from under 9% to over 62% (6.9x). Its advantages lie in dependency awareness, error control, and long-range planning.
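To see how action-type accuracy can be high while task success stays low, it helps to compute the three metrics side by side on toy data. The definitions below are assumptions made for illustration (a step succeeds when the type matches and the coordinates fall within a 2-pixel tolerance; a task succeeds when every step does); the paper's exact metric definitions may differ.

```python
import math
from dataclasses import dataclass

@dataclass
class Step:
    pred_type: str
    gold_type: str
    pred_xy: tuple
    gold_xy: tuple

def type_correct(step):
    """Semantic level: did the agent choose the right kind of action?"""
    return step.pred_type == step.gold_type

def step_correct(step, tol_px=2.0):
    """Execution level: right action type AND coordinates within tolerance."""
    return type_correct(step) and math.dist(step.pred_xy, step.gold_xy) <= tol_px

def evaluate(episodes, tol_px=2.0):
    """Return (action-type accuracy, step success rate, task success rate)."""
    steps = [s for ep in episodes for s in ep]
    type_acc = sum(type_correct(s) for s in steps) / len(steps)
    step_sr = sum(step_correct(s, tol_px) for s in steps) / len(steps)
    task_sr = sum(all(step_correct(s, tol_px) for s in ep) for ep in episodes) / len(episodes)
    return type_acc, step_sr, task_sr

# Toy data: every action type is right, but one click is 10 px off target.
episodes = [
    [Step("click", "click", (100, 100), (100, 100)),
     Step("click", "click", (210, 150), (200, 150))],   # imprecise step
    [Step("click", "click", (50, 50), (50, 51)),
     Step("drag", "drag", (300, 300), (300, 300))],
]

print(evaluate(episodes))  # (1.0, 0.75, 0.5)
```

A single imprecise click leaves type accuracy untouched but fails the whole task, so the gap between the first and last metric widens as tasks get longer.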

## Technical Contributions and Application Prospects

On the theoretical side, the work proposes a new direction for GUI automation, moving from region tolerance to point-precise control. This shift requires explicit modeling of spatial precision and topological constraints, tight coupling of planning and execution, and supervision signals refined to the pixel level. Application prospects include CAD, scientific visualization, educational software, and graphic design. The team will open-source the PAGE benchmark and the PAGER model.

## Limitations and Future Directions

The current approach has three limitations: its generalization ability needs improvement, its computational efficiency is low, and it relies on offline learning. Future directions include extending to 3D geometric construction, integrating the reasoning capabilities of large language models, and developing human-machine collaboration frameworks.
