Zing Forum

AQuaUI: Visual Token Compression for GUI Agents via Adaptive Quadtree

AQuaUI is a method that compresses visual tokens of GUI Agents during the inference phase without retraining. It identifies and merges visually homogeneous regions using an adaptive quadtree, reducing visual tokens by 29.52% while retaining 99.06% of performance.

GUI Agent, Visual Token Compression, Quadtree, Multimodal Model, Inference Optimization, LMM, Spatial Redundancy, Temporal Consistency
Published 2026-05-19 10:13 | Recent activity 2026-05-20 15:48 | Estimated read 6 min

Section 01

Introduction: AQuaUI, a Retraining-Free Visual Token Compression Scheme for GUI Agents

AQuaUI compresses the visual tokens of GUI Agents at inference time, without retraining. It uses an adaptive quadtree to identify and merge visually homogeneous regions, reducing visual tokens by 29.52% while retaining 99.06% of performance. This addresses the computational overhead GUI Agents incur when processing high-resolution screenshots.

Section 02

Background and Challenges: Visual Input Redundancy in GUI Agents

As Large Multimodal Models (LMMs) are widely adopted for GUI Agents, these models must process high-resolution screenshots. Such screenshots contain substantial visual redundancy (e.g., solid-color backgrounds and repeated textures), while key information occupies only a small portion of the screen. Existing approaches face a dilemma: keeping the complete screenshot incurs high computational cost, whereas attention-based token compression ignores the structured layout and spatial redundancy of GUIs. Existing solutions also suffer from additional training cost or insufficient temporal consistency.

Section 03

Core Method: Token Compression Strategy Using Adaptive Quadtree

AQuaUI uses the spatial structure of a quadtree to divide the screen adaptively according to information density:

  1. Adaptive Quadtree Construction: Analyze how information is distributed across the screen, divide low-information-density regions coarsely, keep fine-grained detail in high-information regions, and merge each homogeneous region into a representative token;
  2. Spatial Position Preservation Mechanism: Preserve the original spatial positions of merged tokens so that the downstream position encoding module continues to work correctly;
  3. Temporal Consistency Optimization: Introduce a conditional quadtree algorithm that refers to the quadtree structure of the previous state, keeps the division of static regions, and recalculates only the regions that changed, which improves efficiency and keeps the division stable across time steps.
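
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the AQuaUI implementation: the source does not specify the homogeneity criterion, the form of the representative token, or how changes between states are detected. The per-channel standard-deviation threshold `tau`, mean pooling, the use of the region centre as the preserved position, and the patch-difference test `eps` are all assumptions, as are the function names `split`, `merge_tokens`, and `conditional_quadtree`.

```python
import numpy as np

def split(feats, y, x, h, w, tau, leaves):
    """Append homogeneous leaf regions (y, x, h, w) covering feats[y:y+h, x:x+w]."""
    block = feats[y:y + h, x:x + w]
    # Assumed criterion: a region is homogeneous if no channel's std exceeds tau.
    if h * w == 1 or block.std(axis=(0, 1)).max() <= tau:
        leaves.append((y, x, h, w))
        return
    h1, w1 = max(h // 2, 1), max(w // 2, 1)
    for cy, ch in ((y, h1), (y + h1, h - h1)):
        for cx, cw in ((x, w1), (x + w1, w - w1)):
            if ch > 0 and cw > 0:  # skip empty children of 1-wide regions
                split(feats, cy, cx, ch, cw, tau, leaves)

def merge_tokens(tokens, leaves):
    """Mean-pool the tokens of each leaf and keep the leaf centre as its position."""
    merged = np.stack([tokens[y:y + h, x:x + w].mean(axis=(0, 1))
                       for y, x, h, w in leaves])
    positions = np.array([(y + (h - 1) / 2, x + (w - 1) / 2)
                          for y, x, h, w in leaves])
    return merged, positions

def conditional_quadtree(prev_leaves, prev_feats, feats, tau, eps=1e-6):
    """Reuse leaves of static regions; re-split only leaves that contain a change."""
    changed = np.abs(feats - prev_feats).max(axis=-1) > eps
    leaves = []
    for y, x, h, w in prev_leaves:
        if changed[y:y + h, x:x + w].any():
            split(feats, y, x, h, w, tau, leaves)
        else:
            leaves.append((y, x, h, w))
    return leaves

# Toy 8x8 patch grid: a solid background with one distinct patch in the corner.
feats = np.zeros((8, 8, 3))
feats[0, 0] = 1.0
leaves = []
split(feats, 0, 0, 8, 8, 0.05, leaves)
merged, positions = merge_tokens(feats, leaves)
print(len(leaves), merged.shape)  # 10 (10, 3): 64 patches become 10 tokens

# Next state: one patch changes in the bottom-right quadrant only.
feats2 = feats.copy()
feats2[6, 6] = 1.0
leaves2 = conditional_quadtree(leaves, feats, feats2, 0.05)
print(len(leaves2))  # 16: nine leaves reused, one leaf re-split into seven
```

In this toy example the large uniform quadrants each collapse to a single token, while the area around the distinct patch stays at full resolution. The conditional variant never merges across previous leaf boundaries, so a real implementation would likely need an additional step to re-coarsen regions that become uniform.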

Section 04

Experimental Evidence: Balance Between Efficiency and Performance

On standard GUI grounding and navigation benchmarks, integrating AQuaUI into the GUI-Owl-1.5-32B-Instruct model yielded a 13.22% inference speedup and a 29.52% reduction in visual tokens while retaining 99.06% of performance (a drop of less than 1%). This supports the hypothesis that GUI screenshots contain spatial redundancy that can be safely compressed, and that this redundancy can be exploited at inference time without retraining.
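
To make the reported figures concrete, the short calculation below applies them to a screenshot budget. The baseline of 10,000 visual tokens is a hypothetical number chosen for illustration and does not come from the source; only the two percentages are reported values.

```python
TOKEN_REDUCTION = 0.2952  # reported fraction of visual tokens removed
PERF_RETAINED = 0.9906    # reported fraction of baseline performance retained

def tokens_after(n_tokens: int) -> int:
    """Visual tokens left after applying the reported average reduction."""
    return round(n_tokens * (1 - TOKEN_REDUCTION))

baseline_tokens = 10_000  # hypothetical screenshot token count
kept = tokens_after(baseline_tokens)
relative_drop = 1 - PERF_RETAINED

print(kept)                    # 7048
print(f"{relative_drop:.2%}")  # 0.94%
```

The reported reduction is an average over benchmarks, so the saving for any single screenshot would depend on how much homogeneous area it contains.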

Section 05

Technical Significance and Application Prospects: Optimization Path for Resource-Constrained Scenarios

AQuaUI shows that the spatial structure of the input can itself be used to optimize multimodal inference efficiency, which has practical value for deploying GUI Agents in resource-constrained environments such as mobile devices and edge computing. The framework is also extensible: future work could explore more sophisticated estimates of region importance, or apply the approach to other visual inputs such as document images and web page screenshots. The conditional quadtree idea may also inspire work on temporal visual tasks.

Section 06

Conclusion: Efficient Compression with Preserved Performance

AQuaUI compresses the visual tokens of GUI Agents with an adaptive quadtree, improving inference efficiency with less than 1% performance loss in the reported experiments. It offers a practical optimization path for deploying GUI Agents at scale and contributes a new idea to the field of visual token compression.