Zing Forum

Revisiting Web Agent Observation Compression: A Lightweight Evaluation Framework Based on Minimal Failure Sets

The research team proposes Minimal Failure Sets (MFS) as a proxy metric for HTML compression effectiveness, achieving over 100x evaluation speedup. They optimized pruning programs based on MFS, reducing latency by 2-3x while maintaining 84-89% success rates on WorkArena and WebLinx.

Web Agent, HTML compression, Minimal Failure Sets, MFS, observation compression, coverage, WorkArena, WebLinx, Agent evaluation, inference acceleration
Published 2026-05-28 13:46 | Recent activity 2026-05-29 13:53 | Estimated read 4 min

Section 01

[Introduction] New Framework for Web Agent Observation Compression: MFS Enables Evaluation Speedup and Performance Optimization

Web Agents built on large language models are constrained by excessively long HTML observations. The research proposes Minimal Failure Sets (MFS) as a proxy metric for how well an HTML compression method preserves what the agent needs, which makes evaluation over 100x faster. Pruning programs optimized against MFS reduce latency by 2-3x while maintaining task success rates of 84-89% on WorkArena and WebLinx.

Section 02

Background: Observation Dilemmas of Web Agents and Existing Evaluation Challenges

Web Agents rely on HTML as their perceptual input, but the HTML of modern web pages suffers from three problems: length explosion (over 100k tokens), information dilution (many elements are irrelevant to the task), and dynamic changes. Existing compression methods include rule-based pruning, similarity-based deduplication, and importance-based selection. End-to-end evaluation of these methods is extremely expensive (for example, evaluating 11 methods on WorkArena L1 takes 232.4 hours), which slows down method iteration.
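To make "rule-based pruning" concrete, here is a minimal sketch built on Python's standard `html.parser`. The rule set (`DROPPED_TAGS`) and the names `RulePruner` and `prune` are assumptions chosen for illustration; they are not one of the 11 methods evaluated in the study.

```python
from html.parser import HTMLParser

# Assumed rule set: subtrees that rarely matter to an agent's next action.
DROPPED_TAGS = {"script", "style", "svg", "noscript"}
# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class RulePruner(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if not self.skip_depth:
                self.parts.append(self.get_starttag_text())
            return
        if self.skip_depth or tag in DROPPED_TAGS:
            self.skip_depth += 1
            return
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit once, never change depth.
        if not self.skip_depth and tag not in DROPPED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text only; whitespace-only runs are discarded.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def prune(html: str) -> str:
    parser = RulePruner()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = '<div><script>var a = 1;</script><button id="ok">OK</button><style>p{}</style></div>'
print(prune(page))  # <div><button id="ok">OK</button></div>
```

A pruner like this is cheap to run, but nothing in it guarantees that the elements a task depends on survive, which is the gap an evaluation metric has to cover.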

Section 03

Method: Minimal Failure Sets (MFS) and Coverage Metric

The study defines a Minimal Failure Set (MFS) as the minimal set of elements whose removal from the observation causes the task to fail; the set is characterized by necessity and minimality. Building on MFS, the study proposes a coverage metric, which takes the value 1 when every MFS element is retained after compression. Coverage can be computed without web access or LLM inference, correlates strongly and positively with end-to-end success rates, and makes evaluation over 100x faster.
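The coverage metric described above can be sketched in a few lines. This sketch assumes that each element has a stable identifier, that a sample scores 0 as soon as any MFS element is dropped (the summary only states the value-1 case), and that per-sample scores are averaged; the function names are illustrative, not the authors' implementation.

```python
from typing import Iterable, Set, Tuple


def sample_coverage(mfs: Set[str], retained: Set[str]) -> int:
    """Return 1 if every MFS element survives compression, else 0 (assumed)."""
    return 1 if mfs <= retained else 0


def mean_coverage(samples: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Average per-sample coverage over (mfs, retained) pairs."""
    scores = [sample_coverage(mfs, retained) for mfs, retained in samples]
    if not scores:
        raise ValueError("no samples")
    return sum(scores) / len(scores)


# Hypothetical element ids: the first sample keeps its whole MFS, the second loses it.
samples = [
    ({"btn-submit", "input-name"}, {"btn-submit", "input-name", "nav"}),
    ({"btn-save"}, {"nav", "footer"}),
]
print(mean_coverage(samples))  # 0.5
```

Because the check is a set comparison, it needs neither a live browser nor a model call, which is consistent with the reported speedup over end-to-end evaluation.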

Section 04

Evidence: Experimental Results of MFS-Optimized Pruning Programs

The team collected MFS data and used it to optimize pruning programs. On the test sets, the optimized programs reduced latency by 2.2x on WorkArena L1 while maintaining an 84% success rate, and by 3.1x on WebLinx while maintaining an 89% success rate. These results support the claim that the MFS framework can guide observation compression without discarding key information.
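The summary does not describe how the pruning programs were optimized. One plausible reading, sketched here purely as an assumption, is to score candidate pruners on collected MFS data and keep the most aggressive one that still meets a coverage threshold. The names `score` and `select_pruner`, the element ids, and the candidates are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Pruner = Callable[[Set[str]], Set[str]]  # element ids in -> retained element ids
Sample = Tuple[Set[str], Set[str]]       # (all element ids on the page, MFS ids)


def score(pruner: Pruner, samples: List[Sample]) -> Tuple[float, float]:
    """Return (mean coverage, mean fraction of elements retained)."""
    covered, kept = 0, 0.0
    for page, mfs in samples:
        retained = pruner(page)
        covered += 1 if mfs <= retained else 0
        kept += len(retained) / len(page)
    return covered / len(samples), kept / len(samples)


def select_pruner(candidates: Dict[str, Pruner], samples: List[Sample],
                  min_coverage: float) -> Optional[str]:
    """Pick the candidate that retains the least while meeting the threshold."""
    best_name, best_kept = None, float("inf")
    for name, pruner in candidates.items():
        coverage, kept = score(pruner, samples)
        if coverage >= min_coverage and kept < best_kept:
            best_name, best_kept = name, kept
    return best_name


samples = [
    ({"nav", "ad", "form", "btn-submit"}, {"form", "btn-submit"}),
    ({"nav", "ad", "btn-save"}, {"btn-save"}),
]
candidates = {
    "keep_all": lambda page: set(page),
    "drop_ads": lambda page: page - {"ad"},
    "drop_buttons": lambda page: {e for e in page if not e.startswith("btn-")},
}
print(select_pruner(candidates, samples, min_coverage=1.0))  # drop_ads
```

The selection loop runs entirely offline, so many candidates can be compared before any of them is tested end to end.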

Section 05

Conclusion: Value and Key Findings of the MFS Framework

The MFS framework provides a lightweight evaluation tool for Web Agent observation compression, moving the field from experience-driven tuning toward systematic evaluation. Key findings include: extractive methods struggle to balance efficiency and generality; MFS is stable across similar tasks and generalizes well; and key elements are concentrated in specific areas of the page (e.g., forms and buttons).

Section 06

Recommendations and Limitations: Deployment Guidance and Future Research Directions

Deployment recommendations: optimize compression programs offline, update them through continuous iteration, and adopt a hybrid strategy that falls back to full HTML for critical tasks. Limitations: computing MFS still carries overhead, the approach adapts poorly to dynamic pages, and it has not been extended to multimodal observations. Future directions: approximate MFS estimation, update mechanisms for dynamic content, and a multimodal extension.
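The hybrid strategy recommended above can be sketched as a simple routing step. How a task is judged "critical" is not specified in the summary, so the `CRITICAL_TASKS` set and the `build_observation` function are hypothetical placeholders.

```python
from typing import Callable, Set

# Hypothetical task types that should always see the uncompressed page.
CRITICAL_TASKS: Set[str] = {"payment", "account-deletion"}


def build_observation(task_type: str, html: str,
                      compress: Callable[[str], str]) -> str:
    """Return full HTML for critical tasks and compressed HTML otherwise."""
    if task_type in CRITICAL_TASKS:
        return html
    return compress(html)


def strip_ads(html: str) -> str:
    # Toy compressor used only for this example.
    return html.replace("<!-- ad -->", "")


page = "<form><!-- ad --><button>Pay</button></form>"
print(build_observation("search", page, strip_ads))   # <form><button>Pay</button></form>
print(build_observation("payment", page, strip_ads))  # <form><!-- ad --><button>Pay</button></form>
```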