Step-by-Step Optimization: A New Method to Dramatically Boost the Learning Efficiency of Computer Agents

This article introduces Step-level Optimization (SO), a framework that recasts agent training as a step-level optimization problem to achieve finer-grained credit assignment and more efficient learning. On the OSWorld benchmark, the method reaches competitive performance while cutting training steps by over 60% and lowering compute requirements.

computer-use agent, step-level optimization, direct preference optimization, GUI automation, reinforcement learning, credit assignment, OSWorld benchmark, AI efficiency
Published 2026-04-30 03:59 | Recent activity 2026-05-02 09:36 | Estimated read 5 min

Section 01

[Introduction] Step-by-Step Optimization: A New Framework to Improve the Learning Efficiency of Computer Agents

This article introduces the Step-level Optimization (SO) framework, which targets the bottlenecks of outcome-based optimization in computer agent training, such as difficult credit assignment and sparse learning signals. SO recasts training as step-level optimization, achieving competitive performance on the OSWorld benchmark while reducing training steps by over 60%.

Section 02

Dilemmas of Existing Methods: Three Bottlenecks of Outcome-based Optimization

Traditional computer agent training relies on outcome-based optimization in end-to-end reinforcement learning, which has three major issues: 1. Difficulty in credit assignment (hard to determine the contribution of each step in long trajectories); 2. Sparse learning signals (feedback is only received when the task is completed); 3. Low sample efficiency (requires large amounts of interaction data). These problems lead to long training times and high resource consumption.
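The credit assignment and sparse-signal problems can be illustrated with a toy calculation. The sketch below is not from the article: the function name `outcome_only_returns` and the 0/1 terminal reward are assumptions chosen for illustration. It computes the return seen by each step when the only reward arrives at the end of the trajectory.

```python
from typing import List


def outcome_only_returns(num_steps: int, success: bool, gamma: float = 1.0) -> List[float]:
    """Return-to-go per step when the only reward is given at the final step.

    Hypothetical setup: reward is 1.0 on task success, 0.0 otherwise.
    """
    final_reward = 1.0 if success else 0.0
    # Step t is (num_steps - 1 - t) steps away from the terminal reward.
    return [final_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


success_returns = outcome_only_returns(5, success=True)
failure_returns = outcome_only_returns(5, success=False)

print(success_returns)  # every step receives the same credit
print(failure_returns)  # a failed trajectory carries no signal at all
```

With no discounting, every step in a successful trajectory gets identical credit regardless of whether it helped, and a failed trajectory yields all zeros, which is the situation a step-level signal is meant to improve.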

Section 03

Core Idea of Step-level Optimization: Transition from Trajectories to Steps

The core of the SO framework is to recast training as step-level optimization, in two parts: 1. Trajectory decomposition: each trajectory is split into individual steps (e.g., clicks, text inputs), so every step becomes a learning opportunity; 2. Step-level direct preference optimization: borrowing from DPO, preference pairs are built by comparing the relative quality of candidate actions at a step, which avoids hand-designing complex reward functions and allows learning from failed trajectories.
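As a rough sketch of what a step-level preference loss could look like, the code below applies the standard DPO objective to a single step, given log-probabilities of a preferred and a rejected action under the trained policy and a frozen reference policy. The article does not give SO's exact loss, so the function name `step_dpo_loss`, its arguments, and the default `beta` are assumptions.

```python
import math


def step_dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed temperature; not specified in the article
) -> float:
    """Standard DPO loss for one (state, chosen action, rejected action) triple.

    loss = -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to keep exp() from overflowing.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
neutral = step_dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy has shifted probability toward the chosen action: the loss drops.
improved = step_dpo_loss(-0.5, -2.0, -1.0, -1.0)
print(neutral, improved)
```

Because the loss only needs a relative ranking of two actions at the same step, a failed trajectory can still supply training pairs as long as some step has a better and a worse candidate.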

Section 04

Technical Implementation Details of the SO Framework

The implementation has three parts: 1. Trajectory decomposition and step encoding (a multimodal encoder fuses visual and structured information); 2. Preference data construction (three strategies: rule-based, model-based, and human feedback); 3. Optimization objectives (preference loss + consistency loss + exploration loss), trained with a curriculum learning strategy.
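The data flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the `Step` and `PreferencePair` types, the toy rule in `rule_score`, and the loss weights in `combined_loss` are all hypothetical, and the article does not say how the three loss terms are combined (a weighted sum is assumed here).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    observation: str              # stand-in for an encoded screenshot / UI tree
    candidates: Tuple[str, ...]   # candidate actions proposed at this step


@dataclass(frozen=True)
class PreferencePair:
    observation: str
    chosen: str
    rejected: str


def build_pairs(steps: List[Step], score: Callable[[str, str], float]) -> List[PreferencePair]:
    """Turn each step into at most one preference pair using a scoring function."""
    pairs = []
    for step in steps:
        best = max(step.candidates, key=lambda a: score(step.observation, a))
        worst = min(step.candidates, key=lambda a: score(step.observation, a))
        # Skip steps where the scorer cannot separate the candidates.
        if score(step.observation, best) > score(step.observation, worst):
            pairs.append(PreferencePair(step.observation, best, worst))
    return pairs


def rule_score(observation: str, action: str) -> float:
    """Toy rule (assumption): prefer actions whose target appears on screen."""
    target = action.split(":")[-1]
    return 1.0 if target in observation else 0.0


def combined_loss(pref: float, cons: float, expl: float,
                  w_cons: float = 0.1, w_expl: float = 0.01) -> float:
    """Assumed weighted sum of the three loss terms named in the text."""
    return pref + w_cons * cons + w_expl * expl


trajectory = [
    Step("dialog with Save button", ("click:Save", "click:Cancel")),
    Step("blank desktop", ("click:Save", "click:Cancel")),  # tie, so no pair
]
pairs = build_pairs(trajectory, rule_score)
print(pairs)
```

A model-based or human-feedback strategy would slot in by replacing `rule_score` with a different scoring function, leaving the pair construction unchanged.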

Section 05

Experimental Validation: Performance and Efficiency Improvements on the OSWorld Benchmark

On the OSWorld benchmark, SO achieves task success rates comparable to traditional methods while training far more efficiently: it needs over 60% fewer training steps to reach the same performance. Ablation experiments show that step decomposition, preference optimization, and the consistency loss are all key components.

Section 06

Practical Application Value and Prospects of the SO Framework

The application value of SO includes: 1. Reducing training costs (less GPU time and shorter development cycles); 2. Supporting more complex long-trajectory tasks; 3. Combining with LLMs and vision-language models to expand capability boundaries.

Section 07

Limitations and Future Research Directions

Current SO has limitations: it only supports discrete action spaces, preference data construction requires domain knowledge, and theoretical analysis is insufficient. Future directions: expansion to continuous control, automated preference data construction, theoretical explanation, multi-agent collaboration, and cross-domain applications.

Section 08

Conclusion: Significance and Future Outlook of the SO Framework

SO addresses credit assignment and efficiency issues by refining the granularity of training, an important step forward for computer agent training. As automation demands grow, efficient methods like SO should help agents see wider use in fields such as office work and software testing, and those agents are expected to become more capable and practical over time.