# Step-level Optimization: A New Method to Boost the Learning Efficiency of Computer-Use Agents

> This article introduces a new framework called Step-level Optimization (SO), which recasts agent training as a step-level optimization problem to achieve finer-grained credit assignment and more efficient learning. The method achieves competitive performance on the OSWorld benchmark while needing over 60% fewer training steps and less compute.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-29T19:59:36.000Z
- Last activity: 2026-05-02T01:36:27.409Z
- Popularity: 106.4
- Keywords: computer-use agent, step-level optimization, direct preference optimization, GUI automation, reinforcement learning, credit assignment, OSWorld benchmark, AI efficiency
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-27151v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-27151v1

---

## Introduction: Step-level Optimization, a New Framework to Improve the Learning Efficiency of Computer Agents

This article introduces the Step-level Optimization (SO) framework, which targets the bottlenecks of outcome-based optimization in computer agent training, such as difficult credit assignment and sparse learning signals. SO recasts training as step-level optimization and achieves competitive performance on the OSWorld benchmark while reducing the number of training steps by over 60%.

## Dilemmas of Existing Methods: Three Bottlenecks of Outcome-based Optimization

Traditional computer agent training relies on outcome-based optimization within end-to-end reinforcement learning, which has three major issues:

1. Difficult credit assignment: in a long trajectory it is hard to determine how much each step contributed to the outcome.
2. Sparse learning signals: feedback arrives only when the task is completed.
3. Low sample efficiency: large amounts of interaction data are required.

Together, these problems lead to long training times and high resource consumption.
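The credit-assignment problem can be made concrete with a minimal Python sketch. It assumes a binary terminal reward, which the article does not specify, and `outcome_credit` is a hypothetical helper used only for illustration, not part of SO.

```python
from typing import List


def outcome_credit(num_steps: int, success: bool) -> List[float]:
    """Outcome-based signal: every step inherits the same terminal reward.

    Assumption: the reward is 1.0 on task success and 0.0 otherwise.
    """
    reward = 1.0 if success else 0.0
    return [reward] * num_steps


# A 5-step trajectory that failed, perhaps only because of its last action.
# Every step receives the same signal, so good early steps look as bad as
# the step that actually caused the failure.
credits = outcome_credit(num_steps=5, success=False)
print(credits)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Because the signal is identical for all steps, the learner cannot tell which action to keep and which to change without many more trajectories.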

## Core Idea of Step-level Optimization: Transition from Trajectories to Steps

The core of the SO framework lies in recasting training as step-level optimization:

1. Trajectory decomposition: each trajectory is split into independent steps (e.g., clicks or text inputs), and every step becomes a learning opportunity.
2. Step-level direct preference optimization: drawing on the idea of DPO, SO constructs preference pairs by comparing the relative quality of candidate actions at a step. This avoids hand-designing complex reward functions and allows learning from failed trajectories.
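The two ideas above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the `Candidate`, `Step`, and `PreferencePair` types, the `score` field, and the `margin` parameter are all assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    action: str   # e.g. "click(submit)"
    score: float  # hypothetical quality score (from a rule, a model, or a human)


@dataclass(frozen=True)
class Step:
    observation: str                   # stand-in for a screenshot or UI state
    candidates: Tuple[Candidate, ...]  # alternative actions for this step


@dataclass(frozen=True)
class PreferencePair:
    observation: str
    chosen: str
    rejected: str


def build_pairs(trajectory: List[Step], margin: float = 0.0) -> List[PreferencePair]:
    """Turn each step into (chosen, rejected) pairs, skipping near-ties."""
    pairs = []
    for step in trajectory:
        for a, b in combinations(step.candidates, 2):
            if abs(a.score - b.score) <= margin:
                continue  # no clear preference between these two actions
            better, worse = (a, b) if a.score > b.score else (b, a)
            pairs.append(PreferencePair(step.observation, better.action, worse.action))
    return pairs


trajectory = [
    Step("login page", (Candidate("type(username)", 0.9),
                        Candidate("click(help_link)", 0.2))),
    Step("form filled", (Candidate("click(submit)", 0.8),
                         Candidate("click(cancel)", 0.1),
                         Candidate("scroll(down)", 0.1))),
]
pairs = build_pairs(trajectory)
print(len(pairs))  # 3
```

Note that a step from a failed trajectory still yields pairs as long as its candidates can be ranked, which is one way to read the article's point about learning from failures.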

## Technical Implementation Details of the SO Framework

The implementation has three main parts:

1. Trajectory decomposition and step encoding: a multimodal encoder fuses visual and structured information for each step.
2. Preference data construction: three strategies are used, based on rules, models, and human feedback.
3. Optimization objective and training strategy: the objective combines a preference loss, a consistency loss, and an exploration loss, and training follows a curriculum learning schedule.
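A rough Python sketch of how such an objective could be assembled follows. The preference term uses the standard DPO loss applied to a single step. The article does not define the consistency or exploration losses, so they are passed in as precomputed numbers, and the weights, the `beta` value, and the length-based curriculum proxy are assumptions made for illustration only.

```python
import math
from typing import List, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def step_dpo_loss(logp_chosen: float, logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) action pair at one step."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


def total_loss(preference: float, consistency: float, exploration: float,
               w_consistency: float = 0.1, w_exploration: float = 0.01) -> float:
    """Weighted sum of the three terms; the weights are hypothetical."""
    return preference + w_consistency * consistency + w_exploration * exploration


def curriculum_order(tasks: List[Tuple[str, int]]) -> List[str]:
    """Order tasks from short to long, using step count as a difficulty proxy
    (an assumption; the article does not say how difficulty is measured)."""
    return [name for name, _ in sorted(tasks, key=lambda t: t[1])]


# The policy already prefers the chosen action more than the reference does,
# so the loss falls below log(2), the value at zero margin.
loss = step_dpo_loss(-0.5, -2.0, -1.0, -1.0)
print(loss < math.log(2))  # True
```

Keeping the consistency and exploration terms as plain inputs makes the sketch agnostic to how the paper actually defines them.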

## Experimental Validation: Performance and Efficiency Improvements on the OSWorld Benchmark

On the OSWorld benchmark, SO achieves task success rates comparable to traditional methods while training far more efficiently: it needs over 60% fewer training steps to reach the same performance. Ablation experiments show that step decomposition, preference optimization, and the consistency loss are all key components.
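To put the 60% figure in concrete terms, the short Python sketch below converts a baseline training budget into the corresponding upper bound for SO. The baseline of 100,000 steps is a made-up number for illustration; the article does not report absolute step counts.

```python
def steps_upper_bound(baseline_steps: int, reduction: float = 0.60) -> int:
    """Steps left after removing `reduction` of the baseline budget."""
    return round(baseline_steps * (1.0 - reduction))


# Hypothetical baseline: 100,000 training steps to reach a target success rate.
# "Over 60% fewer steps" then means fewer than 40,000 steps for SO.
print(steps_upper_bound(100_000))  # 40000
```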

## Practical Application Value and Prospects of the SO Framework

The application value of SO includes: 1. Reducing training costs (less GPU time and shorter development cycles); 2. Supporting more complex long-trajectory tasks; 3. Combining with LLMs and vision-language models to expand capability boundaries.

## Limitations and Future Research Directions

SO currently has several limitations: it only supports discrete action spaces, preference data construction requires domain knowledge, and theoretical analysis is still insufficient. Future directions include extension to continuous control, automated preference data construction, theoretical explanation, multi-agent collaboration, and cross-domain applications.

## Conclusion: Significance and Future Outlook of the SO Framework

SO addresses credit assignment and efficiency issues by refining the granularity of training from whole trajectories to individual steps, which marks important progress for computer agent training. As demand for automation grows, efficient methods like SO should promote wider use of agents in fields such as office work and software testing, and help make those agents more capable and practical.
