Zing Forum


SandMLE: Accelerating Reinforcement Learning Training for Machine Learning Engineering Agents via Synthetic Sandboxes

This article introduces the SandMLE framework, which compresses dataset size to a micro scale (50-200 samples) by generating diverse, verifiable synthetic MLE environments. This makes on-policy reinforcement learning feasible in the MLE domain for the first time, improving execution efficiency by more than 13x.

Machine Learning Engineering · Reinforcement Learning · Agent Training · Synthetic Data · MLE
Published 2026-04-07 01:19 · Recent activity 2026-04-07 16:09 · Estimated read: 5 min

Section 01

Introduction: SandMLE Framework – A Groundbreaking Solution to Accelerate RL Training for MLE Agents

This article introduces the SandMLE framework, which compresses dataset size to a micro scale of 50-200 samples by generating diverse, verifiable synthetic MLE environments. By attacking the bottleneck of high validation cost in training Machine Learning Engineering (MLE) agents, it makes on-policy reinforcement learning feasible in this domain for the first time: execution efficiency improves by more than 13x, and the trained agents significantly outperform existing supervised fine-tuning baselines in both performance and generalization.

Section 02

Core Bottlenecks in MLE Agent Training and Limitations of Existing Solutions

LLM agents have made significant progress in software engineering, but extending them to the MLE domain runs into prohibitive validation costs: verifying an MLE task requires executing a complete ML pipeline (data preprocessing, model training, metric evaluation) over large-scale datasets, which makes on-policy reinforcement learning nearly infeasible. Existing workarounds sacrifice RL's core advantages: supervised fine-tuning (SFT) lacks exploration, and offline proxy rewards introduce objective bias.
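To make the cost argument concrete, here is a minimal sketch (not from the paper; the linear-regression pipeline and the sample-count cost proxy are assumptions) of why per-rollout verification scales with dataset size, and how a 200-sample sandbox shrinks it:

```python
import numpy as np

def evaluate_pipeline(n_samples, n_features=10, epochs=50):
    """Hypothetical reward check: run a full train/evaluate pipeline.
    Its cost scales with dataset size, which is what makes per-rollout
    verification on full-scale MLE benchmarks so expensive."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n_samples, n_features))
    true_w = rng.normal(size=n_features)
    y = X @ true_w + rng.normal(scale=0.1, size=n_samples)
    w = np.zeros(n_features)
    for _ in range(epochs):  # "training" stage: plain gradient descent
        w -= 0.1 * (X.T @ (X @ w - y) / n_samples)
    mse = float(np.mean((X @ w - y) ** 2))  # "evaluation" stage
    return mse, n_samples * epochs  # reward signal + crude compute proxy

_, cost_full = evaluate_pipeline(100_000)       # full-scale dataset
mse_micro, cost_micro = evaluate_pipeline(200)  # SandMLE-scale sandbox
print(cost_full // cost_micro)  # → 500: the sandbox rollout is 500x cheaper here
```

The point is not the toy model but the multiplier: every RL rollout pays the pipeline cost, so shrinking the dataset shrinks the entire training loop proportionally.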

Section 03

Core Design and Implementation of the SandMLE Framework

SandMLE's core insight is that sandbox dataset size is the root cause of the validation-cost bottleneck. It therefore proposes a multi-agent framework for synthetic environment generation that: 1. strictly constrains each dataset to 50-200 samples; 2. preserves the structure and technical complexity of real MLE problems (diverse data distributions, a complete task pipeline, genuine technical challenges); 3. produces diverse, reliable synthetic environments through multi-agent collaboration.
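The three design points above can be sketched as a toy generation loop. All function names, roles, and checks here are hypothetical illustrations of the idea, not SandMLE's actual implementation:

```python
import numpy as np

def design_task(rng):
    """'Designer' role: sample a task spec under the hard 50-200 sample constraint."""
    return {
        "n_samples": int(rng.integers(50, 201)),
        "n_features": int(rng.integers(4, 16)),
        "n_classes": int(rng.integers(2, 5)),
    }

def generate_environment(spec, rng):
    """'Generator' role: emit a small but structurally realistic dataset
    (class clusters with distinct centers, i.e. a non-trivial distribution)."""
    centers = rng.normal(scale=3.0, size=(spec["n_classes"], spec["n_features"]))
    y = rng.integers(0, spec["n_classes"], size=spec["n_samples"])
    X = centers[y] + rng.normal(size=(spec["n_samples"], spec["n_features"]))
    return X, y

def verify_environment(X, y, n_classes):
    """'Verifier' role: reject degenerate environments (e.g. a class that never appears)."""
    return len(X) == len(y) and len(np.unique(y)) == n_classes

rng = np.random.default_rng(42)
envs = []
while len(envs) < 5:
    spec = design_task(rng)
    X, y = generate_environment(spec, rng)
    if verify_environment(X, y, spec["n_classes"]):
        envs.append((spec, X, y))

print([s["n_samples"] for s, _, _ in envs])  # every environment stays micro-scale
```

The division of labor is the key design choice: generation and verification are separate roles, so only environments that pass an independent check ever reach the RL loop.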

Section 04

Experimental Validation: Efficiency and Performance Breakthroughs of SandMLE

Experimental results on the MLE-bench-lite benchmark show: 1. execution efficiency improves by more than 13x, making on-policy RL feasible for the first time; 2. on Qwen3-series models, medal rates rise by 20.3% to 66.9% in relative terms; 3. generalization is strong: the HumanRank score improves by 32.4% on the unseen MLE-Dojo architecture.
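Note that the 20.3%-66.9% figures are relative gains, not percentage-point increases. A quick illustration with made-up medal rates:

```python
# Made-up medal rates, purely to show how a *relative* gain is computed;
# these are not the paper's numbers.
baseline_medal_rate = 0.20
sandmle_medal_rate = 0.24  # +4 percentage points in absolute terms

relative_gain = (sandmle_medal_rate - baseline_medal_rate) / baseline_medal_rate
print(f"{relative_gain:.1%}")  # prints 20.0% — a 20% relative improvement
```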

Section 05

Technical Contributions and Industry Value of SandMLE

SandMLE's contributions include: a methodological breakthrough (demonstrating that environment synthesis can accelerate RL, offering a template for other compute-intensive domains); faster practical adoption (shorter experiment cycles, lower R&D costs); and a return to RL's core strengths (online exploration and trial-and-error learning). The framework is an important milestone for MLE agent training, pushing AI agents toward complex engineering tasks.

Section 06

Limitations of SandMLE and Future Improvement Directions

Current limitations and future directions: 1. synthetic environments still differ from real data, so calibration of their statistical properties needs refinement; 2. task coverage should expand to scenarios such as reinforcement learning and generative modeling; 3. hybrid training strategies that mix synthetic environments with real data could further improve real-world performance.