Zing Forum

DeNovoSWE: A Long-Horizon Software Engineering Dataset for Full Code Repository Generation

DeNovoSWE contains 4818 high-quality instances, built automatically by a sandboxed agent workflow that uses divide-and-conquer and critique-repair strategies. Fine-tuning on it raised Qwen3-30B-A3B's score on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.

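Tags: code generation, software engineering, dataset construction, long-horizon tasks, repository generation, agent training, Qwen3, BeyondSWE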
Published 2026-06-09 19:37 | Recent activity 2026-06-10 11:57 | Estimated read 6 min

Section 01

DeNovoSWE Dataset: A Key Breakthrough in Long-Horizon Full Code Repository Generation

DeNovoSWE is a long-horizon software engineering dataset for full code repository generation. It contains 4818 high-quality instances, constructed automatically by a sandboxed agent workflow that combines divide-and-conquer and critique-repair strategies. Fine-tuning Qwen3-30B-A3B on the dataset raised its score on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%. Source: arXiv paper "DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch" (Link: http://arxiv.org/abs/2606.10728v1, published on 2026-06-09).

Section 02

Challenges from Local Bug Fixes to Full Repository Generation

LLM-based code agents are evolving from local bug fixes toward generating full software repositories. Repository generation spans multiple stages, such as requirement understanding and architecture design, and therefore demands stronger long-horizon planning. The core barrier to training such agents is the lack of large-scale, verifiable data for full repository generation: manual annotation is expensive, and existing open-source repositories are not paired with the high-level specifications they implement.

Section 03

Automated Construction Strategy of DeNovoSWE

DeNovoSWE is built by an automated pipeline with three components:
1. Divide-and-conquer strategy: decompose a complex repository generation task into subtasks (e.g., project structure creation, core module implementation).
2. Critique-repair mechanism: generated code undergoes execution verification and review by a critique module (functional correctness, style consistency, etc.), and any issues found trigger repairs.
3. Sandboxed environment: ensures safe code execution and automated test verification.
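The control flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Subtask`, `decompose`, `critique_repair`, `build_repo`, the fixed three-way split, the `max_rounds` limit, and the stub generator, critic, and repairer are all hypothetical. In a real pipeline the critic would run tests inside a sandbox; here it is a plain function so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Subtask:
    name: str
    code: str = ""
    issues: List[str] = field(default_factory=list)


def decompose(spec: str) -> List[Subtask]:
    # Divide: a fixed, hypothetical split; a real pipeline would derive it from the spec.
    return [Subtask(f"{spec}/{part}") for part in ("structure", "core_module", "tests")]


def critique_repair(
    task: Subtask,
    generate: Callable[[Subtask], str],
    critique: Callable[[Subtask], List[str]],
    repair: Callable[[Subtask], str],
    max_rounds: int = 3,
) -> Subtask:
    # Generate once, then alternate repair and critique until the critic
    # reports no issues or the round budget (an assumed limit) runs out.
    task.code = generate(task)
    task.issues = critique(task)
    rounds = 0
    while task.issues and rounds < max_rounds:
        task.code = repair(task)
        task.issues = critique(task)
        rounds += 1
    return task


def build_repo(spec, generate, critique, repair) -> List[Subtask]:
    # Conquer: solve each subtask independently, then return all of them.
    return [critique_repair(t, generate, critique, repair) for t in decompose(spec)]


# Stub components standing in for the agent, the critic, and the repairer.
def stub_generate(task: Subtask) -> str:
    return f"# {task.name}\nTODO\n"


def stub_critique(task: Subtask) -> List[str]:
    return ["unfinished placeholder"] if "TODO" in task.code else []


def stub_repair(task: Subtask) -> str:
    return task.code.replace("TODO", "pass")


repo = build_repo("demo", stub_generate, stub_critique, stub_repair)
accepted = all(not t.issues for t in repo)  # keep the instance only if every subtask is clean
print(accepted)  # prints True
```

Re-running the critic after every repair means an instance is accepted only on the basis of a fresh verdict, never on a stale one from before the last fix.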

Section 04

Filtering Strategy for Balancing Quality and Diversity

To balance data quality and diversity, DeNovoSWE applies difficulty-aware trajectory filtering:
1. Difficulty evaluation dimensions: number of code lines, number of files, dependency complexity, test pass rate, etc.
2. Stratified (hierarchical) sampling: classify trajectories by difficulty level and sample within each level so that every level is reasonably represented.
3. Diversity guarantee: deduplicate similar generation paths and retain representative samples to avoid overfitting.
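One way to picture this filtering step is the sketch below. It is an illustration only: the scoring weights, the bucket thresholds, the `signature` field used for deduplication, and the function names are assumptions made for the example, not values or APIs from the paper.

```python
import random
from collections import defaultdict


def difficulty_score(traj: dict) -> float:
    # Hypothetical weighting of the four dimensions named in the text.
    return (
        traj["lines"] / 1000
        + traj["files"] / 10
        + traj["deps"] / 5
        + (1.0 - traj["pass_rate"])  # assumed: a lower pass rate means a harder task
    )


def bucket(score: float, edges=(1.0, 2.0)) -> str:
    # Assumed thresholds separating easy, medium, and hard.
    if score < edges[0]:
        return "easy"
    if score < edges[1]:
        return "medium"
    return "hard"


def stratified_filter(trajs, per_bucket: int, seed: int = 0):
    rng = random.Random(seed)
    seen = set()
    buckets = defaultdict(list)
    for t in trajs:
        if t["signature"] in seen:  # drop trajectories with a duplicate signature
            continue
        seen.add(t["signature"])
        buckets[bucket(difficulty_score(t))].append(t)
    kept = []
    for level in ("easy", "medium", "hard"):
        pool = buckets[level]
        # Cap each difficulty level separately instead of sampling uniformly overall.
        kept.extend(rng.sample(pool, min(per_bucket, len(pool))))
    return kept


trajectories = [
    {"signature": "a", "lines": 100, "files": 2, "deps": 1, "pass_rate": 1.0},
    {"signature": "a", "lines": 100, "files": 2, "deps": 1, "pass_rate": 1.0},
    {"signature": "c", "lines": 800, "files": 6, "deps": 2, "pass_rate": 0.9},
    {"signature": "d", "lines": 3000, "files": 20, "deps": 5, "pass_rate": 0.5},
    {"signature": "e", "lines": 50, "files": 1, "deps": 0, "pass_rate": 1.0},
]
kept = stratified_filter(trajectories, per_bucket=1)
print(len(kept))  # prints 3
```

Capping each level separately keeps the abundant easy trajectories from crowding out the hard ones, which uniform sampling over the whole pool would not prevent.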

Section 05

Significant Improvement in Model Performance

After fine-tuning on DeNovoSWE, Qwen3-30B-A3B's score on the BeyondSWE-Doc2Repo benchmark, which tests full repository generation capability, rose from 5.8% to 47.2%. That is a gain of 41.4 percentage points, or roughly 8 times the baseline score. The model improved in sub-dimensions such as project structure creation, core function implementation, and cross-module coordination, notably in complex dependency handling and planning.
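The "8-fold" figure is a ratio of the two reported scores, not an absolute difference. A quick check of the arithmetic, using only the two numbers quoted above (the variable names are illustrative):

```python
baseline = 5.8    # score before fine-tuning, in percent
finetuned = 47.2  # score after fine-tuning, in percent

absolute_gain = round(finetuned - baseline, 1)  # in percentage points
relative_ratio = round(finetuned / baseline, 2)  # multiple of the baseline

print(absolute_gain)   # prints 41.4
print(relative_ratio)  # prints 8.14
```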

Section 06

Implications for Code Agent Research

DeNovoSWE matters for three reasons:
1. It demonstrates that high-quality long-horizon software engineering data can be generated automatically.
2. The divide-and-conquer and critique-repair strategies can be extended to other complex tasks, such as multi-file editing and large-scale refactoring.
3. Difficulty-aware filtering offers a new approach to training data construction, with stratified (hierarchical) sampling proving more effective than uniform sampling.

Section 07

Current Limitations and Future Directions

Limitations: the dataset mainly covers Python projects, and its repositories are still smaller than industrial-scale codebases.
Future directions: expand to more programming languages and frameworks, increase repository scale and complexity, introduce more diverse specifications (natural language requirements, API contracts, etc.), and explore interactive, human-machine collaborative generation.