
DeNovoSWE: A Long-Horizon Software Engineering Dataset for Full Code Repository Generation

DeNovoSWE contains 4,818 high-quality instances, constructed automatically by a sandboxed agent workflow that uses divide-and-conquer and critique-repair strategies. Fine-tuning Qwen3-30B-A3B on the dataset raised its score on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.

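Tags: code generation, software engineering, dataset construction, long-horizon tasks, repository generation, agent training, Qwen3, BeyondSWE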

Published 2026-06-09 19:37 | Recent activity 2026-06-10 11:57

Section 01

DeNovoSWE Dataset: A Key Breakthrough in Long-Horizon Full Code Repository Generation

DeNovoSWE is a long-horizon software engineering dataset for full code repository generation, containing 4,818 high-quality instances. It is constructed automatically by a sandboxed agent workflow that uses divide-and-conquer and critique-repair strategies. Fine-tuning the Qwen3-30B-A3B model on this dataset raised its score on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.

Source: arXiv paper "DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch" (http://arxiv.org/abs/2606.10728v1, published 2026-06-09).

Section 02

Challenges from Local Bug Fixes to Full Repository Generation

LLM-based code agents are evolving from local bug fixes toward generating full software repositories. Full repository generation spans multiple stages, such as requirement understanding and architecture design, and therefore demands stronger long-horizon planning. The core barrier to training such agents is the lack of large-scale, verifiable data for full repository generation: manual annotation is expensive, and existing open-source repositories are not paired with the high-level specifications they implement.

Section 03

Automated Construction Strategy of DeNovoSWE

DeNovoSWE is built by an automated pipeline with three components:
1. Divide-and-conquer strategy: complex repository generation tasks are decomposed into subtasks (e.g., project structure creation, core module implementation).
2. Critique-repair mechanism: generated code is verified by execution and reviewed by a critique module (functional correctness, style consistency, etc.); any issues found trigger repairs.
3. Sandboxed environment: code is executed safely and tests are verified automatically.
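The interplay of decomposition and critique-repair can be sketched as a small control loop. This is a minimal illustration and not the paper's implementation: `Subtask`, `Critique`, `build_repository`, the `max_repairs` budget, and the toy generator and critic are all hypothetical names introduced here, and the critic only checks syntax with `compile()` rather than running tests in a real sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    name: str  # hypothetical: target file for this subtask
    spec: str  # hypothetical: natural-language description


@dataclass
class Critique:
    passed: bool
    issues: List[str] = field(default_factory=list)


def build_repository(
    subtasks: List[Subtask],
    generate: Callable[[Subtask, List[str]], str],
    critique: Callable[[Subtask, str], Critique],
    max_repairs: int = 3,
) -> Dict[str, str]:
    """Solve subtasks one by one; repair until the critic accepts or the budget runs out."""
    repo: Dict[str, str] = {}
    for task in subtasks:
        feedback: List[str] = []
        for _ in range(max_repairs + 1):  # one initial attempt plus repairs
            code = generate(task, feedback)
            verdict = critique(task, code)
            if verdict.passed:
                repo[task.name] = code
                break
            feedback = verdict.issues  # issues drive the next repair attempt
        else:
            raise RuntimeError(f"{task.name!r} still failing after {max_repairs} repairs")
    return repo


def toy_generate(task: Subtask, feedback: List[str]) -> str:
    # Stand-in for an LLM call: the first attempt is deliberately broken,
    # and any feedback triggers the corrected version.
    if not feedback:
        return "def run(:\n    pass\n"
    return "def run():\n    return 'ok'\n"


def toy_critique(task: Subtask, code: str) -> Critique:
    # Stand-in for sandboxed execution plus review: a syntax check only.
    try:
        compile(code, task.name, "exec")
    except SyntaxError as exc:
        return Critique(False, [f"syntax error: {exc.msg}"])
    return Critique(True)


plan = [
    Subtask("layout.py", "create the project structure"),
    Subtask("core.py", "implement the core module"),
]
repo = build_repository(plan, toy_generate, toy_critique)
print(sorted(repo))
```

Passing the critic's issues back into the generator is what distinguishes repair from plain retrying; a bounded budget keeps a stubborn subtask from stalling the whole run.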

Section 04

Filtering Strategy for Balancing Quality and Diversity

To balance data quality and diversity, the pipeline applies difficulty-aware trajectory filtering:
1. Difficulty evaluation dimensions: lines of code, number of files, dependency complexity, test pass rate, etc.
2. Stratified sampling: trajectories are grouped by difficulty level so that every level is reasonably represented.
3. Diversity guarantee: similar generation paths are deduplicated and representative samples are retained, which reduces the risk of overfitting.
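A rough sketch of such a filter is shown below, under stated assumptions. The score weights, the bucket thresholds, the `signature` fingerprint used for deduplication, and every name in the code are hypothetical; the source lists the difficulty dimensions but does not give a formula.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Trajectory:
    repo_id: str
    lines_of_code: int
    num_files: int
    num_dependencies: int
    test_pass_rate: float  # fraction in [0, 1]
    signature: str  # hypothetical fingerprint for near-duplicate detection


def difficulty_score(t: Trajectory) -> float:
    # Illustrative weights only: larger, more entangled, less-passing repos score higher.
    return (
        t.lines_of_code / 1000
        + t.num_files / 10
        + t.num_dependencies / 5
        + (1.0 - t.test_pass_rate)
    )


def bucket(score: float) -> str:
    # Assumed thresholds for three difficulty levels.
    if score < 1.0:
        return "easy"
    if score < 3.0:
        return "medium"
    return "hard"


def filter_trajectories(trajs: List[Trajectory], per_bucket: int, seed: int = 0) -> List[Trajectory]:
    # 1) Deduplicate: keep the first trajectory seen for each signature.
    seen = set()
    unique = []
    for t in trajs:
        if t.signature not in seen:
            seen.add(t.signature)
            unique.append(t)
    # 2) Group by difficulty level.
    buckets: Dict[str, List[Trajectory]] = defaultdict(list)
    for t in unique:
        buckets[bucket(difficulty_score(t))].append(t)
    # 3) Sample up to per_bucket items from every level.
    rng = random.Random(seed)
    selected: List[Trajectory] = []
    for level in ("easy", "medium", "hard"):
        pool = buckets[level]
        selected.extend(rng.sample(pool, min(per_bucket, len(pool))))
    return selected


trajectories = [
    Trajectory("a", 200, 3, 1, 1.0, "s1"),
    Trajectory("b", 200, 3, 1, 1.0, "s1"),  # duplicate of "a"
    Trajectory("c", 300, 4, 1, 1.0, "s2"),
    Trajectory("d", 1200, 8, 3, 0.9, "s3"),
    Trajectory("e", 4000, 30, 10, 0.8, "s4"),
]
selected = filter_trajectories(trajectories, per_bucket=1)
print([(t.repo_id, bucket(difficulty_score(t))) for t in selected])
```

Sampling a fixed quota per level, rather than uniformly over the whole pool, keeps rare hard trajectories from being crowded out by the more numerous easy ones.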

Section 05

Significant Improvement in Model Performance

After Qwen3-30B-A3B was fine-tuned on DeNovoSWE, its score on the BeyondSWE-Doc2Repo benchmark, which tests full repository generation, rose from 5.8% to 47.2%, roughly an 8-fold improvement. The model improved in sub-dimensions such as project structure creation, core function implementation, and cross-module coordination, with the largest gains in complex dependency handling and planning.
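For reference, the relative and absolute gains implied by the two reported scores work out as follows (simple arithmetic on the figures above, not an additional result from the paper):

```latex
\[
\frac{47.2}{5.8} \approx 8.14
\qquad\text{and}\qquad
47.2 - 5.8 = 41.4 \ \text{percentage points}
\]
```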

Section 06

Implications for Code Agent Research

DeNovoSWE matters for three reasons:
1. It demonstrates that high-quality long-horizon software engineering data can be generated automatically.
2. Its divide-and-conquer and critique-repair strategies can be extended to other complex tasks, such as multi-file editing and large-scale refactoring.
3. Difficulty-aware filtering offers a new approach to training data construction, with stratified sampling proving more effective than uniform sampling.

Section 07

Current Limitations and Future Directions

Limitations: the dataset mainly covers Python projects, and its repositories are still smaller than industrial-scale codebases.

Future directions: expand to more programming languages and frameworks, increase repository scale and complexity, introduce more diverse specifications (natural language requirements, API contracts, etc.), and explore interactive generation with human-machine collaboration.
