# R2ABench: Evaluating Large Language Models' Ability to Generate Software Architecture from Requirements Documents

> R2ABench is a new benchmark for evaluating large language models' ability to generate software architecture diagrams from requirements documents, and the study reveals fundamental flaws in LLMs' relational reasoning capabilities.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-08T04:58:36.000Z
- Last activity: 2026-04-09T01:50:13.671Z
- Popularity: 117.1
- Keywords: LLM, software architecture, benchmarking, architecture generation, PlantUML, software engineering
- Page link: https://www.zingnex.cn/en/forum/thread/r2abench
- Canonical: https://www.zingnex.cn/forum/thread/r2abench
- Markdown source: floors_fallback

---

## [Introduction] R2ABench: Evaluating LLMs' Ability to Generate Software Architecture from Requirements

R2ABench is a new benchmark designed to evaluate the ability of large language models (LLMs) to generate software architecture diagrams from requirements documents. The study reveals fundamental flaws in LLMs' relational reasoning and provides a standardized evaluation foundation for research on LLM-driven software architecture generation. This article covers the background, methodology, evaluation results, and practical implications.

## Background: Challenges in Software Architecture Generation and Gaps in Existing Benchmarks

Software architecture design is a key step in transforming abstract requirements into system structures. In traditional processes, architects must weigh functional and non-functional requirements together with module dependencies. In recent years, LLMs have made significant progress in tasks like code generation, but research on higher-level tasks such as architecture design remains scarce. The core obstacle is the lack of dedicated evaluation datasets: existing benchmarks either focus on code-level tasks or lack complete requirements documents and reference architectures from real projects.

## Methodology: Composition of R2ABench Benchmark and Three-Layer Evaluation Framework

R2ABench pairs complete Product Requirements Documents (PRDs) from real software projects with expert-annotated PlantUML reference architecture diagrams. The research team proposes a three-layer hybrid evaluation framework:

1. Structural graph metrics: structural similarity measures such as node counts, edge relationships, and connectivity.
2. Multi-dimensional scoring: accuracy of component identification, correctness of relationship types, soundness of the hierarchy, and similar criteria.
3. Architecture anti-pattern detection: identifying design flaws such as circular dependencies and god objects.
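To make the first and third layers concrete, here is a minimal Python sketch of how a generated diagram could be compared with a reference and checked for anti-patterns. It is an illustration under stated assumptions, not the benchmark's actual implementation: the regex only handles simple `A --> B` style relations, and the names `parse_edges`, `f1`, `has_cycle`, `god_components`, and the `max_degree` threshold are all hypothetical.

```python
import re
from collections import defaultdict

# Assumption: only simple relations such as `A --> B` or `A ..> B : uses`.
# Real PlantUML also has aliases, packages, and many more arrow styles.
RELATION = re.compile(r"^\s*(\w+)\s+[-.]+>\s+(\w+)")


def parse_edges(plantuml: str) -> set[tuple[str, str]]:
    """Extract directed (source, target) pairs from a PlantUML snippet."""
    edges = set()
    for line in plantuml.splitlines():
        match = RELATION.match(line)
        if match:
            edges.add((match.group(1), match.group(2)))
    return edges


def nodes_of(edges) -> set[str]:
    """Collect every component that appears at either end of an edge."""
    return {name for edge in edges for name in edge}


def f1(predicted: set, reference: set) -> float:
    """Set-overlap F1 between predicted and reference items."""
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


def has_cycle(edges) -> bool:
    """Report a circular dependency using depth-first search."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    state = {}  # node -> 1 while on the DFS stack, 2 once finished

    def visit(node) -> bool:
        state[node] = 1
        for nxt in graph.get(node, ()):
            if state.get(nxt) == 1:
                return True  # back edge: nxt is still on the stack
            if nxt not in state and visit(nxt):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in list(graph))


def god_components(edges, max_degree: int = 5) -> list[str]:
    """List components whose total degree exceeds a hypothetical threshold."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return sorted(name for name, d in degree.items() if d > max_degree)


reference = parse_edges("@startuml\nWeb --> Api\nApi --> Auth\nApi --> Orders\nOrders --> Db\n@enduml")
generated = parse_edges("@startuml\nWeb --> Api\nApi --> Orders\nOrders --> Api\n@enduml")

print("node F1:", f1(nodes_of(generated), nodes_of(reference)))
print("edge F1:", f1(generated, reference))
print("cycle in generated diagram:", has_cycle(generated))
```

In this toy comparison the generated diagram recovers most components but fewer relationships, and it introduces a cycle between `Api` and `Orders`. Exact-match comparison of component names is a simplification; a real evaluator would likely need to normalize or align names first.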

## Evidence: Evaluation Findings on LLMs' Architecture Generation Capabilities

The evaluation shows where LLMs are strong: they generate syntactically correct PlantUML diagrams and accurately extract key entities such as classes and modules. It also exposes a fundamental limitation: their relational reasoning is weak, so they struggle to capture complex component dependencies and produce fragmented architectures. Code-specific models (such as CodeLlama) can mitigate this issue, whereas agent frameworks did not bring stable improvements and instead made results more variable.
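One way to quantify "fragmented" is to count the weakly connected components of the generated diagram: a well-connected architecture forms one component, while missing relationships leave isolated islands. The source does not say how fragmentation was measured, so the following Python sketch is only one possible proxy, and the function name `count_components` is hypothetical.

```python
def count_components(nodes, edges) -> int:
    """Count weakly connected components with a small union-find."""
    parent = {name: name for name in nodes}

    def find(name):
        # Follow parent links to the root, halving the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for src, dst in edges:
        # Edge direction is ignored: any relation joins two components.
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        parent[find(src)] = find(dst)

    return len({find(name) for name in parent})


# Entities were extracted correctly, but most relationships are missing.
components = {"Web", "Api", "Auth", "Orders", "Db"}
relations = {("Web", "Api"), ("Orders", "Db")}
print(count_components(components, relations))  # 3 islands: Web-Api, Orders-Db, Auth
```

A count above the reference diagram's count suggests that the model identified the components but failed to connect them, which matches the failure mode described above.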

## Conclusion: Role of LLMs in Architecture Design

R2ABench provides a standardized evaluation foundation for research on LLM-based architecture generation. For now, LLMs are better suited as auxiliary tools for architects than as replacements for human experts. Their weakness in relational reasoning is the core shortcoming and requires targeted optimization.

## Recommendations: Future Research and Application Directions

Future directions include:

1. Improving LLMs' relational reasoning capabilities.
2. Tracking technological progress through standardized benchmarks like R2ABench.
3. Exploring more stable agent frameworks for architecture generation.
4. Promoting human-machine collaboration in architecture design practice.
