R2ABench: Evaluating Large Language Models' Ability to Generate Software Architecture from Requirements Documents

R2ABench is a new benchmark for evaluating large language models' ability to generate software architecture diagrams from requirements documents, and the study reveals fundamental flaws in LLMs' relational reasoning capabilities.

Tags: LLM · software architecture · benchmark · architecture generation · PlantUML · software engineering
Published 2026-04-08 12:58 · Recent activity 2026-04-09 09:50 · Estimated read 5 min

Section 01

[Introduction] R2ABench: Evaluating LLMs' Ability to Generate Software Architecture from Requirements

R2ABench is a new benchmark designed to evaluate the ability of large language models (LLMs) to generate software architecture diagrams from requirements documents. The study reveals fundamental flaws in LLMs' relational reasoning capabilities and provides a standardized evaluation foundation for research on LLM-driven software architecture generation. This article covers the background, methodology, evaluation results, and practical implications.


Section 02

Background: Challenges in Software Architecture Generation and Gaps in Existing Benchmarks

Software architecture design is a key step in transforming abstract requirements into system structures. In traditional processes, architects need to comprehensively consider factors such as functional/non-functional requirements and module dependencies. In recent years, LLMs have made significant progress in tasks like code generation, but research on high-level tasks like architecture design is scarce. The core obstacle is the lack of dedicated evaluation datasets—existing benchmarks either focus on code-level tasks or lack complete requirements documents and reference architectures from real projects.


Section 03

Methodology: Composition of R2ABench Benchmark and Three-Layer Evaluation Framework

The R2ABench benchmark includes complete Product Requirements Documents (PRDs) from real software projects, paired with expert-annotated PlantUML reference architecture diagrams. The research team proposed a three-layer hybrid evaluation framework: 1. graph-structure metrics (structural similarity measures such as node counts, edge relationships, and connectivity); 2. multi-dimensional scoring (accuracy of component identification, correctness of relationship types, rationality of the hierarchy, etc.); 3. architecture anti-pattern detection (identifying design flaws such as circular dependencies and god objects).
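The paper's exact metric definitions are not reproduced here, but the first and third layers can be sketched in a few lines: treat the architecture as a graph of directed edges, score edge overlap against the reference, and detect circular dependencies. The function names and the F1-style overlap score below are illustrative assumptions, not R2ABench's actual implementation.

```python
# Minimal sketch (illustrative, not R2ABench's real code): compare a
# generated architecture graph against a reference and flag cycles.

def edge_f1(reference: set, generated: set) -> float:
    """F1 overlap between reference and generated directed-edge sets."""
    if not reference and not generated:
        return 1.0
    tp = len(reference & generated)
    precision = tp / len(generated) if generated else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def has_cycle(edges: set) -> bool:
    """Detect circular dependencies with a depth-first search."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on stack / done
    color = {n: WHITE for n in graph}
    def dfs(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:          # back edge -> cycle
                return True
            if c == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False
    return any(color[n] == WHITE and dfs(n) for n in list(graph))

# Hypothetical example: a model invents a reverse Auth -> API edge.
ref = {("API", "Auth"), ("API", "DB"), ("Auth", "DB")}
gen = {("API", "Auth"), ("Auth", "API")}
print(round(edge_f1(ref, gen), 2))  # prints 0.4
print(has_cycle(gen))               # True: API -> Auth -> API
```

A production benchmark would also normalize component names and weight edge types, but the shape of the comparison is the same.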


Section 04

Evidence: Evaluation Findings on LLMs' Architecture Generation Capabilities

Evaluation results show clear strengths: LLMs generate syntactically correct PlantUML diagrams and accurately extract key entities (classes, modules, etc.). However, fundamental limitations remain: weak relational reasoning makes it difficult for them to capture complex component dependencies, leading to fragmented architecture structures. Code-specialized models (such as CodeLlama) can mitigate this issue, while agent frameworks did not bring stable improvements and instead increased volatility.
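The "fragmented architecture" finding can be made concrete: a well-formed design usually links its components into one connected graph, while a fragmented one splits into several disconnected islands. Below is a minimal sketch, assuming simple `A --> B` arrow lines in the PlantUML source; the regex and union-find helper are illustrative, not the paper's method.

```python
# Illustrative sketch: count weakly connected components among the
# elements linked in a PlantUML diagram. More components than expected
# suggests a fragmented architecture.
import re

ARROW = re.compile(r"^\s*(\w+)\s*-+>\s*(\w+)")  # matches lines like "A --> B"

def connected_components(plantuml: str) -> int:
    """Number of weakly connected components among linked elements."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for line in plantuml.splitlines():
        m = ARROW.match(line)
        if m:
            union(m.group(1), m.group(2))
    return len({find(x) for x in parent})

# Hypothetical fragmented output: Cache/Worker float apart from the core.
diagram = """
@startuml
Web --> API
API --> DB
Cache --> Worker
@enduml
"""
print(connected_components(diagram))  # prints 2
```

Real PlantUML allows quoted names, labels, and many arrow styles, so a robust checker would need a proper parser; this only captures the idea behind the metric.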


Section 05

Conclusion: Role of LLMs in Architecture Design

R2ABench provides a standardized evaluation foundation for LLM architecture generation research. Currently, LLMs are more suitable as auxiliary tools for architects rather than replacing human experts. Their deficiency in relational reasoning is a core shortcoming that requires targeted optimization.


Section 06

Recommendations: Future Research and Application Directions

Future directions include: 1. optimizing LLMs' relational reasoning capabilities; 2. tracking technological progress through standardized benchmarks like R2ABench; 3. exploring more stable agent frameworks for architecture generation; 4. promoting human-machine collaboration models in architecture design practice.