# Do Large Language Models Truly Understand High-Level Message Sequence Charts? An Empirical Study on Formal Semantics Comprehension

> The study evaluates Gemini-3, GPT-5.4, and Qwen-3.6 on their understanding of the formal semantics of HMSC (the foundation of UML sequence diagrams). It finds an overall accuracy of only 52%, with particularly weak performance on complex semantic reasoning tasks such as abstract composition and trace analysis.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-13T16:50:51.000Z
- Last activity: 2026-05-14T02:55:20.109Z
- Popularity: 140.9
- Keywords: formal semantics, large language models, UML, message sequence charts, software engineering, model comprehension, architecture design, formal methods
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-13773v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-13773v1

---

## [Introduction] Key Findings on LLMs' Ability to Understand HMSC Formal Semantics

The study evaluates Gemini-3, GPT-5.4, and Qwen-3.6 on their understanding of the formal semantics of HMSC (the foundation of UML sequence diagrams). Overall accuracy is only 52%, and performance is weakest on complex semantic reasoning tasks such as abstract composition and trace analysis. This suggests that current LLMs have only a limited grasp of strict formal semantics.

## Background: The Importance of HMSC in Software Architecture Design

High-Level Message Sequence Charts (HMSC), standardized in ITU-T Z.120, are the formal foundation of UML sequence diagrams. Their value lies in precise semantics, verifiability, and standardization. They are widely used in communication protocol design and concurrent system modeling, and they matter in the design of critical systems in telecommunications and aerospace.

## Research Methods: Evaluation Tasks and Experimental Setup

**Research Question**: Do LLMs truly understand the formal semantics of HMSC?

**Evaluation Task Hierarchy**: 
1. Basic semantic structure queries (event recognition, sequence relations, etc.)
2. Semantic-preserving abstraction (event hiding, equivalence judgment, etc.)
3. Compositional semantics (sequential/parallel/choice composition)
4. Trace analysis and LTS computation (trace calculation, property verification, etc.)
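
To make the higher task levels concrete, here is a minimal sketch of what "trace calculation" and "event hiding" can mean for a tiny chart. It is not the study's benchmark or tooling. It assumes a basic chart can be modeled as a set of events plus a partial order, that a trace is any linearization of that order, and that hiding is projection onto the visible events. The chart, the event names (`X!m` for a send, `X?m` for a receive), and the helpers `traces` and `hide` are all hypothetical.

```python
from itertools import permutations


def traces(events, order):
    """Return all linearizations of `events` that respect `order`.

    `order` is a set of (earlier, later) pairs. Brute force over
    permutations is only practical for toy charts like this one.
    """
    valid = []
    for perm in permutations(events):
        position = {event: i for i, event in enumerate(perm)}
        if all(position[a] < position[b] for a, b in order):
            valid.append(perm)
    return valid


def hide(trace, hidden):
    """Project a trace onto its visible events (one reading of event hiding)."""
    return tuple(event for event in trace if event not in hidden)


# Hypothetical chart: A sends m1 to B, C sends m2 to B,
# and B receives m1 before m2.
events = ["A!m1", "B?m1", "C!m2", "B?m2"]
order = {
    ("A!m1", "B?m1"),  # m1 is sent before it is received
    ("C!m2", "B?m2"),  # m2 is sent before it is received
    ("B?m1", "B?m2"),  # B's instance line orders its two receives
}

all_traces = traces(events, order)
# Hide everything related to m2 and collect the distinct visible traces.
visible = {hide(t, {"C!m2", "B?m2"}) for t in all_traces}

for t in all_traces:
    print(" -> ".join(t))
print(visible)
```

Because `C!m2` is only constrained to precede `B?m2`, it can interleave freely with the other events, so the chart has three traces. After hiding both `m2` events they collapse into a single visible trace. Answering a level 4 question correctly requires tracking exactly this kind of interleaving.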

**Experimental Setup**: The three models (Gemini-3, GPT-5.4, Qwen-3.6) are evaluated in a zero-shot setting, that is, without worked examples in the prompt, to test their intrinsic knowledge.

## Experimental Results: Overall Accuracy of 52%, Weak Performance on Complex Tasks

**Overall Performance**: The three models average about 52% accuracy, only slightly better than random guessing and far from expert level.

**Differences by Task Level**:
- Basic concept tasks (event recognition, etc.): ~88% accuracy
- Abstraction and composition reasoning tasks: ~36% accuracy
- Trace analysis and LTS computation tasks: ~42% accuracy
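
As a quick sanity check on these figures, the arithmetic below computes the unweighted mean of the three tier accuracies and the gap between the best and worst tier. It uses only the numbers quoted above. The per-tier question counts are not given in this summary, so the sketch cannot reproduce the reported overall figure.

```python
# Tier accuracies as reported in the summary above.
tier_accuracy = {
    "basic concepts": 0.88,
    "abstraction and composition": 0.36,
    "trace analysis and LTS": 0.42,
}

# Unweighted mean: every tier counts equally, regardless of question count.
unweighted_mean = sum(tier_accuracy.values()) / len(tier_accuracy)

# Gap between the strongest and weakest tier, in percentage points.
values = tier_accuracy.values()
gap_points = (max(values) - min(values)) * 100

print(f"unweighted mean: {unweighted_mean:.1%}")
print(f"best-to-worst gap: {gap_points:.0f} points")
```

The unweighted mean comes out at roughly 55%, a few points above the reported 52%. One possible explanation, which is an assumption rather than something the summary states, is that the harder tiers contain more questions and therefore weigh more heavily in the overall figure.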

**Common Weaknesses**: All models struggle with co-regions (segments of an instance line whose events are left unordered, so they may occur in any order) and with explicit causal dependencies.
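
The effect of a co-region on ordering can be sketched in a few lines. The example assumes that events on an instance line are ordered top to bottom, and that a co-region removes the ordering among its own events while keeping them ordered relative to everything outside it. The helper `instance_order` and the instance `B` are hypothetical and serve only as an illustration.

```python
def instance_order(segments):
    """Ordering pairs induced by one instance line.

    `segments` lists the line top to bottom. A plain string is a single
    event; a set of strings is a co-region whose events are unordered
    among themselves but ordered relative to everything outside it.
    """
    groups = [{s} if isinstance(s, str) else set(s) for s in segments]
    pairs = set()
    for i, earlier in enumerate(groups):
        for later in groups[i + 1:]:
            pairs |= {(a, b) for a in earlier for b in later}
    return pairs


# Hypothetical instance B: receives m1 and m2, then sends ack.
strict = instance_order(["B?m1", "B?m2", "B!ack"])
relaxed = instance_order([{"B?m1", "B?m2"}, "B!ack"])

# The only constraint the co-region removes is the one between the receives.
print(sorted(strict - relaxed))
```

Putting the two receives into a co-region drops exactly one constraint, the one between `B?m1` and `B?m2`, while both still precede `B!ack`. A reader who treats the vertical layout as a strict order will miss the extra traces this allows.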

## In-depth Analysis: Reasons Why LLMs Struggle to Understand Formal Semantics

1. **Pattern matching vs. semantic understanding**: Strong results on basic tasks appear to rest on pattern matching rather than on a grasp of the deeper logical relationships;
2. **Statistical learning vs. formal reasoning**: Formal tasks require precise mathematical reasoning, which statistically trained models do not reliably deliver;
3. **Training data bias**: HMSC appears infrequently in pre-training data;
4. **Architectural limitations**: Transformers may need additional mechanisms to support tasks that require explicit reasoning chains.

## Implications and Recommendations: Practical Guide for AI-Assisted Software Engineering

1. **Be cautious with formal tasks**: Critical tasks require review and verification by human experts;
2. **Combine with symbolic methods**: Let LLMs handle high-level interaction, and leave precise reasoning to symbolic methods such as model checking;
3. **Domain-specific training**: Train models on domain-specific data for key applications;
4. **Human-in-the-loop**: Maintain the core role of human experts in decision-making.
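
The second recommendation can be illustrated with a small symbolic check placed behind a model's answer. This is a sketch under stated assumptions, not a model checker: it assumes a chart is given as events plus ordering pairs, and that the model's answer arrives as a list of event names. The function `check_trace`, the chart, and the two "claimed" traces are hypothetical stand-ins. No real LLM API is called.

```python
def check_trace(trace, events, order):
    """Return a list of problems; an empty list means the trace is valid."""
    problems = []
    if sorted(trace) != sorted(events):
        problems.append("trace must contain every event exactly once")
        return problems
    position = {event: i for i, event in enumerate(trace)}
    for earlier, later in sorted(order):
        if position[earlier] > position[later]:
            problems.append(f"{later} occurs before {earlier}")
    return problems


# Hypothetical chart: A sends req to B, B replies with resp.
events = ["A!req", "B?req", "B!resp", "A?resp"]
order = {("A!req", "B?req"), ("B?req", "B!resp"), ("B!resp", "A?resp")}

# Stand-ins for model output.
claimed_ok = ["A!req", "B?req", "B!resp", "A?resp"]
claimed_bad = ["A!req", "B!resp", "B?req", "A?resp"]

print(check_trace(claimed_ok, events, order))
print(check_trace(claimed_bad, events, order))
```

The design choice is to have the checker return reasons rather than a bare boolean. A violated ordering pair can then be shown to a human reviewer or fed back to the model, which fits the human-in-the-loop recommendation as well.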

## Future Research Directions: Paths to Improve LLMs' Formal Semantics Understanding

1. **Neuro-symbolic fusion**: Develop hybrid architectures to compensate for the shortcomings of pure neural methods;
2. **Formal semantics pre-training**: Pre-train on formal language data to enhance understanding;
3. **Interpretability research**: Analyze how LLMs arrive at their answers on formal semantics tasks;
4. **Interactive learning**: Build frameworks for interactive learning between models and human experts.
