Zing Forum

Do Large Language Models Truly Understand High-Level Message Sequence Charts? An Empirical Study on Formal Semantics Comprehension

The study evaluates Gemini-3, GPT-5.4, and Qwen-3.6 on their understanding of the formal semantics of HMSC (the foundation of UML sequence diagrams). It finds an overall accuracy of only 52%, with particularly weak performance on complex semantic reasoning tasks such as abstraction, composition, and trace analysis.

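Tags: formal semantics, large language models, UML, message sequence charts, software engineering, model comprehension, architecture design, formal methods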
Published 2026-05-14 00:50 | Recent activity 2026-05-14 10:55 | Estimated read 6 min

Section 01

[Introduction] Key Findings on LLMs' Ability to Understand HMSC Formal Semantics

The study evaluates Gemini-3, GPT-5.4, and Qwen-3.6 on the formal semantics of HMSC, the formalism underlying UML sequence diagrams. Overall accuracy is only 52%, and performance is weakest on complex semantic reasoning tasks such as abstraction, composition, and trace analysis. The results indicate that current LLMs have only a limited grasp of rigorous formal semantics.


Section 02

Background: The Importance of HMSC in Software Architecture Design

High-Level Message Sequence Charts (HMSC) are the formal foundation of UML sequence diagrams. Their main strengths are precise semantics, verifiability, and standardization (ITU-T Z.120). They are widely used in communication protocol design and concurrent system modeling, and they matter most in the design of critical systems such as telecommunications and aerospace.


Section 03

Research Methods: Evaluation Tasks and Experimental Setup

Research Question: Do LLMs truly understand the formal semantics of HMSC?

Evaluation Task Hierarchy:

  1. Basic semantic structure queries (event recognition, sequence relations, etc.)
  2. Semantic-preserving abstraction (event hiding, equivalence judgment, etc.)
  3. Compositional semantics (sequential/parallel/choice composition)
  4. Trace analysis and LTS computation (trace calculation, property verification, etc.)
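
To make levels 3 and 4 concrete, here is a minimal sketch of what "composition" and "trace calculation" ask for. It assumes a basic MSC can be modeled as a set of events plus a set of "must happen before" pairs, and that sequential composition is the weak (instance-wise) kind commonly used for HMSC. The names `traces`, `weak_seq`, `m1`, and `m2` are illustrative and do not come from the study.

```python
from itertools import permutations

# Assumption: an event is (instance, label); "!m" is a send, "?m" a receive.
# A basic MSC is a pair (events, order), where order holds pairs (a, b)
# meaning "a must occur before b".

def traces(events, order):
    """Enumerate every linearization of the events that respects `order`."""
    result = []
    for perm in permutations(sorted(events)):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in order):
            result.append(perm)
    return result

def weak_seq(msc1, msc2):
    """Weak sequential composition: an event of msc1 precedes an event of
    msc2 only when both sit on the same instance."""
    ev1, ord1 = msc1
    ev2, ord2 = msc2
    cross = {(a, b) for a in ev1 for b in ev2 if a[0] == b[0]}
    return ev1 | ev2, ord1 | ord2 | cross

# Chart 1: A sends m to B. Chart 2: C sends n to D (no shared instances).
m1 = ({("A", "!m"), ("B", "?m")}, {(("A", "!m"), ("B", "?m"))})
m2 = ({("C", "!n"), ("D", "?n")}, {(("C", "!n"), ("D", "?n"))})

composed = weak_seq(m1, m2)
print(len(traces(*m1)))        # 1
print(len(traces(*composed)))  # 6
```

Because the two charts share no instance, weak sequencing adds no ordering between them, so the composed chart has six traces rather than the single trace a naive "first chart, then second chart" reading would suggest. Questions at levels 3 and 4 plausibly probe this kind of distinction.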

Experimental Setup: The three models (Gemini-3, GPT-5.4, Qwen-3.6) are evaluated in a zero-shot setting, that is, without solved examples in the prompt, so that the results reflect their intrinsic knowledge.
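
A zero-shot evaluation of this kind could be sketched as follows. This is only an illustration under stated assumptions: the prompt layout, the exact-match scoring rule, and the toy records are all hypothetical, and the study's actual harness is not described here.

```python
from collections import defaultdict

def build_zero_shot_prompt(hmsc_text, question):
    """Zero-shot: the prompt holds only the chart and the question,
    with no solved examples (hypothetical layout)."""
    return f"HMSC specification:\n{hmsc_text}\n\nQuestion: {question}\nAnswer:"

def accuracy_by_level(records):
    """records: iterable of (level, predicted, expected) tuples.
    Scoring is a case-insensitive exact match (an assumption)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for level, predicted, expected in records:
        totals[level] += 1
        hits[level] += int(predicted.strip().lower() == expected.strip().lower())
    return {level: hits[level] / totals[level] for level in totals}

# Toy records for illustration only; these are not results from the study.
toy_records = [
    ("basic", "yes", "yes"),
    ("basic", "No", "no"),
    ("trace", "yes", "no"),
    ("trace", "no", "no"),
]
scores = accuracy_by_level(toy_records)
print(scores)  # {'basic': 1.0, 'trace': 0.5}
```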


Section 04

Experimental Results: Overall Accuracy of 52%, Weak Performance on Complex Tasks

Overall Performance: The average accuracy of the three models is about 52%, only slightly better than random guessing and far below expert level.

Hierarchical Differences:

  • Basic concept tasks (event recognition, etc.): ~88% accuracy
  • Abstraction and composition reasoning tasks: ~36% accuracy
  • Trace analysis and LTS computation tasks: ~42% accuracy
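
As a quick arithmetic check, the plain mean of the three category accuracies above is not 52%, which suggests the overall figure is weighted by the number of questions per category. The sketch below shows the difference. The question counts are hypothetical, chosen only to illustrate weighting, and do not come from the study.

```python
# Category accuracies as reported in the summary above (percent).
reported = {"basic": 88, "abstraction_composition": 36, "trace_lts": 42}

# Plain (unweighted) mean of the three categories.
unweighted = sum(reported.values()) / len(reported)
print(round(unweighted, 1))  # 55.3

def weighted_mean(accuracies, counts):
    """Overall accuracy when each category contributes `counts[k]` questions."""
    total = sum(counts.values())
    return sum(accuracies[k] * counts[k] for k in accuracies) / total

# HYPOTHETICAL question counts, for illustration only.
example_counts = {"basic": 25, "abstraction_composition": 40, "trace_lts": 35}
weighted = weighted_mean(reported, example_counts)
print(round(weighted, 1))  # 51.1
```

An overall figure below the plain mean is what one would expect if the harder categories contain more questions than the basic one.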

Common Weaknesses: All models struggle with co-regions (segments of an instance line whose events are unordered and may occur in any order) and with explicit causal dependencies between events.
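
The effect of a co-region can be illustrated by counting the allowed event orders on a single instance. This is a minimal sketch, assuming three receive events `?a`, `?b`, `?c` on one instance; the event names and the helper `linearizations` are illustrative.

```python
from itertools import permutations

def linearizations(events, order):
    """All orderings of `events` in which a comes before b for each (a, b)."""
    return [p for p in permutations(events)
            if all(p.index(a) < p.index(b) for a, b in order)]

events = ["?a", "?b", "?c"]

# Ordinary instance line: events are ordered top to bottom.
instance_order = [("?a", "?b"), ("?b", "?c")]
# Co-region: the same events, with no ordering between them.
coregion_order = []
# Co-region plus one explicit causal dependency (?a before ?c).
dependency_order = [("?a", "?c")]

print(len(linearizations(events, instance_order)))    # 1
print(len(linearizations(events, coregion_order)))    # 6
print(len(linearizations(events, dependency_order)))  # 3
```

A model that reads the drawing order as the execution order would answer 1 in all three cases, which is one plausible way the weaknesses above could show up.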


Section 05

In-depth Analysis: Reasons Why LLMs Struggle to Understand Formal Semantics

  1. Pattern matching vs. semantic understanding: Strong results on basic tasks appear to rely on pattern matching rather than on a grasp of deeper logical relationships;
  2. Statistical learning vs. formal reasoning: Formal tasks require precise mathematical reasoning, which statistical models do not reliably perform;
  3. Training data bias: HMSC appears infrequently in pre-training data;
  4. Architectural limitations: Transformers need additional mechanisms to support tasks that require explicit reasoning chains.

Section 06

Implications and Recommendations: Practical Guide for AI-Assisted Software Engineering

  1. Be cautious with formal tasks: Critical tasks require review and verification by human experts;
  2. Combine with symbolic methods: LLMs handle high-level interactions, while symbolic methods (model checking, etc.) handle precise reasoning;
  3. Domain-specific training: Train on domain data for key applications;
  4. Human-in-the-loop: Maintain the core role of human experts in decision-making.
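
Recommendations 1, 2, and 4 can be combined in a simple pattern: the LLM proposes an answer, a symbolic checker verifies it, and anything unverified goes to a human. The sketch below assumes the task is "give one valid trace of this chart" and models the chart as events plus "before" pairs. The function names and the example chart are hypothetical, and a real pipeline would use a model checker rather than this toy check.

```python
def is_valid_trace(candidate, events, order):
    """Symbolic check: the candidate uses each event exactly once and
    respects every (a, b) 'a before b' constraint."""
    if sorted(candidate) != sorted(events):
        return False
    pos = {e: i for i, e in enumerate(candidate)}
    return all(pos[a] < pos[b] for a, b in order)

def checked_answer(llm_trace, events, order):
    """Accept an LLM-proposed trace only if the checker confirms it;
    otherwise route it to a human expert."""
    if is_valid_trace(llm_trace, events, order):
        return "accepted"
    return "needs human review"

# Hypothetical chart: A sends m to B, then B replies with n to A.
events = ["A!m", "B?m", "B!n", "A?n"]
order = [("A!m", "B?m"), ("B?m", "B!n"), ("B!n", "A?n"), ("A!m", "A?n")]

good = ["A!m", "B?m", "B!n", "A?n"]
bad = ["B?m", "A!m", "B!n", "A?n"]   # receive placed before its send

print(checked_answer(good, events, order))  # accepted
print(checked_answer(bad, events, order))   # needs human review
```

The design choice here is that the LLM never has the final word on a formal question: its output is treated as a candidate that must pass a deterministic check.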

Section 07

Future Research Directions: Paths to Improve LLMs' Formal Semantics Understanding

  1. Neuro-symbolic fusion: Develop hybrid architectures to compensate for the shortcomings of pure neural methods;
  2. Formal semantics pre-training: Pre-train on formal language data to enhance understanding;
  3. Interpretability research: Analyze the decision-making process of LLMs;
  4. Interactive learning: Build frameworks for interactive learning between models and human experts.