Zing Forum

FSE 2026 Paper Reproduction: Multimodal Large Language Models Automatically Identify Interface Usability Issues

The research team from Graz University of Technology open-sourced the complete reproduction data for their FSE 2026 paper, demonstrating how to use MLLMs to analyze screen-recording videos, automatically identify interface usability issues, and suggest improvements.

MLLM · Usability Evaluation · UI/UX · Software Engineering · FSE 2026 · Multimodal Large Models · Nielsen Heuristics · User Interface · Automated Testing
Published 2026-04-10 22:05 · Recent activity 2026-04-10 22:50 · Estimated read: 5 min
Section 01

Introduction: Reproduction of FSE 2026 Research on MLLMs' Automatic Identification of Interface Usability Issues

The research team from Graz University of Technology open-sourced the complete reproduction data for their FSE 2026 paper, showing how to use Multimodal Large Language Models (MLLMs) to analyze screen-recording videos, automatically identify interface usability issues based on Nielsen's heuristics, and provide severity-ranked improvement suggestions. The method aims to lower the barrier to usability evaluation and give resource-constrained teams a practical path to UI/UX optimization.

Section 02

Research Background and Motivation

Traditional usability evaluation requires trained experts and substantial time and resources, which puts it out of reach for many small teams. As the visual understanding capabilities of MLLMs have matured, the research community has begun exploring their potential for automated usability evaluation. This work has been accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026).

Section 03

Overview of Core Methods

The paper proposes an automated method: given application context information and a screen recording of user interaction, an MLLM identifies issues against Nielsen's Ten Usability Heuristics, explains each issue in detail, proposes improvements, and ranks the findings by severity. The key advantage is that no expert intervention is needed: a basic app description plus a screen recording is enough to obtain a structured analysis report.
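The per-finding output described above can be sketched as a structured record ranked by severity. The schema below is a hypothetical illustration (field names and the severity scale are assumptions, not the authors' actual format); only the Nielsen heuristics list is standard:

```python
from dataclasses import dataclass

# Nielsen's Ten Usability Heuristics, which the method uses as its rubric.
NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]

@dataclass
class UsabilityFinding:
    """One issue reported by the MLLM (hypothetical schema)."""
    description: str  # what the model observed in the recording
    heuristic: str    # which Nielsen heuristic is violated
    severity: int     # e.g. 1 (cosmetic) .. 4 (usability catastrophe)
    suggestion: str   # proposed fix

def rank_findings(findings):
    """Sort findings so the most severe issues come first."""
    return sorted(findings, key=lambda f: f.severity, reverse=True)

findings = [
    UsabilityFinding("No confirmation shown after registering",
                     NIELSEN_HEURISTICS[0], 3,
                     "Show a success message after registration"),
    UsabilityFinding("Delete action has no undo or confirmation",
                     NIELSEN_HEURISTICS[2], 4,
                     "Add an undo option or confirmation dialog"),
]
ranked = rank_findings(findings)
print(ranked[0].heuristic)  # the most severe finding comes first
```

Ranking by severity is what lets a small team act on the report top-down without triaging every finding themselves.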

Section 04

Dataset Composition and Experimental Design

The method's effectiveness was verified on two real-world applications:

  • EventHelpR (event management app): screen recordings of tasks such as registration and event management, covering organizer and participant roles;
  • KnowledgeCheckR (knowledge quiz app): screen recordings of scenarios such as quiz participation and quiz creation, covering student and teacher roles.

Each task is accompanied by a structured task-description JSON to facilitate experiment reproduction.
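Since each recording is paired with a task-description JSON, a reproduction script can load it directly. The exact schema is not shown in the post, so the fields below (`app`, `role`, `task`, `steps`) are invented purely for illustration:

```python
import json

# Hypothetical task description; the real files in the reproduction
# package may use different field names.
raw = """
{
  "app": "EventHelpR",
  "role": "participant",
  "task": "Register for an event",
  "steps": ["Open event list", "Select event", "Submit registration form"]
}
"""

task = json.loads(raw)
print(f"{task['app']} / {task['role']}: {task['task']} "
      f"({len(task['steps'])} steps)")
```

Keeping task descriptions in a machine-readable format like this is what makes the experiments repeatable: the same context can be fed to the MLLM verbatim on every run.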

Section 05

Evaluation Results and Value

A user study with software engineers evaluated the usefulness, accuracy, and actionability of the highest-priority suggestions. The results indicate that the method can surface worthwhile improvements at low cost; while it cannot fully replace traditional expert evaluation, it works well as a supplementary tool. Each suggestion includes a problem description, the violated heuristic, a severity level, and a proposed fix, giving developers a clear path to remediation.

Section 06

Technical Implementation and Reproduction Guide

A complete reproduction package is provided: original screen recordings and task descriptions, JSON-formatted analysis reports, evaluation notebooks (browsing/reproduction modes), and anonymized user study data. Reproduction process: Clone the repository → Create a virtual environment → Install dependencies → Run the Jupyter Notebook.
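The four reproduction steps translate to roughly the following commands. The repository URL and notebook filename are placeholders, since the post does not give them:

```shell
# Clone the reproduction package (placeholder URL)
git clone https://example.com/fse2026-usability-repro.git
cd fse2026-usability-repro

# Create and activate an isolated virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the pinned dependencies
pip install -r requirements.txt

# Launch the evaluation notebook (filename is a placeholder)
jupyter notebook evaluation.ipynb
```

Using a dedicated virtual environment keeps the notebook's pinned dependency versions from conflicting with packages installed elsewhere on the machine.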

Section 07

Significance, Limitations, and Future Directions

Significance: Lowers the evaluation threshold, expands MLLM application scenarios in software engineering, and lays the foundation for tool integration. Limitations: MLLMs may miss context-specific issues and depend on video quality. Future directions: Expand to mobile/AR/VR interfaces, dynamic evaluation, and fine-grained severity models.