# FSE 2026 Paper Reproduction: Multimodal Large Language Models Automatically Identify Interface Usability Issues

> A research team from Graz University of Technology has released the complete reproduction data for their FSE 2026 paper, which shows how MLLMs can analyze screen recordings to identify usability issues automatically and suggest improvements.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-10T14:05:58.000Z
- Last activity: 2026-04-10T14:50:48.841Z
- Popularity: 152.3
- Keywords: MLLM, usability evaluation, UI/UX, software engineering, FSE 2026, multimodal large language models, Nielsen's heuristics, user interface, automated testing
- Page link: https://www.zingnex.cn/en/forum/thread/fse-2026
- Canonical: https://www.zingnex.cn/forum/thread/fse-2026

---

## Introduction: Reproduction of FSE 2026 Research on MLLMs' Automatic Identification of Interface Usability Issues

A research team from Graz University of Technology has released the complete reproduction data for their FSE 2026 paper. The paper shows how Multimodal Large Language Models (MLLMs) can analyze screen recordings, identify interface usability issues based on Nielsen's usability heuristics, and return improvement suggestions ranked by severity. The method aims to lower the barrier to usability evaluation and give teams with limited resources a practical way to improve their UI/UX.

## Research Background and Motivation

Traditional usability evaluation depends on trained experts and takes considerable time and resources, which puts it out of reach for many small teams. As the visual understanding capabilities of MLLMs have improved, researchers have begun to explore whether these models can automate parts of usability evaluation. The paper has been accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026).

## Overview of Core Methods

The paper proposes an automated method with two inputs: context information about the application and a screen recording of a user interacting with it. The MLLM identifies issues against Nielsen's ten usability heuristics, explains each issue in detail, suggests an improvement, and ranks the issues by severity. The main advantage is that no expert has to be involved: a basic description and a screen recording are enough to obtain a structured analysis report.
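To make the input side concrete, the following is a minimal sketch of how the text part of such a request might be assembled. The heuristic names are Nielsen's published ten; the function name `build_prompt` and the prompt wording are assumptions for illustration, not the authors' actual prompt, and the step of attaching the video to a model call is left out because it depends on the model provider.

```python
# Nielsen's ten usability heuristics, in their usual order.
NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]


def build_prompt(app_context: str, task_description: str) -> str:
    """Assemble the text that would accompany a screen recording.

    Hypothetical wording: the paper's real prompt may differ.
    """
    numbered = "\n".join(
        f"{i}. {name}" for i, name in enumerate(NIELSEN_HEURISTICS, start=1)
    )
    return (
        "You are reviewing a screen recording of a user interacting with an application.\n"
        f"Application context: {app_context}\n"
        f"Task: {task_description}\n\n"
        "Identify usability issues using Nielsen's ten usability heuristics:\n"
        f"{numbered}\n\n"
        "For each issue, give a description, the violated heuristic, a severity level, "
        "and a suggested improvement. Order the issues from most to least severe."
    )
```

Keeping the heuristics in a plain list means the numbering in the prompt always matches the list, which makes it easier to map a heuristic number in the model's answer back to its name.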

## Dataset Composition and Experimental Design

The method was evaluated on two real-world applications:

- EventHelpR (event management app): screen recordings of tasks such as registration and event management, covering the organizer and participant roles.
- KnowledgeCheckR (knowledge quiz app): screen recordings of scenarios such as taking and creating quizzes, covering the student and teacher roles.

Each task comes with a structured task description in JSON to make the experiments easier to reproduce.
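As an illustration of what a structured task description could look like and how it might be loaded, here is a small sketch. The field names (`app`, `role`, `task`, `recording`), the example values, and the file path are all hypothetical; the actual schema is defined by the files in the reproduction package.

```python
import json

# Hypothetical example. The field names and values are assumptions,
# not the schema used in the reproduction package.
EXAMPLE_TASK_JSON = """
{
  "app": "EventHelpR",
  "role": "organizer",
  "task": "Create a new event",
  "recording": "recordings/example.mp4"
}
"""

REQUIRED_FIELDS = ("app", "role", "task", "recording")


def load_task(text: str) -> dict:
    """Parse a task description and check that the assumed fields are present."""
    task = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in task]
    if missing:
        raise ValueError(f"task description is missing fields: {missing}")
    return task


task = load_task(EXAMPLE_TASK_JSON)
```

Validating the fields up front is a design choice for the sketch: a missing recording path fails immediately instead of surfacing later, when the video is handed to the model.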

## Evaluation Results and Value

A user study with software engineers evaluated the usefulness, accuracy, and actionability of the highest-priority suggestions. The results indicate that the method can surface usability improvements at low cost; it cannot fully replace traditional evaluation, but it can complement it. Each suggestion includes a problem description, the violated heuristic, a severity level, and an improvement plan, which gives developers a clear path to a fix.
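The four parts of a suggestion map naturally onto a small record type. The sketch below shows one way to represent findings and pick the highest-priority ones. The class name `Finding`, the integer severity scale (0 to 4, higher is worse), and the example entries are assumptions made for illustration; they are not taken from the paper's reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    description: str         # what the problem is
    violated_heuristic: str  # which of Nielsen's heuristics it violates
    severity: int            # assumed scale: 0 (not a problem) to 4 (most severe)
    suggestion: str          # proposed improvement


def top_findings(findings: list[Finding], n: int = 3) -> list[Finding]:
    """Return the n most severe findings; ties keep their input order."""
    return sorted(findings, key=lambda f: f.severity, reverse=True)[:n]


# Invented placeholder findings, only to exercise the function.
example = [
    Finding("No feedback after saving", "Visibility of system status", 3,
            "Show a confirmation message"),
    Finding("Inconsistent button labels", "Consistency and standards", 2,
            "Use one label for the same action"),
    Finding("Deleting has no undo", "User control and freedom", 4,
            "Add an undo option or a confirmation dialog"),
]
most_severe = top_findings(example, n=2)
```

Because Python's `sorted` is stable, findings with equal severity stay in the order the report listed them.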

## Technical Implementation and Reproduction Guide

The reproduction package contains:

- the original screen recordings and task descriptions
- the analysis reports in JSON format
- evaluation notebooks with a browsing mode and a reproduction mode
- the anonymized user study data

Reproduction process: clone the repository → create a virtual environment → install the dependencies → run the Jupyter Notebook.
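The four reproduction steps can be sketched as a list of commands built in Python. This is an outline under stated assumptions, not the repository's documented procedure: the repository URL is deliberately left as a placeholder, and `requirements.txt`, the `.venv` directory name, and the presence of Jupyter among the dependencies are all assumed. Check the repository's README for the actual file names.

```python
import sys
from pathlib import Path


def reproduction_steps(repo_url: str, repo_dir: str,
                       requirements: str = "requirements.txt"):
    """Return (working_directory, argv) pairs for the four steps.

    The requirements file name and the .venv location are assumptions.
    """
    repo = Path(repo_dir).resolve()
    venv_dir = repo / ".venv"
    # Virtual environments keep executables in Scripts/ on Windows, bin/ elsewhere.
    bin_dir = venv_dir / ("Scripts" if sys.platform == "win32" else "bin")
    venv_python = str(bin_dir / "python")
    return [
        (None, ["git", "clone", repo_url, str(repo)]),              # clone
        (repo, [sys.executable, "-m", "venv", str(venv_dir)]),      # create venv
        (repo, [venv_python, "-m", "pip", "install", "-r", requirements]),
        (repo, [venv_python, "-m", "jupyter", "notebook"]),         # open notebooks
    ]


# "<repository-url>" is a placeholder; the thread does not give the URL.
steps = reproduction_steps("<repository-url>", "repro")
```

To execute the steps, each pair can be passed to `subprocess.run(argv, cwd=cwd, check=True)` in order. Calling the virtual environment's interpreter directly avoids having to activate the environment in a shell.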

## Significance, Limitations, and Future Directions

- Significance: lowers the barrier to usability evaluation, extends the use of MLLMs in software engineering, and lays the groundwork for tool integration.
- Limitations: MLLMs may miss context-specific issues, and the results depend on the quality of the video.
- Future directions: extend the approach to mobile, AR, and VR interfaces, support dynamic evaluation, and develop finer-grained severity models.
