Zing Forum

FSE 2026 Paper Reproduction: Multimodal Large Language Models Automatically Identify Interface Usability Issues

The research team from Graz University of Technology open-sourced the complete reproduction data for their FSE 2026 paper, demonstrating how to use MLLMs to analyze screen-recording videos, automatically identify interface usability issues, and suggest improvements.

MLLM · Usability Evaluation · UI/UX · Software Engineering · FSE 2026 · Multimodal Large Models · Nielsen Heuristics · User Interface · Automated Testing
Published 2026-04-10 22:05 · Recent activity 2026-04-10 22:50 · Estimated read: 5 min
Section 01

Introduction: Reproduction of FSE 2026 Research on MLLMs' Automatic Identification of Interface Usability Issues

The research team from Graz University of Technology open-sourced the complete reproduction data for their FSE 2026 paper, showing how to use Multimodal Large Language Models (MLLMs) to analyze screen-recording videos, automatically identify interface usability issues based on Nielsen's heuristics, and provide severity-ranked improvement suggestions. The method aims to lower the barrier to usability evaluation and give resource-constrained teams a practical path to UI/UX optimization.

Section 02

Research Background and Motivation

Traditional usability evaluation requires trained experts and substantial time and resources, which puts it out of reach for many small teams. As the visual understanding capabilities of MLLMs have matured, the research community has begun exploring their potential for automated usability evaluation. This work has been accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026).

Section 03

Overview of Core Methods

The paper proposes an automated method: given application context information and a screen recording of user interaction, an MLLM identifies issues against Nielsen's Ten Usability Heuristics, explains each issue in detail, proposes improvements, and ranks the findings by severity. The key advantage is that no expert intervention is needed: a basic app description plus a screen recording is enough to obtain a structured analysis report.
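The per-finding output described above can be sketched as a structured record ranked by severity. The schema below is a hypothetical illustration (field names and the severity scale are assumptions, not the authors' actual format); only the Nielsen heuristics list is standard:

```python
from dataclasses import dataclass

# Nielsen's Ten Usability Heuristics, which the method uses as its rubric.
NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]

@dataclass
class UsabilityFinding:
    """One issue reported by the MLLM (hypothetical schema)."""
    description: str  # what the model observed in the recording
    heuristic: str    # which Nielsen heuristic is violated
    severity: int     # e.g. 1 (cosmetic) .. 4 (usability catastrophe)
    suggestion: str   # proposed fix

def rank_findings(findings):
    """Sort findings so the most severe issues come first."""
    return sorted(findings, key=lambda f: f.severity, reverse=True)

findings = [
    UsabilityFinding("No confirmation shown after registering",
                     NIELSEN_HEURISTICS[0], 3,
                     "Show a success message after registration"),
    UsabilityFinding("Delete action has no undo or confirmation",
                     NIELSEN_HEURISTICS[2], 4,
                     "Add an undo option or confirmation dialog"),
]
ranked = rank_findings(findings)
print(ranked[0].heuristic)  # the most severe finding comes first
```

Ranking by severity is what lets a small team act on the report top-down without triaging every finding themselves.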

Section 04

Dataset Composition and Experimental Design

The method's effectiveness was verified on two real-world applications:

  • EventHelpR (event management app): screen recordings of tasks such as registration and event management, covering organizer and participant roles;
  • KnowledgeCheckR (knowledge quiz app): screen recordings of scenarios such as quiz participation and quiz creation, covering student and teacher roles.

Each task is accompanied by a structured task-description JSON to facilitate experiment reproduction.
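Since each recording is paired with a task-description JSON, a reproduction script can load it directly. The exact schema is not shown in the post, so the fields below (`app`, `role`, `task`, `steps`) are invented purely for illustration:

```python
import json

# Hypothetical task description; the real files in the reproduction
# package may use different field names.
raw = """
{
  "app": "EventHelpR",
  "role": "participant",
  "task": "Register for an event",
  "steps": ["Open event list", "Select event", "Submit registration form"]
}
"""

task = json.loads(raw)
print(f"{task['app']} / {task['role']}: {task['task']} "
      f"({len(task['steps'])} steps)")
```

Keeping task descriptions in a machine-readable format like this is what makes the experiments repeatable: the same context can be fed to the MLLM verbatim on every run.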

Section 05

Evaluation Results and Value

A user study with software engineers evaluated the usefulness, accuracy, and actionability of the highest-priority suggestions. The results indicate that the method can surface worthwhile improvements at low cost; while it cannot fully replace traditional expert evaluation, it works well as a supplementary tool. Each suggestion includes a problem description, the violated heuristic, a severity level, and a proposed fix, giving developers a clear path to remediation.

Section 06

Technical Implementation and Reproduction Guide

A complete reproduction package is provided: original screen recordings and task descriptions, JSON-formatted analysis reports, evaluation notebooks (browsing/reproduction modes), and anonymized user study data. Reproduction process: Clone the repository → Create a virtual environment → Install dependencies → Run the Jupyter Notebook.
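The four reproduction steps translate to roughly the following commands. The repository URL and notebook filename are placeholders, since the post does not give them:

```shell
# Clone the reproduction package (placeholder URL)
git clone https://example.com/fse2026-usability-repro.git
cd fse2026-usability-repro

# Create and activate an isolated virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the pinned dependencies
pip install -r requirements.txt

# Launch the evaluation notebook (filename is a placeholder)
jupyter notebook evaluation.ipynb
```

Using a dedicated virtual environment keeps the notebook's pinned dependency versions from conflicting with packages installed elsewhere on the machine.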

Section 07

Significance, Limitations, and Future Directions

Significance: Lowers the evaluation threshold, expands MLLM application scenarios in software engineering, and lays the foundation for tool integration. Limitations: MLLMs may miss context-specific issues and depend on video quality. Future directions: Expand to mobile/AR/VR interfaces, dynamic evaluation, and fine-grained severity models.