IMUG-Bench: An Evaluation Benchmark for Interleaved Text-Image Dialogue Capabilities of Unified Multimodal Models

IMUG-Bench is the first benchmark to systematically evaluate unified multimodal models (UMMs) in multi-turn interleaved text-image dialogues. It shows that mainstream models suffer from significant exposure bias on the generation side and verifies that test-time scaling strategies are effective.

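Tags: unified multimodal models, text-image dialogue, evaluation benchmark, exposure bias, test-time scaling, chain of thought, multi-turn interaction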

Published 2026-06-08 16:08 | Recent activity 2026-06-09 13:28 | Estimated read 11 min

Section 01

Introduction: IMUG-Bench—A New Evaluation Benchmark for Interleaved Text-Image Dialogue Capabilities of Unified Multimodal Models

Core Insights: IMUG-Bench is the first evaluation benchmark to systematically assess the performance of unified multimodal models (UMMs) in multi-turn interleaved text-image dialogues. It reveals that mainstream models have significant exposure bias on the generation side and verifies the effectiveness of test-time scaling strategies.

Source Information:

Original authors: arXiv paper team
Source platform: arXiv
Publication time: June 8, 2026
Original link: http://arxiv.org/abs/2606.09169v1

This benchmark fills a gap in existing evaluations, which do not cover dynamic multi-turn interaction, and provides a key reference point for the development of UMMs.

Section 02

Research Background: Challenges of Unified Multimodal Models and Limitations of Existing Benchmarks

Rise of Unified Multimodal Models

In recent years, unified multimodal models (UMMs) have become an important direction in the AI field, supporting both understanding and generation tasks within a single framework and processing multimodal inputs and outputs such as images and text.

Challenges in Real-World Scenarios

UMMs face challenges in dynamic multi-turn interleaved text-image dialogues: they need to understand text and images in dialogue history, generate appropriate text-image responses, and maintain multi-turn consistency (e.g., a user first asks about a scenic spot, then follows up with a question about local food and requests an image).

Limitations of Existing Benchmarks

Single-turn or static settings: Most only test single-turn or static text-image pairs
Ignore exposure bias: Do not consider exposure bias in multi-turn interactions
Lack dynamic understanding: Do not support complex dynamic scenarios

These limitations mean existing benchmarks cannot fully evaluate the practical application capabilities of UMMs.

Section 03

IMUG-Bench Benchmark Design: Detailed Dataset and Category Explanation

IMUG-Bench is the first comprehensive evaluation benchmark for the multi-turn interleaved text-image dialogue capabilities of UMMs, with the following design:

Dataset Scale

3,113 samples covering diverse real-world scenarios
12,034 interaction turns, an average of about 3.9 turns per sample

Three Categories

Static Spatial Category: Focuses on spatial relationships and object attributes, e.g., "How many people are in the picture?", requiring fine-grained visual understanding and spatial reasoning
Temporal Causal Category: Involves temporal and causal relationships, e.g., "Based on the previous images, what will happen next?", requiring temporal reasoning and cross-image association
Mixed Category: Complex scenarios combining static spatial and temporal causal aspects, requiring both sets of capabilities and switching between modalities
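
To make the dataset description above concrete, here is a minimal sketch of how one benchmark sample could be represented in code. The class names, field names, and the `expects_image` flag are hypothetical illustrations, not the benchmark's actual schema, which this article does not describe. Only the three category names and the reported totals (3,113 samples, 12,034 turns) come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Category(Enum):
    # The three categories named in the article.
    STATIC_SPATIAL = "static_spatial"
    TEMPORAL_CAUSAL = "temporal_causal"
    MIXED = "mixed"


@dataclass
class Turn:
    # Hypothetical fields: one user request within a dialogue.
    user_text: str
    user_images: List[str] = field(default_factory=list)  # assumed: image file paths
    expects_image: bool = False  # assumed: whether the reply should contain an image


@dataclass
class DialogueSample:
    # Hypothetical container for one multi-turn sample.
    sample_id: str
    category: Category
    turns: List[Turn] = field(default_factory=list)


def average_turns(samples: List[DialogueSample]) -> float:
    """Mean number of turns per sample (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return sum(len(s.turns) for s in samples) / len(samples)


# The scenic-spot example from the article, written as a two-turn sample.
example = DialogueSample(
    sample_id="demo-0001",
    category=Category.MIXED,
    turns=[
        Turn(user_text="Tell me about this scenic spot.", user_images=["spot.jpg"]),
        Turn(user_text="What is the local food like? Show me a picture.", expects_image=True),
    ],
)
print(average_turns([example]))

# Average implied by the reported totals: 12,034 turns over 3,113 samples.
print(round(12034 / 3113, 2))
```

The last line reproduces the "about 4 turns per sample" figure from the reported totals; the exact value is closer to 3.87.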

Dynamic Understanding Questions

Specifically designed dynamic understanding questions require models to track changes in dialogue state, update understanding, and handle information conflicts, which are closer to real interactions.

Section 04

Experimental Findings: Capability Boundaries of UMMs and Exposure Bias on the Generation Side

Evaluated Models

Covers mainstream open-source models (LLaVA, Qwen-VL, InternVL, etc.) and closed-source models (GPT-4V/GPT-4o, Gemini, etc.).

Capability Boundaries

Understanding Side: Performs well on static spatial questions, but still faces challenges in temporal understanding and fine-grained localization
Generation Side: Image generation quality varies, text is prone to deviating from the topic, and cross-modal consistency is poor

Failure Modes

Common failures: Context forgetting, modality confusion, hallucinated content, style drift

Key Finding: Significant Exposure Bias on the Generation Side

Exposure bias refers to the mismatch between training and inference: during training the model conditions on ground-truth context and is never exposed to its own outputs, whereas at inference it must condition on self-generated content. The result is error accumulation and reduced diversity. In multi-turn dialogues, it manifests as performance degradation as the number of turns increases, intensified bias during modality switching, and over-reliance on recent context.
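
One symptom named above, performance degradation with increasing turns, can be checked with a simple per-turn aggregation. The sketch below is an illustration only and is not the paper's metric: it assumes each dialogue has already been scored turn by turn on some 0-to-1 scale (the scores here are made up), averages the scores at each turn index, and fits a least-squares slope. A clearly negative slope would be consistent with errors accumulating across turns.

```python
from statistics import mean
from typing import List


def mean_score_by_turn(dialogue_scores: List[List[float]]) -> List[float]:
    """Average score at each turn index across dialogues of varying length."""
    if not dialogue_scores:
        return []
    max_turns = max(len(scores) for scores in dialogue_scores)
    return [
        mean(scores[t] for scores in dialogue_scores if len(scores) > t)
        for t in range(max_turns)
    ]


def slope(values: List[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Made-up per-turn scores for two dialogues (not results from the paper).
scores = [
    [0.9, 0.8, 0.7],
    [0.8, 0.6, 0.5, 0.4],
]
per_turn = mean_score_by_turn(scores)
print(per_turn)
print(slope(per_turn))  # negative here: scores fall as the dialogue goes on
```

Averaging by turn index rather than by dialogue keeps long and short dialogues comparable, at the cost of later turns resting on fewer samples.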

Section 05

Validation of the Effectiveness of Test-Time Scaling Strategies

The study verifies that multiple test-time scaling strategies can effectively improve generation accuracy and mitigate exposure bias:

Chain of Thought (CoT): Step-by-step reasoning before generation improves generation quality by 15-25% and strengthens logical consistency, but increases computational overhead by 2-3 times
Self-Validation: The model generates multiple candidates and evaluates them itself to select the best, improving accuracy by 10-20% and reducing errors and hallucinations
Best-of-N Sampling: The model generates N candidates and the highest-scoring one is selected, significantly improving generation tasks, with better image quality and text coherence

Combined Strategy: Combining strategies (e.g., CoT + Best-of-N) can achieve the best results, and adaptive approaches select a strategy dynamically based on the task.
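
The Best-of-N idea described above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the paper's implementation: `generate` and `score` are hypothetical callables standing in for a model call and a quality scorer, and the toy versions below exist only so the example runs without a model.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: one model sample per call
    score: Callable[[str, str], float],  # hypothetical: higher means better
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(candidate, score(prompt, candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])


def make_toy_generator(outputs: List[str]) -> Callable[[str], str]:
    """Toy stand-in for a model: returns the given outputs one by one."""
    iterator = iter(outputs)
    return lambda prompt: next(iterator)


def toy_score(prompt: str, candidate: str) -> float:
    """Toy stand-in for a scorer: counts words shared with the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(len(prompt_words & set(candidate.lower().split())))


prompt = "draw a red bridge at sunset"
generator = make_toy_generator(
    ["a blue boat", "a red bridge", "a red bridge at sunset", "sunset"]
)
best, best_score = best_of_n(prompt, generator, toy_score, n=4)
print(best, best_score)  # a red bridge at sunset 5.0
```

Cost grows linearly with N, since every candidate is generated and scored. In a combined CoT + Best-of-N setup, `generate` would be the step that reasons before answering; when the model scores its own candidates, the same loop corresponds to the self-validation strategy.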

Section 06

Implications and Recommendations for UMM Development

Architecture Design

Balance understanding encoders and generation decoders
Enhance long-range memory mechanisms
Improve cross-modal representation consistency

Training Strategies

Introduce adversarial training and curriculum learning to mitigate exposure bias
Train using real multi-turn dialogue data
Learn multi-turn interaction strategies from human feedback

Evaluation Methods

Adopt dynamic evaluation to test multi-turn interaction capabilities
Use evaluation data closer to real applications
Deeply analyze performance across different capability dimensions

These recommendations point to concrete directions for optimizing UMMs.

Section 07

Limitations and Future Directions

Limitations of IMUG-Bench

Scale limitation: At 3,113 samples, the benchmark is still relatively small
Language limitation: Mainly focuses on English scenarios
Domain coverage: Insufficient coverage of professional fields such as medical and legal

Future Research Directions

Build larger-scale evaluation datasets
Expand to multilingual scenarios (Chinese, Japanese, etc.)
Evaluate model performance in real-time dialogues
Assess the model's ability to adapt to personal preferences

Further work is needed to improve the benchmark and promote the practical application of UMMs.

Section 08

Conclusion: Significance and Value of IMUG-Bench

IMUG-Bench marks important progress in UMM evaluation. By systematically assessing multi-turn interleaved text-image dialogue capabilities, it reveals the capability boundaries of current models and the problem of exposure bias on the generation side.

The effectiveness of test-time scaling strategies (e.g., Chain of Thought, Self-Validation) provides practical guidance for real-world deployment. This work emphasizes that evaluation is not just about scoring but about understanding a model's capabilities and limitations, thereby guiding future research and development and moving UMMs toward practical use.
