IMUG-Bench: An Evaluation Benchmark for Interleaved Text-Image Dialogue Capabilities of Unified Multimodal Models

IMUG-Bench is presented as the first benchmark to systematically evaluate unified multimodal models (UMMs) in multi-turn interleaved text-image dialogues. It reveals that mainstream models show significant exposure bias on the generation side and verifies that test-time scaling strategies are effective.

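Tags: unified multimodal models, text-image dialogue, evaluation benchmark, exposure bias, test-time scaling, chain of thought, multi-turn interaction
Published 2026-06-08 16:08 | Recent activity 2026-06-09 13:28 | Estimated read 11 min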

Section 01

Introduction: IMUG-Bench—A New Evaluation Benchmark for Interleaved Text-Image Dialogue Capabilities of Unified Multimodal Models

Core Insights: IMUG-Bench is the first evaluation benchmark to systematically assess the performance of unified multimodal models (UMMs) in multi-turn interleaved text-image dialogues. It reveals that mainstream models have significant exposure bias on the generation side and verifies the effectiveness of test-time scaling strategies.

This benchmark fills the gap in existing evaluations for dynamic multi-turn interaction scenarios and provides key references for the development of UMMs.

Section 02

Research Background: Challenges of Unified Multimodal Models and Limitations of Existing Benchmarks

Rise of Unified Multimodal Models

In recent years, unified multimodal models (UMMs) have become an important direction in the AI field, supporting both understanding and generation tasks within a single framework and processing multimodal inputs and outputs such as images and text.

Challenges in Real-World Scenarios

UMMs face challenges in dynamic multi-turn interleaved text-image dialogues: they need to understand text and images in dialogue history, generate appropriate text-image responses, and maintain multi-turn consistency (e.g., a user first asks about a scenic spot, then follows up with a question about local food and requests an image).

Limitations of Existing Benchmarks

  • Single-turn or static settings: Most only test single-turn or static text-image pairs
  • Ignore exposure bias: Do not consider exposure bias in multi-turn interactions
  • Lack dynamic understanding: Do not support complex dynamic scenarios

These limitations mean existing benchmarks cannot fully evaluate the practical application capabilities of UMMs.

Section 03

IMUG-Bench Benchmark Design: Detailed Dataset and Category Explanation

IMUG-Bench is the first comprehensive evaluation benchmark for the multi-turn interleaved text-image dialogue capabilities of UMMs, with the following design:

Dataset Scale

  • 3,113 samples covering diverse real-world scenarios
  • 12,034 interaction turns, with an average of about 4 turns per sample
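
As a quick check on the "about 4 turns" figure, the average follows directly from the two counts above. The variable names below are only illustrative.

```python
# Counts as stated for IMUG-Bench above.
num_samples = 3113
num_turns = 12034

# Mean number of interaction turns per sample.
avg_turns = num_turns / num_samples
print(f"{avg_turns:.2f}")  # 3.87
```

So the rounded "about 4" corresponds to roughly 3.87 turns per sample.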

Three Categories

  1. Static Spatial Category: Focuses on spatial relationships and object attributes, e.g., "How many people are in the picture?", requiring fine-grained visual understanding and spatial reasoning
  2. Temporal Causal Category: Involves temporal and causal relationships, e.g., "Based on the previous images, what will happen next?", requiring temporal reasoning and cross-image association
  3. Mixed Category: Complex scenarios combining static spatial and temporal causal aspects, requiring comprehensive capabilities and modal switching
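
To make the three categories concrete, here is a minimal sketch of how one benchmark sample might be represented. The article does not describe the actual data schema, so every class, field, and file name below (`Category`, `Turn`, `DialogueSample`, `street.png`) is a hypothetical chosen for illustration, and the demo dialogue is made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Category(Enum):
    # The three categories named in the benchmark description.
    STATIC_SPATIAL = "static_spatial"
    TEMPORAL_CAUSAL = "temporal_causal"
    MIXED = "mixed"


@dataclass
class Turn:
    role: str                          # "user" or "model"
    text: Optional[str] = None         # text part of the turn, if any
    image_path: Optional[str] = None   # image part of the turn, if any


@dataclass
class DialogueSample:
    sample_id: str
    category: Category
    turns: List[Turn] = field(default_factory=list)

    def num_turns(self) -> int:
        return len(self.turns)

    def is_interleaved(self) -> bool:
        # True when the dialogue contains both text and image content.
        has_text = any(t.text for t in self.turns)
        has_image = any(t.image_path for t in self.turns)
        return has_text and has_image


# Made-up example sample (assumption, not taken from the dataset).
sample = DialogueSample(
    sample_id="demo-0001",
    category=Category.STATIC_SPATIAL,
    turns=[
        Turn(role="user", text="How many people are in the picture?",
             image_path="street.png"),
        Turn(role="model", text="There are three people."),
    ],
)
```

Keeping text and image as optional fields on each turn lets a single structure cover text-only, image-only, and mixed turns.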

Dynamic Understanding Questions

Specifically designed dynamic understanding questions require models to track changes in dialogue state, update understanding, and handle information conflicts, which are closer to real interactions.

Section 04

Experimental Findings: Capability Boundaries of UMMs and Exposure Bias on the Generation Side

Evaluation Model Scope

Covers mainstream open-source models (LLaVA, Qwen-VL, InternVL, etc.) and closed-source models (GPT-4V/GPT-4o, Gemini, etc.).

Capability Boundaries

  • Understanding Side: Performs well on static spatial questions, but still faces challenges in temporal understanding and fine-grained localization
  • Generation Side: Image generation quality varies, text is prone to deviating from the topic, and cross-modal consistency is poor

Failure Modes

Common failures: context forgetting, modality confusion, hallucination, and style drift

Key Finding: Significant Exposure Bias on the Generation Side

Exposure bias refers to the mismatch between training and inference that arises because a model is trained on ground-truth context but, at inference time, must condition on its own earlier outputs, which it was never exposed to during training. This leads to error accumulation and a lack of diversity. In multi-turn dialogues, it manifests as performance degradation as the number of turns increases, intensified bias during modality switching, and over-reliance on recent context.
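
One simple way to look for the turn-wise degradation described above is to average scores per turn index and compare the first and last turns. This is a minimal sketch under assumptions: the `(turn_index, score)` record format, the function names, and the scores are all hypothetical and are not the benchmark's actual metric or results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_score_by_turn(records: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average score for each 1-based turn index."""
    buckets = defaultdict(list)
    for turn_index, score in records:
        buckets[turn_index].append(score)
    return {t: mean(s) for t, s in sorted(buckets.items())}


def degradation(per_turn: Dict[int, float]) -> float:
    """Drop from the first to the last turn; positive means scores got worse."""
    turns = sorted(per_turn)
    return per_turn[turns[0]] - per_turn[turns[-1]]


# Made-up scores purely for illustration, not benchmark results.
records = [(1, 1.0), (1, 0.5), (2, 0.75), (2, 0.25), (3, 0.5), (3, 0.0)]
per_turn = mean_score_by_turn(records)
print(per_turn)               # {1: 0.75, 2: 0.5, 3: 0.25}
print(degradation(per_turn))  # 0.5
```

A positive degradation value that grows with dialogue length would be consistent with error accumulation, though it does not by itself prove exposure bias is the cause.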

Section 05

Validation of the Effectiveness of Test-Time Scaling Strategies

The study verifies that multiple test-time scaling strategies can effectively improve generation accuracy and mitigate exposure bias:

  1. Chain of Thought (CoT): Step-by-step reasoning before generation improves generation quality by 15-25% and strengthens logical consistency, but raises computational overhead by 2-3 times
  2. Self-Validation: Generate multiple candidates and self-evaluate to select the best, improving accuracy by 10-20% and reducing errors and hallucinations
  3. Best-of-N Sampling: Generate N candidates and select the highest-scoring one, significantly improving generation tasks with better image quality and text coherence
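
The Best-of-N idea can be sketched in a few lines. This is an illustration under assumptions, not the study's implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer stand in for a real UMM call and a real quality scorer.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[str, random.Random], T],
    score: Callable[[str, T], float],
    prompt: str,
    n: int = 4,
    seed: int = 0,
) -> Tuple[T, float]:
    """Sample n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates: List[T] = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (assumptions): a real setup would call a model and a scorer.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return prompt + " " + rng.choice(["a", "bb", "ccc"])


def toy_score(prompt: str, candidate: str) -> float:
    return float(len(candidate))  # longer counts as "better" in this toy case


best, best_score = best_of_n(toy_generate, toy_score, "reply:", n=8, seed=0)
```

If the `score` function is the generating model evaluating its own candidates, the same loop approximates the Self-Validation strategy described above. Cost grows linearly with `n`, since each candidate needs one generation and one scoring call.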

Combined Strategy: Combining strategies (e.g., CoT + Best-of-N) achieves the best results, and an adaptive approach selects the strategy dynamically according to the task.

Section 06

Implications and Recommendations for UMM Development

Architecture Design

  • Balance understanding encoders and generation decoders
  • Enhance long-range memory mechanisms
  • Improve cross-modal representation consistency

Training Strategies

  • Introduce adversarial training and curriculum learning to mitigate exposure bias
  • Train using real multi-turn dialogue data
  • Learn multi-turn interaction strategies from human feedback

Evaluation Methods

  • Adopt dynamic evaluation to test multi-turn interaction capabilities
  • Use evaluation data closer to real applications
  • Deeply analyze performance across different capability dimensions

These recommendations provide clear guidance for the optimization direction of UMMs.

Section 07

Limitations and Future Directions

Limitations of IMUG-Bench

  • Scale limitation: 3K+ samples are still insufficient
  • Language limitation: Mainly focuses on English scenarios
  • Domain coverage: Insufficient coverage of professional fields such as medical and legal

Future Research Directions

  • Build larger-scale evaluation datasets
  • Expand to multilingual scenarios (Chinese, Japanese, etc.)
  • Evaluate model performance in real-time dialogues
  • Assess the model's ability to adapt to personal preferences

Further work on the benchmark is needed to support the practical application of UMMs.

Section 08

Conclusion: Significance and Value of IMUG-Bench

IMUG-Bench represents an important step forward in UMM evaluation. By systematically assessing multi-turn interleaved text-image dialogue capabilities, it reveals the capability boundaries of current models and the problem of exposure bias on the generation side.

The effectiveness of test-time scaling strategies (e.g., Chain of Thought, Self-Validation) provides practical guidance for real-world deployment. This work emphasizes that evaluation is not just about scoring but about understanding a model's capabilities and limitations, thereby guiding future research and development and moving UMMs toward real practical use.