# IMUG-Bench: An Evaluation Benchmark for Interleaved Text-Image Dialogue Capabilities of Unified Multimodal Models

> IMUG-Bench is the first benchmark to systematically evaluate unified multimodal models (UMMs) in multi-turn interleaved text-image dialogues. It reveals that mainstream models suffer from significant exposure bias on the generation side and verifies the effectiveness of test-time scaling strategies.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-08T08:08:20.000Z
- Last activity: 2026-06-09T05:28:56.804Z
- Popularity: 136.7
- Keywords: unified multimodal models, text-image dialogue, evaluation benchmark, exposure bias, test-time scaling, chain of thought, multi-turn interaction
- Page link: https://www.zingnex.cn/en/forum/thread/imug-bench-5b42fe99
- Canonical: https://www.zingnex.cn/forum/thread/imug-bench-5b42fe99
- Markdown source: floors_fallback

---

## Introduction: IMUG-Bench, a New Evaluation Benchmark for Interleaved Text-Image Dialogue Capabilities of Unified Multimodal Models

**Core Insights**: IMUG-Bench is the first evaluation benchmark to systematically assess the performance of unified multimodal models (UMMs) in multi-turn interleaved text-image dialogues. It reveals that mainstream models have significant exposure bias on the generation side and verifies the effectiveness of test-time scaling strategies.

**Source Information**:
- Original authors: arXiv paper team
- Source platform: arXiv
- Publication time: June 8, 2026
- Original link: http://arxiv.org/abs/2606.09169v1

This benchmark fills the gap in existing evaluations for dynamic multi-turn interaction scenarios and provides key references for the development of UMMs.

## Research Background: Challenges of Unified Multimodal Models and Limitations of Existing Benchmarks

### Rise of Unified Multimodal Models
In recent years, unified multimodal models (UMMs) have become an important direction in the AI field, supporting both understanding and generation tasks within a single framework and processing multimodal inputs and outputs such as images and text.

### Challenges in Real-World Scenarios
UMMs face challenges in dynamic multi-turn interleaved text-image dialogues: they need to understand text and images in dialogue history, generate appropriate text-image responses, and maintain multi-turn consistency (e.g., a user first asks about a scenic spot, then follows up with a question about local food and requests an image).

### Limitations of Existing Benchmarks
- Single-turn or static settings: Most only test single-turn or static text-image pairs
- Ignore exposure bias: Do not consider exposure bias in multi-turn interactions
- Lack dynamic understanding: Do not support complex dynamic scenarios

These limitations mean existing benchmarks cannot fully evaluate the practical application capabilities of UMMs.

## IMUG-Bench Benchmark Design: Detailed Dataset and Category Explanation

IMUG-Bench is the first comprehensive evaluation benchmark for the multi-turn interleaved text-image dialogue capabilities of UMMs, with the following design:

### Dataset Scale
- 3,113 samples covering diverse real-world scenarios
- 12,034 interaction turns, an average of about 3.9 turns per sample (12,034 / 3,113)
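
To make the scale figures concrete, here is a minimal sketch of how a multi-turn interleaved sample might be represented and how the average turn count follows from the reported totals. The `Turn` and `Sample` classes, their field names, and the category strings are hypothetical; the source does not describe the benchmark's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One dialogue turn (hypothetical schema); text and image are both optional."""
    role: str                         # "user" or "model"
    text: Optional[str] = None        # text content, if any
    image_path: Optional[str] = None  # path to an image, if any


@dataclass
class Sample:
    """A multi-turn interleaved dialogue (hypothetical schema)."""
    sample_id: str
    category: str                     # assumed label, e.g. "static_spatial"
    turns: List[Turn] = field(default_factory=list)


def average_turns(samples: List[Sample]) -> float:
    """Mean number of turns per sample; 0.0 for an empty list."""
    if not samples:
        return 0.0
    return sum(len(s.turns) for s in samples) / len(samples)


# Sanity check on the totals reported in the post.
reported_avg = 12034 / 3113  # roughly 3.87 turns per sample
```

Whether a "turn" counts a single message or a user-model exchange is not stated in the source, so the schema above treats each message as one turn purely for illustration.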

### Three Categories
1. **Static Spatial Category**: Focuses on spatial relationships and object attributes, e.g., "How many people are in the picture?", requiring fine-grained visual understanding and spatial reasoning
2. **Temporal Causal Category**: Involves temporal and causal relationships, e.g., "Based on the previous images, what will happen next?", requiring temporal reasoning and cross-image association
3. **Mixed Category**: Complex scenarios combining static spatial and temporal causal aspects, requiring comprehensive capabilities and modal switching

### Dynamic Understanding Questions
Specially designed dynamic understanding questions require models to track changes in dialogue state, update their understanding, and handle conflicting information, which brings the evaluation closer to real interactions.

## Experimental Findings: Capability Boundaries of UMMs and Exposure Bias on the Generation Side

### Evaluation Model Scope
The evaluation covers mainstream open-source models (LLaVA, Qwen-VL, InternVL, etc.) and closed-source models (GPT-4V/GPT-4o, Gemini, etc.).

### Capability Boundaries
- **Understanding Side**: Performs well on static spatial questions, but still faces challenges in temporal understanding and fine-grained localization
- **Generation Side**: Image generation quality varies, text is prone to deviating from the topic, and cross-modal consistency is poor

### Failure Modes
Common failure modes include context forgetting, modality confusion, hallucinated content, and style drift.

### Key Finding: Significant Exposure Bias on the Generation Side
Exposure bias is the mismatch between training and inference that arises because a model is trained with ground-truth (reference) context but, at inference time, must condition on its own previously generated outputs. Early mistakes therefore feed into later steps, leading to error accumulation and reduced diversity. In multi-turn dialogues it manifests as performance degradation as the number of turns increases, intensified bias during modality switching, and over-reliance on recent context.
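
One way to quantify this effect, sketched here under assumptions, is to compare per-turn accuracy when the dialogue history contains reference outputs against per-turn accuracy when it contains the model's own earlier outputs. The function names and the toy data are invented for illustration; the source does not describe the paper's measurement protocol.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_turn(results: List[List[bool]]) -> Dict[int, float]:
    """Map 1-based turn index to the fraction of dialogues correct at that turn."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for dialogue in results:
        for turn_idx, correct in enumerate(dialogue, start=1):
            totals[turn_idx] += 1
            hits[turn_idx] += int(correct)
    return {t: hits[t] / totals[t] for t in sorted(totals)}


def exposure_gap(gold_history: List[List[bool]],
                 self_history: List[List[bool]]) -> Dict[int, float]:
    """Per-turn accuracy drop when the model conditions on its own earlier
    outputs instead of reference outputs."""
    gold = accuracy_by_turn(gold_history)
    own = accuracy_by_turn(self_history)
    return {t: gold[t] - own[t] for t in gold if t in own}


# Toy correctness flags, invented for illustration only.
gold_runs = [[True, True, True], [True, True, False]]
self_runs = [[True, False, False], [True, True, False]]
gap = exposure_gap(gold_runs, self_runs)
```

A gap that widens with the turn index would be consistent with errors accumulating across turns, although other factors such as longer context can contribute as well.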

## Validation of the Effectiveness of Test-Time Scaling Strategies

The study verifies that multiple test-time scaling strategies can effectively improve generation accuracy and mitigate exposure bias:

1. **Chain of Thought (CoT)**: Step-by-step reasoning before generation improves generation quality by 15-25% and also improves logical consistency, but increases computational overhead by a factor of 2-3
2. **Self-Validation**: Generate multiple candidates and self-evaluate to select the best, improving accuracy by 10-20% and reducing errors and hallucinations
3. **Best-of-N Sampling**: Generate N candidates and select the highest-scoring one, significantly improving generation tasks with better image quality and text coherence

**Combined Strategies**: Combining strategies (e.g., CoT + Best-of-N) can achieve the best results, and adaptive approaches select among the strategies dynamically according to the task.
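
The selection step shared by Self-Validation and Best-of-N sampling can be sketched as follows. This is a generic illustration, not the paper's implementation: `generate` and `score` are placeholder callables standing in for a model call and a scoring function, and the toy drafts and scorer are invented.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[], T],
              score: Callable[[T], float],
              n: int) -> Tuple[T, float]:
    """Draw n candidates and return the highest-scoring one with its score.
    Ties are resolved in favor of the earliest candidate."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[T] = [generate() for _ in range(n)]
    scores = [score(c) for c in candidates]
    best_idx = max(range(n), key=scores.__getitem__)
    return candidates[best_idx], scores[best_idx]


# Toy stand-ins: a "model" that emits canned drafts and a scorer that
# prefers drafts mentioning the requested subject, with a small length bonus.
drafts = iter(["a dog", "a red bridge at dusk", "a bridge"])
pick, pick_score = best_of_n(
    generate=lambda: next(drafts),
    score=lambda d: float(d.count("bridge")) + 0.01 * len(d),
    n=3,
)
print(pick)
```

In this framing, Self-Validation corresponds to using the model itself as `score`, and a CoT + Best-of-N combination corresponds to a `generate` that reasons step by step before producing each candidate. Cost grows roughly linearly with `n`, since every candidate requires a full generation.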

## Implications and Recommendations for UMM Development

### Architecture Design
- Balance understanding encoders and generation decoders
- Enhance long-range memory mechanisms
- Improve cross-modal representation consistency

### Training Strategies
- Introduce adversarial training and curriculum learning to mitigate exposure bias
- Train using real multi-turn dialogue data
- Learn multi-turn interaction strategies from human feedback

### Evaluation Methods
- Adopt dynamic evaluation to test multi-turn interaction capabilities
- Use evaluation data closer to real applications
- Deeply analyze performance across different capability dimensions

These recommendations point to concrete directions for optimizing UMMs.

## Limitations and Future Directions

### Limitations of IMUG-Bench
- Scale limitation: At 3,113 samples, the benchmark is still relatively small
- Language limitation: Mainly focuses on English scenarios
- Domain coverage: Insufficient coverage of professional fields such as medical and legal

### Future Research Directions
- Build larger-scale evaluation datasets
- Expand to multilingual scenarios (Chinese, Japanese, etc.)
- Evaluate model performance in real-time dialogues
- Assess the model's ability to adapt to personal preferences

Further work is needed to improve the benchmark and promote the practical application of UMMs.

## Conclusion: Significance and Value of IMUG-Bench

IMUG-Bench marks important progress in UMM evaluation. By systematically assessing multi-turn interleaved text-image dialogue capabilities, it reveals the capability boundaries of current models and the problem of exposure bias on the generation side.

The effectiveness of test-time scaling strategies (e.g., Chain of Thought, Self-Validation) provides practical guidance for real-world deployment. This work emphasizes that evaluation is not just about scoring but about understanding a model's capabilities and limitations, thereby guiding future research and development and moving UMMs toward genuine practical use.
