# XTC-Bench: Cross-Task Consistency Evaluation for Unified Multimodal Models

> XTC-Bench, via a scene graph-driven evaluation framework and the CCTA metric, systematically evaluates the semantic consistency between understanding and generation tasks for unified multimodal models for the first time, and finds that high accuracy does not equal high consistency.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-04-27T23:57:29.000Z
- 最近活动: 2026-04-29T03:05:37.111Z
- 热度: 130.9
- 关键词: 统一多模态模型, 跨任务一致性, XTC-Bench, 场景图, 视觉理解, 视觉生成, 模型评测
- 页面链接: https://www.zingnex.cn/en/forum/thread/xtc-bench
- Canonical: https://www.zingnex.cn/forum/thread/xtc-bench
- Markdown 来源: floors_fallback

---

## XTC-Bench: A New Breakthrough in Cross-Task Consistency Evaluation for Unified Multimodal Models

This article introduces XTC-Bench—a scene graph-driven evaluation framework that, using the CCTA metric, systematically assesses the semantic consistency between understanding and generation tasks for unified multimodal models for the first time. Key findings include: high accuracy does not equal high consistency, and architectural unification does not imply representational unification, which provides critical insights for model development.

## Background: The Promise of Unified Multimodal Models and Cross-Task Consistency Issues

Unified multimodal models (uMMs) promise knowledge sharing, efficiency improvement, and semantic consistency, but existing evaluations independently assess understanding and generation capabilities without examining their semantic alignment. Cross-task consistency refers to the model's internal representation of the same visual concept remaining consistent across understanding (e.g., image captioning) and generation (e.g., text-to-image) tasks. A lack of consistency leads the model to only superficially match training data, greatly reducing its practicality.

## Methodology: XTC-Bench Evaluation Framework and CCTA Metric Design

XTC-Bench constructs a bidirectional evaluation based on scene graphs (structured semantic representations containing objects, attributes, and relationships): generating test images (for understanding tasks) and text prompts (for generation tasks) from scene graphs, then comparing their semantic facts. The CCTA metric performs continuous scoring at the atomic fact level (object existence, correct attributes, accurate relationships), isolating internal consistency from independent task accuracy to avoid confusion.

## Experimental Findings: High Accuracy ≠ Consistency, Architectural Unification ≠ Representational Unification

Evaluations on 9 models show: 1. Some high-accuracy models have low consistency; 2. Consistency is dominated by learning objective coupling, cross-modal alignment mechanisms, and training data diversity, rather than whether the architecture is unified; 3. Object consistency is high, attributes are moderate, and relationships are the lowest (spatial/interaction relationships are the hardest to unify).

## Architecture Analysis: Key Designs to Promote Cross-Task Consistency

In terms of representation sharing methods, partial sharing + strong alignment objectives achieve the best balance; in training strategies, multi-task joint training and curriculum learning are more likely to promote consistency, while pre-training + fine-tuning needs to add consistency regularization; full sharing may sacrifice single-task performance, and separate representations depend on alignment quality.

## Implications: Guiding Recommendations for Unified Multimodal Model Development

We need to go beyond isolated task metrics and explicitly measure consistency using XTC-Bench; design explicit cross-modal alignment objectives (e.g., contrastive learning); focus on relationship understanding (relationship consistency is a weakness); adopt hierarchical representation learning to decompose objects, attributes, and relationships.

## Limitations and Future Directions: Improvement Space and Research Directions for XTC-Bench

Current limitations: Scene graphs cover static concepts, lacking dynamic/abstract scenes; evaluation granularity is limited to atomic facts; insufficient domain generalization. Future directions: Expand dynamic scene graphs, fine-grained pixel-level alignment, interactive multi-turn dialogue evaluation.

## Conclusion: The Key from 'Seemingly Unified' to 'Truly Unified'

XTC-Bench reveals that the 'unification' of unified multimodal models needs to be explicitly measured, rather than relying on architectural assumptions. Only through cross-task consistency evaluation can we ensure that models establish truly shared semantic representations, pushing the field from superficial unification to substantive unification.