Section 01
[Introduction] Alternating Visual Reasoner: Adaptive Strategy to Break the Dual Bottlenecks of Unified Multimodal Models
This paper addresses the gap between understanding and generation in unified multimodal models and proposes an adaptive alternating generation framework. The framework lets the model switch autonomously among three strategies (direct generation, self-reflection, and multi-step planning) according to the complexity of the instruction. Combined with a hierarchical data pipeline and a two-stage training strategy, it significantly improves both the fidelity and the text alignment of any-to-image (X2I) generation.