InterleaveThinker: A Multi-Agent Framework Enabling Any Image Generator to Achieve Text-Image Interleaved Generation

InterleaveThinker is a multi-agent framework that lets existing image generators perform text-image interleaved generation through collaboration between a planner agent and a critic agent. Trained with GRPO reinforcement learning, it achieves performance comparable to GPT-5 on interleaved generation benchmarks and significantly improves the performance of base models on reasoning tasks.

Tags: image generation, multi-agent, text-image interleaving, reinforcement learning, GRPO, visual narrative, multimodal
Published 2026-06-12 01:59 | Recent activity 2026-06-12 11:22 | Estimated read 7 min

Section 01

InterleaveThinker: Guide to the Multi-Agent Framework for Text-Image Interleaved Generation

InterleaveThinker is a multi-agent framework that lets existing image generators perform text-image interleaved generation through collaboration between a planner agent and a critic agent. Trained with GRPO reinforcement learning, it achieves performance comparable to GPT-5 on interleaved generation benchmarks and significantly improves the performance of base models on reasoning tasks.

Keywords: image generation, multi-agent, text-image interleaving, reinforcement learning, GRPO, visual narrative, multimodality.

Original source: arXiv, June 11, 2026, http://arxiv.org/abs/2606.13679v1.

Section 02

Background: Advances in Image Generation and Challenges of Text-Image Interleaving

In recent years, image generation models such as DALL-E, Stable Diffusion, and FLUX have performed very well at single-image generation and editing, but they share an architectural limitation: they cannot produce interleaved "text-image-text..." output. This capability is crucial for scenarios such as visual narrative, step-by-step guidance, and embodied operations. Existing open-source unified multimodal models show limited performance on this task. The core challenge is that these architectures lack sequence planning, self-evaluation, and iterative improvement capabilities.

Section 03

Method: Dual-Agent Architecture and GRPO Reinforcement Learning Training

InterleaveThinker adopts a dual-agent architecture. The planner decomposes a task into ordered steps, generates instructions, and maintains state. The critic evaluates outputs, identifies deviations, and refines the instructions.

Training proceeds in two stages. First, two supervised fine-tuning (SFT) datasets are constructed: Interleave-Planner-SFT-80k for the planner and Interleave-Critic-SFT-112k for the critic. Second, the critic is optimized with GRPO reinforcement learning on the Interleave-Critic-RL-13k dataset. Two rewards are designed for this stage: an accuracy reward, which scores single-step quality, and a step-level reward, which scores a step's impact on subsequent steps. Together they let single-step optimization take the global trajectory into account.
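
The planner and critic roles described above can be pictured as a simple control loop. The following is a minimal sketch, not the authors' implementation: the names `Step`, `Verdict`, `run_interleaved`, and `max_retries` are hypothetical, the image generator is stubbed out with a string-returning function, and the retry policy is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    text: str          # narrative text shown to the reader for this step
    instruction: str   # prompt sent to the image generator
    image: str = ""    # stand-in for the generated image (a string in this sketch)


@dataclass
class Verdict:
    accepted: bool
    revised_instruction: str


def run_interleaved(
    task: str,
    plan: Callable[[str], List[Step]],
    generate: Callable[[str], str],
    critique: Callable[[Step, List[Step]], Verdict],
    max_retries: int = 2,  # hypothetical retry budget, not taken from the paper
) -> List[Step]:
    """Plan ordered steps, then generate each image with critic-guided retries."""
    history: List[Step] = []  # accepted steps so far, passed to the critic as state
    for step in plan(task):
        for _ in range(max_retries + 1):
            step.image = generate(step.instruction)
            verdict = critique(step, history)
            if verdict.accepted:
                break
            # The critic rewrites the instruction before the next attempt.
            step.instruction = verdict.revised_instruction
        history.append(step)
    return history


# Toy stand-ins for the planner, the image generator, and the critic.
def toy_plan(task: str) -> List[Step]:
    return [Step(text=f"Step {i}: {task}", instruction=f"draw step {i}") for i in (1, 2)]


def toy_generate(instruction: str) -> str:
    return f"<image for: {instruction}>"


def toy_critique(step: Step, history: List[Step]) -> Verdict:
    ok = "detailed" in step.instruction
    return Verdict(ok, step.instruction if ok else step.instruction + ", detailed")


result = run_interleaved("fold a paper crane", toy_plan, toy_generate, toy_critique)
for s in result:
    print(s.text, "|", s.image)
```

Because the generator is passed in as a plain callable, any image model can be swapped in without changing the loop, which matches the framework's stated goal of working with existing generators.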

Section 04

Evidence: Experimental Results and Performance

On text-image interleaved generation benchmarks, InterleaveThinker performs comparably to Nano Banana and GPT-5. It improves the performance of a variety of base image generators, which indicates strong generality. An unexpected finding is that the framework also significantly enhances the reasoning ability of base models. For example, the 4-step FLUX.2-klein model improves on the WISE and RISE reasoning benchmarks, suggesting that the general reasoning ability cultivated by text-image interleaving training is transferable.

Section 05

Technical Insight: Why the Multi-Agent Framework Works

  1. Task decomposition: Complex text-image interleaved generation is split into planning and critique subtasks, each of which is easier to handle than the full task.
  2. Iterative improvement: The critic introduces a "generate-evaluate-improve" cycle, similar to human creative processes.
  3. Reinforcement learning generalization: GRPO training not only improves performance on specific tasks but also imparts general reasoning ability, with better cross-task generalization than single-task supervised learning.
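
To make the GRPO point more concrete, the sketch below shows the group-relative advantage computation that gives GRPO its name: several responses are sampled for the same input, and each reward is normalized against the mean and standard deviation of its own group. How the paper combines its accuracy and step-level rewards is not stated here, so the weighted sum in `combined_reward`, the weight `w_step`, the use of the population standard deviation, and the sample reward values are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy: float, step_level: float, w_step: float = 0.5) -> float:
    """Weighted sum of the two reward terms (the weighting is an assumption)."""
    return accuracy + w_step * step_level


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the statistics of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical critic responses sampled for the same generation step,
# given as (accuracy reward, step-level reward) pairs.
samples = [(1.0, 0.8), (1.0, 0.2), (0.0, 0.6), (0.0, 0.0)]
group = [combined_reward(a, s) for a, s in samples]
advantages = group_relative_advantages(group)

for reward, adv in zip(group, advantages):
    print(f"reward={reward:.2f}  advantage={adv:+.3f}")
```

Responses that score above the group mean receive positive advantages and are reinforced, while below-average ones are discouraged. No separate value network is needed, because the group itself serves as the baseline.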

Section 06

Application Prospects: Potential Cross-Domain Applications

  1. Content creation tools: Automatically generate comics, picture books, tutorials, etc.
  2. Educational applications: Dynamically generate personalized learning materials (text + diagrams).
  3. Embodied intelligence: Help robots understand and execute complex visual-language instructions, and adjust plans based on feedback.

Section 07

Limitations and Future Directions

Current limitations: high computational cost, generation latency, and error accumulation in long sequences. Future directions: improve efficiency by reducing the number of generation calls; explore end-to-end joint training of the agents; extend interleaved generation to other modalities such as video and audio.
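
A back-of-the-envelope calculation shows why cost and error accumulation grow with sequence length. This is only a sketch under simplifying assumptions: every step is retried up to a fixed budget, and steps succeed independently with the same probability. The function names and the numbers used (8 steps, 2 retries, 95% per-step success) are illustrative and do not come from the paper.

```python
def worst_case_generator_calls(num_steps: int, max_retries: int) -> int:
    """Upper bound on image-generator calls if every step uses its full retry budget."""
    return num_steps * (1 + max_retries)


def trajectory_success_probability(per_step_success: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent, identical steps."""
    return per_step_success ** num_steps


calls = worst_case_generator_calls(num_steps=8, max_retries=2)
p_all = trajectory_success_probability(per_step_success=0.95, num_steps=8)

print(f"worst-case generator calls: {calls}")
print(f"chance that all 8 steps succeed: {p_all:.2f}")
```

Under these assumptions, an 8-step sequence can require up to 24 generator calls, and even a 95% per-step success rate leaves only about a 66% chance that the whole sequence is error-free.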

Section 08

Conclusion: Breakthroughs and Significance of InterleaveThinker

InterleaveThinker is a notable advance in image generation. Through its multi-agent architecture and reinforcement learning, it gives existing models text-image interleaved generation capabilities, with performance on par with top proprietary models. The work also reports cross-task capability transfer: the reasoning ability gained from text-image interleaving training generalizes to other reasoning tasks. This offers new ideas for the development of multimodal AI and merits further research and application by developers.