Section 01
Unify-Agent: World-Knowledge-Grounded Image Generation with an Agent Architecture (Introduction)
Unified multimodal models rely on parametric knowledge stored in their weights, which limits them when generating images of long-tail and knowledge-intensive concepts. Unify-Agent reframes image generation as an agentic process with four stages: prompt understanding, multimodal evidence search, grounded re-description, and final synthesis. On the FactIP benchmark it significantly outperforms baselines and approaches the world-knowledge capability of the strongest closed-source models.
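The four stages can be read as a simple sequential pipeline. The sketch below is only an illustration of that control flow, not Unify-Agent's actual implementation: every name in it (`Evidence`, `Trace`, `understand`, `re_describe`, `run_pipeline`, `stub_search`, `stub_synthesize`) is hypothetical, and the search and synthesis steps are stand-in stubs where a real system would call a retrieval tool and an image generator.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One retrieved item (hypothetical schema, not from the source)."""
    kind: str      # "text" or "image"
    content: str   # a text snippet or a reference-image path


@dataclass
class Trace:
    """Intermediate outputs of each stage, kept for inspection."""
    queries: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    grounded_prompt: str = ""
    image: str = ""


def understand(prompt: str) -> List[str]:
    # Stage 1 (placeholder): derive search queries from the prompt.
    # A real agent would use a model to pick out the entities it lacks knowledge of.
    return [prompt.strip()]


def re_describe(prompt: str, evidence: List[Evidence]) -> str:
    # Stage 3 (placeholder): fold retrieved text facts into the prompt.
    facts = [e.content for e in evidence if e.kind == "text"]
    if not facts:
        return prompt
    return prompt + " | grounded on: " + "; ".join(facts)


def run_pipeline(
    prompt: str,
    search: Callable[[str], List[Evidence]],
    synthesize: Callable[[str, List[Evidence]], str],
) -> Trace:
    trace = Trace()
    trace.queries = understand(prompt)                              # 1. prompt understanding
    for query in trace.queries:                                     # 2. multimodal evidence search
        trace.evidence.extend(search(query))
    trace.grounded_prompt = re_describe(prompt, trace.evidence)     # 3. grounded re-description
    trace.image = synthesize(trace.grounded_prompt, trace.evidence)  # 4. final synthesis
    return trace


def stub_search(query: str) -> List[Evidence]:
    # Assumption: stands in for a real text and image search tool.
    return [Evidence("text", f"fact about {query}"), Evidence("image", "ref_0.png")]


def stub_synthesize(grounded_prompt: str, evidence: List[Evidence]) -> str:
    # Assumption: stands in for an image generator; returns a description string.
    n_refs = sum(1 for e in evidence if e.kind == "image")
    return f"<image: {grounded_prompt!r}, refs={n_refs}>"


if __name__ == "__main__":
    result = run_pipeline("a rarely photographed landmark", stub_search, stub_synthesize)
    print(result.grounded_prompt)
    print(result.image)
```

Passing `search` and `synthesize` in as callables keeps the stage order explicit while leaving the retrieval backend and the generator unspecified, which is all the summary above commits to.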