Section 01
[Overview] Object-Centric Multimodal Vision: A New Paradigm from Scene Understanding to Precise Manipulation
This article reviews progress in integrating large multimodal models (LMMs) with object-centric vision, covering technical breakthroughs and challenges in four key directions: understanding, segmentation, editing, and generation. To address the limitations of traditional LMMs in object-level localization, fine-grained spatial reasoning, and controllable visual manipulation, it proposes an object-centric visual framework that extends their capabilities from the scene level to the object level. The article also covers modeling paradigms, learning strategies, evaluation protocols, and open challenges, and highlights the field's value both for academic research and for applications such as robotics and autonomous driving.