Section 01
LaMI: Enhancing LLM Visual Reasoning via Late Multi-Image Fusion
LaMI proposes a late multi-image fusion method that lets large language models trained only on text acquire strong visual reasoning capabilities without expensive multimodal training. The method outperforms traditional enhancement approaches on visual commonsense tasks while maintaining, or even improving, performance on text-only tasks, offering a new path for the development of multimodal AI.