Section 01
[Introduction] Vision Inference Former: Addressing the Visual Consistency Problem in Long Text Generation by Multimodal Large Models
This article introduces Vision Inference Former (VIF), a lightweight architectural module that addresses the 'visual forgetting' problem, in which visual information gradually fades as multimodal large language models (MLLMs) generate long text. VIF continuously injects visual semantics during the decoding phase, improving vision-language alignment with minimal additional computational overhead.