The rapid development of Multimodal Large Language Models (MLLMs) has opened up new possibilities for AI applications, enabling models to simultaneously understand and generate content spanning multiple modalities such as text, images, and video. However, these models face a serious reliability issue: hallucination, where the model generates content that appears plausible but is in fact inconsistent with the input.
The hallucination problem is particularly prominent in multimodal settings because models must integrate information from different modalities, and cross-modal alignment and grounding are prone to error. For example, when describing an image a model may mention objects or details that do not appear in it, or interpret the visual content in ways that contradict what is actually shown. Such errors not only degrade the user experience but also pose serious risks in safety-critical applications such as medical diagnosis and autonomous driving.
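To make this failure mode concrete, the sketch below illustrates one common way object-level hallucination is quantified (in the spirit of caption-hallucination metrics such as CHAIR): the objects mentioned in a generated caption are compared against the set of objects actually annotated in the image. The object vocabulary, synonym table, and example caption here are illustrative placeholders, not drawn from any specific benchmark.

```python
# Minimal sketch of object-level hallucination checking (CHAIR-style).
# The vocabulary, synonym map, and example data below are hypothetical.

# Ground-truth objects annotated in the image (hypothetical example).
image_objects = {"dog", "frisbee", "grass"}

# Maps surface forms in the caption to canonical object names.
SYNONYMS = {
    "dog": "dog", "puppy": "dog",
    "frisbee": "frisbee", "disc": "frisbee",
    "grass": "grass", "lawn": "grass",
    "ball": "ball", "person": "person",
}

def mentioned_objects(caption: str) -> set[str]:
    """Extract canonical object names mentioned in a caption."""
    words = caption.lower().replace(",", " ").replace(".", " ").split()
    return {SYNONYMS[w] for w in words if w in SYNONYMS}

def hallucination_rate(caption: str, gt_objects: set[str]) -> float:
    """Fraction of mentioned objects that are absent from the image."""
    mentioned = mentioned_objects(caption)
    if not mentioned:
        return 0.0
    hallucinated = mentioned - gt_objects
    return len(hallucinated) / len(mentioned)

caption = "A puppy chases a frisbee while a person watches from the lawn."
print(hallucination_rate(caption, image_objects))  # 0.25: "person" is hallucinated
```

In this toy example the caption mentions four objects (dog, frisbee, person, grass), one of which ("person") has no support in the image annotations, giving a hallucination rate of 0.25; real benchmarks apply the same idea with far richer object vocabularies and annotations.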