Section 01
Introduction: VISAGE Framework Suppresses Hallucinations in Multimodal Large Models
This article introduces VISAGE, a training-free decoding framework for multimodal diffusion large language models. It mitigates multimodal hallucinations by quantifying the spatial entropy of cross-attention distributions and penalizing token choices that lack visual grounding. Conventional decoding suffers from an objective mismatch: it maximizes text likelihood alone while ignoring visual support. VISAGE addresses this by calibrating the objective function at inference time, improving the model's fidelity to the visual content.
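The core idea can be sketched as follows. For each candidate token, the Shannon entropy of its (normalized) cross-attention map over image patches serves as a grounding signal: a diffuse map (high entropy) suggests weak visual support, while a peaked map (low entropy) suggests strong grounding. A minimal NumPy sketch of this entropy-penalized calibration is shown below; the function names, the `alpha` penalty weight, and the exact penalty form are illustrative assumptions, not VISAGE's published formulation.

```python
import numpy as np

def spatial_entropy(attn):
    """Shannon entropy of a cross-attention map over image patches.

    High entropy = diffuse attention (weak visual grounding);
    low entropy = peaked attention (strong visual grounding).
    """
    p = attn / attn.sum()          # normalize to a probability distribution
    p = p[p > 0]                   # drop zeros to avoid log(0)
    return float(-(p * np.log(p)).sum())

def calibrated_logits(logits, attn_maps, alpha=1.0):
    """Hypothetical calibration: subtract an entropy penalty from each
    candidate token's text-likelihood logit.

    logits:    (V,) raw logits for V candidate tokens
    attn_maps: (V, H, W) per-token cross-attention over H*W image patches
    alpha:     penalty strength (assumed hyperparameter)
    """
    penalties = np.array([spatial_entropy(m) for m in attn_maps])
    return logits - alpha * penalties

# Toy example: token 0 attends uniformly, token 1 attends to a single patch.
attn = np.zeros((2, 4, 4))
attn[0] = 1.0 / 16                 # uniform -> entropy log(16)
attn[1, 2, 2] = 1.0                # peaked  -> entropy 0
logits = np.array([2.0, 2.0])      # equal text likelihood
cal = calibrated_logits(logits, attn, alpha=0.5)
# after calibration, the visually grounded token (1) outranks the diffuse one (0)
```

With equal raw logits, the calibration breaks the tie in favor of the token whose attention is concentrated on specific image regions, which is the qualitative behavior the article describes.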