Section 01
[Main Floor/Introduction] Reconsidering the Superiority of Greedy Decoding in VQA: Key Insights Summary
Recent research suggests that, for visual question answering (VQA) with multimodal large language models (MLLMs), simple greedy decoding can outperform more complex stochastic sampling methods. Viewed through the lens of model calibration, the study identifies a fundamental difference between the sources of uncertainty in VQA and in open-ended text generation. VQA is a closed-ended task: its uncertainty is epistemic, arising from missing or ambiguous visual evidence, rather than from the need for diversity in text continuation.
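To make the contrast concrete, here is a minimal sketch of greedy decoding versus sampling on a single closed-ended answer. It is an illustration only, not the study's method: the answer candidates, the logits, and the helper names (`softmax`, `greedy_decode`, `sample_decode`) are all hypothetical, and the example assumes the model's highest-probability answer happens to be the correct one.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(probs):
    """Always return the index of the most probable answer."""
    return max(range(len(probs)), key=probs.__getitem__)


def sample_decode(probs, rng):
    """Draw one answer index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical closed-ended question: "What color is the car?"
answers = ["red", "blue", "green"]
logits = [2.0, 1.0, 0.5]  # assumed model scores, not real model output
correct = 0               # assumption: the top-scoring answer is correct

probs = softmax(logits)
rng = random.Random(0)    # fixed seed so the run is repeatable
trials = 10_000

greedy_hits = sum(greedy_decode(probs) == correct for _ in range(trials))
sample_hits = sum(sample_decode(probs, rng) == correct for _ in range(trials))

greedy_acc = greedy_hits / trials
sample_acc = sample_hits / trials

print(f"greedy accuracy:   {greedy_acc:.3f}")
print(f"sampling accuracy: {sample_acc:.3f}")
```

Under these assumptions, greedy decoding is right every time, while sampling is right only about as often as the probability the model assigns to the correct answer. The comparison reverses if the top-scoring answer is wrong, in which case greedy decoding is always wrong, so the advantage depends on how often the model's most confident answer is actually correct.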