Section 01
[Introduction] Visual Input Backfires? Unexpected Findings of Multimodal Models in Lexical Judgment
A new study found that adding real image context to vision-language models not only failed to improve the accuracy of lexical judgments but often impaired the consistency between model outputs and human ratings—especially when the visual evidence was less relevant. The research team uncovered the underlying mechanisms through probe analysis and attribution analysis, and proposed that simple instructions can alleviate this issue.
Source: Paper published on arXiv on May 26, 2026: Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery (Link: http://arxiv.org/abs/2605.27315v1)