# Visual Input Backfires? Unexpected Findings of Multimodal Models in Lexical Judgment Tasks

> A new study finds that giving vision-language models (VLMs) real image context does not improve the accuracy of their lexical judgments and often reduces the agreement between model outputs and human ratings, especially when the visual evidence is less relevant. The research team examined the underlying mechanisms through probe analysis and attribution analysis, and reports that a simple instruction can alleviate the problem.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-26T17:24:59.000Z
- Last activity: 2026-05-27T04:52:20.490Z
- Popularity: 146.5
- Keywords: vision-language models, multimodal learning, lexical concreteness, imagery ratings, model calibration, spurious correlations, prompt engineering
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-27315v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-27315v1
- Markdown source: floors_fallback

---

## [Introduction] Visual Input Backfires? Unexpected Findings of Multimodal Models in Lexical Judgment

Source: Paper published on arXiv on May 26, 2026: *Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery* (Link: http://arxiv.org/abs/2605.27315v1)

## Research Background: The Visual Dependency Hypothesis of Multimodal Models

The rise of vision-language models (VLMs) marks a major advance in machine understanding, and it is commonly assumed that visual input always enhances language comprehension. The research team asks a key question: can VLMs distinguish useful visual evidence from irrelevant image context? The answer matters for understanding and improving multimodal systems.

## Research Methods: Test Design for Lexical Concreteness and Imagery Ratings

The study uses human ratings of lexical "concreteness" and "imagery" (covering words from abstract ones like "freedom" to concrete ones like "apple") as a testbed for its core hypothesis: visual evidence should help when it is relevant to the word, but may harm when it is irrelevant. Probe analysis and canonical correlation analysis are used to examine how model representations change, and attribution analysis is used to trace how visual input influences the output.
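Agreement with human ratings is often quantified with a rank correlation. The sketch below computes Spearman's rho in plain Python for a text-only condition and an image condition. Whether the paper uses this exact metric is an assumption, the 1 to 5 rating scale is assumed, and all numbers are made-up toy values, not data from the study.

```python
from statistics import mean

def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every item tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy values for illustration only (hypothetical, not from the paper).
human = [4.9, 4.7, 3.1, 1.8, 1.5]       # human ratings for five words
text_only = [4.8, 4.5, 3.0, 2.0, 1.4]   # model ratings without an image
with_image = [4.6, 3.2, 4.0, 2.9, 3.5]  # model ratings with an image

rho_text = spearman(human, text_only)
rho_image = spearman(human, with_image)
print(rho_text, rho_image)
```

In this toy setup the text-only ratings preserve the human ordering exactly, while the image condition scrambles part of it, so its rho is lower. Running the same comparison separately on subsets grouped by image relevance would mirror the kind of breakdown the study describes.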

## Key Findings: Real Images Impair Model Judgment Consistency

The results were unexpected: real image context did not improve performance and instead reduced agreement with human ratings, especially in the subset where visual evidence was least relevant. Specific findings include:

1. The models' lexical representations shifted after images were introduced, deviating from the true distribution.
2. The models became more sensitive to irrelevant visual features in the images.
3. The target word's attributes became harder to recover from the representations.

These findings challenge the assumption that "more modalities are always better."
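"Recoverability" in probe analysis usually means how well a simple model can predict an attribute from a representation. The following is a minimal sketch of a linear probe trained by gradient descent and scored by held-out R². Everything here is an assumption for illustration: the representations are synthetic, the image condition is modeled crudely as additive noise, and none of the names or numbers come from the paper.

```python
import random

def fit_linear_probe(features, targets, lr=0.05, steps=500):
    """Fit weights and bias by full-batch gradient descent on squared error."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for d in range(dim):
                grad_w[d] += 2 * err * x[d] / n
            grad_b += 2 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def r_squared(w, b, features, targets):
    """Coefficient of determination of the probe on the given data."""
    preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in features]
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - p) ** 2 for y, p in zip(targets, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1 - ss_res / ss_tot

def probe_score(features, targets, n_train):
    """Train on the first n_train items, report R^2 on the remainder."""
    w, b = fit_linear_probe(features[:n_train], targets[:n_train])
    return r_squared(w, b, features[n_train:], targets[n_train:])

# Synthetic stand-ins (hypothetical): ratings are a linear function of the
# "text-only" representation; the "image" condition adds random noise to it.
rng = random.Random(0)
true_w = [1.0, -0.5, 0.8, 0.3]
text_reps = [[rng.gauss(0, 1) for _ in true_w] for _ in range(200)]
ratings = [sum(wi * xi for wi, xi in zip(true_w, x)) for x in text_reps]
image_reps = [[xi + rng.gauss(0, 2) for xi in x] for x in text_reps]

r2_text = probe_score(text_reps, ratings, n_train=150)
r2_image = probe_score(image_reps, ratings, n_train=150)
print(r2_text, r2_image)
```

A lower held-out score for the perturbed representations is what "decreased recoverability" would look like under these assumptions. With a real model, the feature vectors would instead be hidden states extracted with and without the image.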

## Mechanism Analysis: Why Visual Input Interferes with Language Judgment

Key mechanisms of visual interference:

1. Instruction-tuned VLMs lack calibration for the relevance of visual context, so they cannot decide when to rely on visuals and when to ignore them.
2. Visual representations dominate during fusion ("visual hegemony").
3. Spurious correlations in the training data are amplified.

## Solutions: Simple Text-Focused Instructions Alleviate Visual Interference

The study found that a simple intervention works: instructing the model at inference time to focus only on the text content can significantly reduce the performance degradation caused by visual input, especially in the vulnerable subsets. This indicates that the problem can be alleviated through prompt engineering without complex architectural modifications. It also points to the need for future models to adjust modality weights dynamically.
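A text-focused instruction of this kind can be added when the prompt is built. The sketch below is only an illustration: the instruction wording, the `build_prompt` helper, and the 1 to 5 scale are hypothetical, since the source does not quote the prompt the authors used.

```python
# Hypothetical wording: the source does not quote the exact instruction.
TEXT_FOCUS_INSTRUCTION = (
    "Base your rating only on the word itself and ignore the attached image."
)

def build_prompt(word, attribute, text_focus=False):
    """Build a rating prompt; the 1-5 scale is an assumed convention."""
    lines = [
        f"Rate the {attribute} of the word '{word}' on a scale from 1 to 5.",
        "Answer with a single number.",
    ]
    if text_focus:
        # Put the instruction first so it is read before the rating task.
        lines.insert(0, TEXT_FOCUS_INSTRUCTION)
    return "\n".join(lines)

prompt_plain = build_prompt("freedom", "concreteness")
prompt_focus = build_prompt("freedom", "concreteness", text_focus=True)
print(prompt_focus)
```

The resulting string would be sent together with the image to whichever VLM is being evaluated. Comparing ratings from `prompt_plain` and `prompt_focus` on the same word and image pairs isolates the effect of the instruction.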

## Research Significance and Future Directions

- Theoretical significance: the study challenges the assumption that vision always enhances language understanding and reveals complex trade-offs in multimodal fusion.
- Practical significance: it suggests reducing visual dependence in tasks that involve abstract concepts or are text-intensive.
- Future directions: design mechanisms for automatic visual relevance assessment, achieve dynamic balance between modalities, and train models that are sensitive to relevance.

## Conclusion: Multimodal Fusion Requires Fine Coordination—Less is More

Multimodal development is not a simple matter of adding modalities; the fusion of vision and language requires fine coordination. The current performance of VLMs shows that there is still a long way to go in building intelligent multimodal systems. Sometimes, letting the model focus on text is more effective than blindly adding visuals, an insight that is relevant to both academic research and practical applications.