# Visual Evidence Tracing for Multimodal Large Models: Interpretability Challenges in Autonomous Driving Scenarios

> The study proposes a multi-view visual question answering benchmark that requires models to identify the camera view that supports their answer. Experiments show that models often give plausible answers while relying on the wrong visual evidence, exposing grounding flaws in multimodal models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-08T15:39:06.000Z
- Last activity: 2026-06-09T03:52:49.815Z
- Popularity: 125.8
- Keywords: multimodal large models, visual evidence tracing, autonomous driving, interpretability, visual question answering, grounding
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2606-09644v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2606-09644v1

---

## [Introduction] Visual Evidence Tracing for Multimodal Large Models: Interpretability Challenges in Autonomous Driving Scenarios

The study addresses visual evidence tracing for multimodal large models in autonomous driving scenarios. It proposes a multi-view visual question answering benchmark that requires models to identify the camera view that supports their answer. Experiments found that models often give correct answers while relying on the wrong visual evidence. This exposes grounding flaws in multimodal models and is a serious warning for safety-critical applications.

## Background: Correct Answer ≠ Correct Reasoning, Special Challenges in Autonomous Driving Scenarios

Multimodal Large Language Models (MLLMs) have achieved impressive results on visual reasoning benchmarks, but a core question is often overlooked: does the model really 'look' at the right place when it gives a correct answer? In multi-view autonomous driving scenarios, vehicles carry multiple cameras (e.g., six synchronized views in the nuScenes dataset). A model may arrive at the correct answer from the wrong view, for example from reflections or shadows in a side-view camera. An answer grounded in the right view and one guessed from the wrong view look identical at the answer level, yet their safety implications are vastly different.

## Methodology: Multi-View Visual Question Answering Benchmark Design and Evaluation Setup

### Benchmark Design
The study constructs a multi-view visual question answering benchmark. The core task: given six synchronized camera views from nuScenes and a question, the model must both identify the correct camera view and answer the question. The data is built by automatic conflict mining followed by manual verification. It contains 122 conflicting question-answer pairs across 73 scenarios, covering causal, counterfactual, and other reasoning types, and each sample has a single, clearly defined 'golden view'.
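
To make the task format concrete, here is a minimal sketch of how one benchmark sample could be represented. The six camera names are the standard nuScenes camera channels; the class name `MultiViewSample` and all of its field names are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Dict

# The six camera channels provided by nuScenes.
CAMERAS = (
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
)


@dataclass(frozen=True)
class MultiViewSample:
    """Hypothetical container for one question-answer pair (field names are assumptions)."""
    scene_id: str
    question: str
    answer: str
    golden_view: str        # the single camera that supports the answer
    question_type: str      # e.g. "causal" or "counterfactual"
    images: Dict[str, str]  # camera name -> image path

    def __post_init__(self) -> None:
        # Every sample must carry all six synchronized views.
        if set(self.images) != set(CAMERAS):
            raise ValueError("images must contain exactly the six camera views")
        # The golden view must be one of those six cameras.
        if self.golden_view not in CAMERAS:
            raise ValueError(f"unknown golden view: {self.golden_view}")
```

Validating at construction time keeps malformed samples (a missing view, a misspelled camera name) out of the evaluation loop.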

### Evaluation Setup
1. **View Selection Setup**: Evaluate only the ability to select the correct camera view;
2. **Oracle QA Setup**: Assume the golden view is known, evaluate the QA ability under that view;
3. **Joint Prediction Setup**: Select the view and answer the question simultaneously (closest to real-world applications).

Answer evaluation: Exact match for structured answers; LLM-based judgment for open-ended answers.
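
The three setups and the two answer-scoring modes can be sketched as follows. This is an illustration under stated assumptions, not the study's evaluation code: the normalization rule in `exact_match` is a guess, and `judge` is a hypothetical callable standing in for whatever LLM-based judgment the authors use.

```python
from typing import Callable, Dict, Optional

# Hypothetical judge signature: (prediction, reference) -> verdict.
Judge = Callable[[str, str], bool]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(pred: str, gold: str) -> bool:
    """Scoring for structured answers."""
    return normalize(pred) == normalize(gold)


def score_answer(pred: str, gold: str, open_ended: bool = False,
                 judge: Optional[Judge] = None) -> bool:
    """Oracle QA setup: the model saw the golden view; only the answer is scored."""
    if not open_ended:
        return exact_match(pred, gold)
    if judge is None:
        raise ValueError("open-ended answers need a judge callable")
    return bool(judge(pred, gold))


def score_view_selection(pred_view: str, golden_view: str) -> bool:
    """View Selection setup: only the chosen camera is scored."""
    return pred_view == golden_view


def score_joint(pred_view: str, pred_answer: str, golden_view: str,
                gold_answer: str, open_ended: bool = False,
                judge: Optional[Judge] = None) -> Dict[str, bool]:
    """Joint Prediction setup: score the view and the answer separately."""
    view_ok = score_view_selection(pred_view, golden_view)
    answer_ok = score_answer(pred_answer, gold_answer, open_ended, judge)
    return {
        "view_correct": view_ok,
        "answer_correct": answer_ok,
        "both_correct": view_ok and answer_ok,
    }
```

Returning the view and answer verdicts as separate fields, rather than a single pass/fail flag, is what allows a correct answer paired with a wrong view to be counted later.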

## Evidence: Grounding Failures Are Prevalent, Models Rely on 'Informed Guesses'

The benchmark explicitly separates visual source identification from answer correctness, which exposes grounding failures that answer-only evaluation cannot detect. In joint prediction, a model may give the correct answer while selecting a view that has no causal relationship to that answer. In such cases the model is making an 'informed guess' rather than performing true visual reasoning.
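
One way to quantify this gap is to aggregate per-sample verdicts into separate rates. The sketch below assumes each result is a `(view_correct, answer_correct)` pair of booleans; the function name and metric names are hypothetical, and no numbers from the study are used.

```python
from typing import Dict, Iterable, Tuple


def grounding_report(results: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Aggregate (view_correct, answer_correct) pairs into grounding metrics."""
    pairs = list(results)
    n = len(pairs)
    if n == 0:
        raise ValueError("no results to aggregate")

    view_ok = sum(1 for v, _ in pairs if v)
    answer_ok = sum(1 for _, a in pairs if a)
    both_ok = sum(1 for v, a in pairs if v and a)
    # Correct answers whose selected view was wrong: the 'informed guesses'.
    ungrounded = answer_ok - both_ok

    return {
        "answer_accuracy": answer_ok / n,
        "view_accuracy": view_ok / n,
        "grounded_accuracy": both_ok / n,
        "ungrounded_share_of_correct": ungrounded / answer_ok if answer_ok else 0.0,
    }


# Toy data for illustration only, not results from the study.
toy = [(True, True), (False, True), (False, True), (True, False)]
report = grounding_report(toy)
```

On the toy data, answer accuracy is 0.75 while grounded accuracy is only 0.25, and two of the three correct answers come from a wrong view. An answer-only evaluation would hide exactly this kind of discrepancy.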

## Conclusion: Safety-Critical Applications Need to Emphasize Evidence Tracing, Not Just Accuracy

The study warns that in safety-critical applications like autonomous driving, good test-set performance alone is not a reason to trust a model's decisions. We must also verify that those decisions rest on the correct visual evidence.

## Recommendations: Future Research Directions and Technical Insights

Future research directions:
1. Develop multimodal architectures that explicitly model visual attention;
2. Design training objectives that encourage models to generate answers based on correct visual evidence;
3. Build more fine-grained evaluation metrics to quantify the causal relationship between visual evidence and answers.

Practical takeaway: alongside accuracy, interpretability and evidence tracing capabilities deserve equal emphasis.