Section 01
[Introduction] Visual Evidence Tracing for Multimodal Large Models: Interpretability Challenges in Autonomous Driving Scenarios
The study focuses on the visual evidence tracing problem of multimodal large models in autonomous driving scenarios, proposing a multi-view visual question answering benchmark that requires models to identify the correct camera view supporting the answer. Experiments found that models often give correct answers but based on incorrect visual evidence, exposing the grounding flaws of multimodal models, which has important warning implications for safety-critical applications.