Section 01
Introduction to the New Paradigm for Visual Language Model Evaluation: A Multi-Dimensional Auditing Framework Beyond Final Answer Accuracy
With the rapid development of Visual Language Models (VLMs) such as GPT-4V, Claude 3, and Gemini, evaluating their multimodal capabilities rigorously and comprehensively has become an urgent task. Traditional evaluations focus only on final-answer accuracy, ignoring key dimensions such as the degree of visual dependency, hallucination, and the consistency between generated content and image evidence. This article introduces a multimodal reasoning auditing pipeline that enables a more comprehensive and in-depth evaluation of VLMs through visual dependency testing, hallucination detection, and claim-level faithfulness scoring.
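To make the three audit dimensions concrete, here is a minimal sketch of what such a pipeline's scoring step might look like. All names (`AuditReport`, `audit`) and the string/set inputs are hypothetical illustrations, not the article's actual implementation: visual dependency is approximated by whether the answer changes when the image is withheld, hallucination by claims lacking support in the image evidence, and faithfulness by the fraction of claims that are supported.

```python
from dataclasses import dataclass, field

@dataclass
class AuditReport:
    """Hypothetical container for the three audit dimensions."""
    visual_dependency: bool            # did the image change the model's answer?
    hallucinated_claims: list = field(default_factory=list)  # claims without image support
    faithfulness: float = 1.0          # fraction of claims grounded in the evidence

def audit(answer_with_image: str,
          answer_without_image: str,
          claims: set,
          image_evidence: set) -> AuditReport:
    """Toy audit over plain strings/sets; a real pipeline would extract
    claims from model output and verify them against the image."""
    supported = [c for c in claims if c in image_evidence]
    unsupported = sorted(c for c in claims if c not in image_evidence)
    return AuditReport(
        visual_dependency=(answer_with_image != answer_without_image),
        hallucinated_claims=unsupported,
        faithfulness=len(supported) / len(claims) if claims else 1.0,
    )

report = audit(
    answer_with_image="a red car parked by a tree",
    answer_without_image="a vehicle of unknown color",
    claims={"car is red", "car is parked", "two people inside"},
    image_evidence={"car is red", "car is parked", "tree on the left"},
)
print(report.visual_dependency)    # True: the image changed the answer
print(report.hallucinated_claims)  # ['two people inside']
```

In practice, claim extraction and evidence matching would themselves be model-driven; the point of the sketch is only that each dimension yields its own score, rather than collapsing everything into answer accuracy.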