The design philosophy of M-JudgeBench is to decompose multimodal evaluation capability into multiple fine-grained dimensions rather than reducing it to a single comprehensive score. This competency-oriented approach identifies a model's strengths and weaknesses more precisely than an aggregate metric can.
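The idea of per-dimension scoring can be sketched as a small data structure. This is a minimal illustration; the dimension names below are placeholders, not M-JudgeBench's actual taxonomy.

```python
# Sketch of competency-oriented scoring: per-dimension scores instead of one
# aggregate number. Dimension names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JudgeProfile:
    """Holds one score per evaluation dimension, each in [0, 1]."""
    scores: Dict[str, float] = field(default_factory=dict)

    def overall(self) -> float:
        # An aggregate can still be derived when needed, but the
        # per-dimension breakdown is what exposes weaknesses.
        return sum(self.scores.values()) / len(self.scores)

    def weakest(self) -> str:
        # The dimension with the lowest score, i.e. the model's weak spot.
        return min(self.scores, key=self.scores.get)


profile = JudgeProfile(scores={
    "visual_grounding": 0.82,
    "reasoning_consistency": 0.61,
    "instruction_following": 0.90,
})
```

A single-score benchmark would report only `profile.overall()`; the competency-oriented view also surfaces `profile.weakest()`.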
For data construction, M-JudgeBench adopts two error-sample generation strategies. The first constructs Result-error pairs: different models perform inference under varying temperature and inference-length settings, the diverse outputs are collected, and sample pairs containing errors are filtered out. Because the samples span many decoding configurations, this method covers the varied error patterns that models exhibit under different inference strategies.
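The Result-error pipeline can be sketched as: sample over a grid of decoding settings, grade each output against the gold answer, and pair correct outputs with erroneous ones. Everything below is an assumption for illustration: `generate` is a mock standing in for a real model call, and the sampling grid and grading rule are not the benchmark's actual pipeline.

```python
# Sketch of Result-error pair construction (all specifics are assumptions).
import itertools
import random


def generate(question: str, temperature: float, max_tokens: int, seed: int) -> str:
    """Mock model call: higher temperature makes a wrong answer more likely.
    (max_tokens is accepted but unused by this mock.)"""
    rng = random.Random(seed + int(temperature * 10))  # deterministic mock
    return "4" if rng.random() > temperature * 0.5 else "5"


def build_result_error_pairs(question: str, gold: str):
    """Sample under a grid of decoding settings, then keep (correct, wrong) pairs."""
    outputs = []
    for temp, length in itertools.product([0.2, 0.7, 1.2], [256, 1024]):
        for seed in range(4):
            outputs.append((generate(question, temp, length, seed), temp, length))
    correct = [o for o in outputs if o[0] == gold]
    wrong = [o for o in outputs if o[0] != gold]
    # zip truncates to the shorter list, so every pair contrasts one
    # correct output with one erroneous output.
    return list(zip(correct, wrong))


pairs = build_result_error_pairs("What is 2 + 2?", gold="4")
```

The key property is that each pair juxtaposes a correct and an incorrect result for the same question, which is what a judge model is then trained or evaluated to distinguish.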
The second generates Process-error data: controlled noise injection deliberately introduces errors into the reasoning process while keeping the final answer correct. This type of data is particularly important for training models to recognize the subtle "correct answer but wrong reasoning" scenario, which is key to improving the robustness of evaluation models.
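Controlled noise injection can be sketched as corrupting one intermediate step of a reasoning chain while leaving the final-answer step untouched. The corruption rule below (perturbing a number in a randomly chosen step) is an illustrative assumption, not the benchmark's actual technique.

```python
# Sketch of Process-error generation via noise injection (rule is an assumption).
import random
import re


def inject_process_error(steps: list, seed: int = 0) -> list:
    """Corrupt one intermediate reasoning step; the final answer stays correct."""
    rng = random.Random(seed)
    corrupted = list(steps)
    idx = rng.randrange(len(steps) - 1)  # never pick the final-answer step
    # Perturb the first number found in the chosen step by a nonzero offset.
    corrupted[idx] = re.sub(
        r"\d+",
        lambda m: str(int(m.group()) + rng.randint(1, 9)),
        corrupted[idx],
        count=1,
    )
    return corrupted


chain = [
    "Step 1: the image shows 3 red apples and 2 green apples.",
    "Step 2: 3 + 2 = 5 apples in total.",
    "Answer: 5",
]
noisy = inject_process_error(chain)
```

The resulting sample answers correctly ("Answer: 5") but contains a flawed step, which is exactly the "correct answer, wrong reasoning" case a robust judge must detect.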