Zing Forum

Reading

Safety Evaluation of Multimodal Large Models: A Study on Qwen2-VL and LLaVA Security Based on MM-SafetyBench

A systematic evaluation study on the safety of multimodal large language models (MLLMs), using the MM-SafetyBench benchmark published at ECCV 2024 to analyze the response patterns of Qwen2-VL and LLaVA to harmful queries, as well as the impact of instruction fine-tuning on safety.

多模态大模型安全性评估MM-SafetyBenchQwen2-VLLLaVAECCV2024LlamaGuard指令微调AI安全红队测试
Published 2026-05-02 02:38Recent activity 2026-05-02 02:53Estimated read 7 min
Safety Evaluation of Multimodal Large Models: A Study on Qwen2-VL and LLaVA Security Based on MM-SafetyBench
1

Section 01

Introduction to Multimodal Large Model Safety Research: MM-SafetyBench Evaluation of Qwen2-VL and LLaVA

This study conducts a systematic evaluation of the safety of multimodal large language models (MLLMs), using the MM-SafetyBench benchmark published at ECCV 2024 to analyze the response patterns of two open-source models—Alibaba Cloud's Qwen2-VL and UC Berkeley's LLaVA—to harmful queries, with a focus on exploring the impact of instruction fine-tuning on model safety.

2

Section 02

Research Background and Motivation

With the widespread application of MLLMs like GPT-4V and Claude3, their safety issues have become increasingly prominent. Existing research mostly focuses on safety alignment of text-only LLMs, but the safety boundaries of MLLMs lack systematic understanding. MM-SafetyBench (accepted at ECCV 2024) is the first comprehensive multimodal safety evaluation benchmark, containing 13 risk scenarios and 5040 image-text pairs, and it found that MLLMs are easily misled by malicious images. Based on this, this study conducts an in-depth evaluation of Qwen2-VL and LLaVA, focusing on the safety impact of instruction fine-tuning.

3

Section 03

Evaluation Framework and Methods

Two-Dimensional Evaluation: Accuracy dimension uses the TextVQA dataset (image text understanding task); safety dimension uses MM-SafetyBench (13 risk scenarios: illegal activities, hate speech, malware generation, personal injury, economic damage, fraud, pornographic content, political lobbying, privacy infringement, legal advice, financial advice, health consultation, government decision-making).

Evolution of Evaluation Methods: Early use of keyword matching (e.g., rejection markers like "sorry" or "cannot") to quickly judge response safety, but this is prone to misjudgment; finally, Meta's LlamaGuard-3-8B was used as a safety judge (more accurate due to context awareness).

Experimental Design: Compare baseline models (Qwen2-VL-2B-Instruct, LLaVA-1.5-7b-hf pre-trained version) with QLoRA fine-tuned models (trained using the LLaVA-Instruct-150K dataset on A100 40GB GPUs).

4

Section 04

Key Findings

  1. Improved Instruction-Following Ability: Fine-tuned models show significantly improved performance in following complex instructions.
  2. Trade-off Between Accuracy and Response Length: Fine-tuning may lead to a decrease in TextVQA accuracy because models tend to generate verbose responses (over-explaining).
  3. Strong Correlation Between Safety and Task Type: For opinion-based tasks (e.g., political stance, value judgment), fine-tuned models are safer; for procedural tasks (e.g., step-by-step instructions like "how to make a bomb"), fine-tuned models are more likely to generate harmful content (enhanced helpfulness leads to reduced vigilance).
5

Section 05

Technical Implementation Details

Operating Environment: Narval HPC cluster of the Digital Research Alliance of Canada (DRAC), using A100 40GB GPUs.

Workflow: Login nodes are only used for downloading models and datasets; compute nodes submit training and evaluation jobs via the Slurm scheduling system (separated design to avoid runtime errors).

Dataset Structure: MM-SafetyBench image-text pairs are divided into three categories: Kind1 (StableDiffusion-generated images + rephrased questions), Kind2 (SD images with spelling errors + standard rephrased questions), Kind3 (images with spelling errors + standard rephrased questions), to test the model's robustness to image quality and text perturbations.

6

Section 06

Research Significance and Impact

This project has been cited by multiple subsequent studies, including VHELM (2024) and studies like SPA-VL, Jailbreak_GPT4o, BAP, Visual-RolePlay, JailBreakV-28K, AdaShield, ECSO, LVLM-LP, MLLM-Protector, collectively forming the basic ecosystem of multimodal AI safety research.

7

Section 07

Practical Insights and Recommendations

Engineering Teams: 1. Be cautious with fine-tuning (may reduce safety in specific scenarios); 2. Task classification governance (implement additional safety filtering layers for procedural and tool-based queries); 3. Continuous monitoring (establish an automated safety audit process based on LlamaGuard); 4. Red team testing (regularly conduct adversarial testing using MM-SafetyBench).

Researchers: Provide a methodological reference for systematically evaluating MLLM safety.