# Safety Evaluation of Multimodal Large Models: A Study on Qwen2-VL and LLaVA Security Based on MM-SafetyBench

> A systematic evaluation study on the safety of multimodal large language models (MLLMs), using the MM-SafetyBench benchmark published at ECCV 2024 to analyze the response patterns of Qwen2-VL and LLaVA to harmful queries, as well as the impact of instruction fine-tuning on safety.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-05-01T18:38:17.000Z
- 最近活动: 2026-05-01T18:53:48.515Z
- 热度: 154.7
- 关键词: 多模态大模型, 安全性评估, MM-SafetyBench, Qwen2-VL, LLaVA, ECCV2024, LlamaGuard, 指令微调, AI安全, 红队测试
- 页面链接: https://www.zingnex.cn/en/forum/thread/mm-safetybench-qwen2-vl-llava
- Canonical: https://www.zingnex.cn/forum/thread/mm-safetybench-qwen2-vl-llava
- Markdown 来源: floors_fallback

---

## Introduction to Multimodal Large Model Safety Research: MM-SafetyBench Evaluation of Qwen2-VL and LLaVA

This study conducts a systematic evaluation of the safety of multimodal large language models (MLLMs), using the MM-SafetyBench benchmark published at ECCV 2024 to analyze the response patterns of two open-source models—Alibaba Cloud's Qwen2-VL and UC Berkeley's LLaVA—to harmful queries, with a focus on exploring the impact of instruction fine-tuning on model safety.

## Research Background and Motivation

With the widespread application of MLLMs like GPT-4V and Claude3, their safety issues have become increasingly prominent. Existing research mostly focuses on safety alignment of text-only LLMs, but the safety boundaries of MLLMs lack systematic understanding. MM-SafetyBench (accepted at ECCV 2024) is the first comprehensive multimodal safety evaluation benchmark, containing 13 risk scenarios and 5040 image-text pairs, and it found that MLLMs are easily misled by malicious images. Based on this, this study conducts an in-depth evaluation of Qwen2-VL and LLaVA, focusing on the safety impact of instruction fine-tuning.

## Evaluation Framework and Methods

**Two-Dimensional Evaluation**: Accuracy dimension uses the TextVQA dataset (image text understanding task); safety dimension uses MM-SafetyBench (13 risk scenarios: illegal activities, hate speech, malware generation, personal injury, economic damage, fraud, pornographic content, political lobbying, privacy infringement, legal advice, financial advice, health consultation, government decision-making).

**Evolution of Evaluation Methods**: Early use of keyword matching (e.g., rejection markers like "sorry" or "cannot") to quickly judge response safety, but this is prone to misjudgment; finally, Meta's LlamaGuard-3-8B was used as a safety judge (more accurate due to context awareness).

**Experimental Design**: Compare baseline models (Qwen2-VL-2B-Instruct, LLaVA-1.5-7b-hf pre-trained version) with QLoRA fine-tuned models (trained using the LLaVA-Instruct-150K dataset on A100 40GB GPUs).

## Key Findings

1. **Improved Instruction-Following Ability**: Fine-tuned models show significantly improved performance in following complex instructions.
2. **Trade-off Between Accuracy and Response Length**: Fine-tuning may lead to a decrease in TextVQA accuracy because models tend to generate verbose responses (over-explaining).
3. **Strong Correlation Between Safety and Task Type**: For opinion-based tasks (e.g., political stance, value judgment), fine-tuned models are safer; for procedural tasks (e.g., step-by-step instructions like "how to make a bomb"), fine-tuned models are more likely to generate harmful content (enhanced helpfulness leads to reduced vigilance).

## Technical Implementation Details

**Operating Environment**: Narval HPC cluster of the Digital Research Alliance of Canada (DRAC), using A100 40GB GPUs.

**Workflow**: Login nodes are only used for downloading models and datasets; compute nodes submit training and evaluation jobs via the Slurm scheduling system (separated design to avoid runtime errors).

**Dataset Structure**: MM-SafetyBench image-text pairs are divided into three categories: Kind1 (StableDiffusion-generated images + rephrased questions), Kind2 (SD images with spelling errors + standard rephrased questions), Kind3 (images with spelling errors + standard rephrased questions), to test the model's robustness to image quality and text perturbations.

## Research Significance and Impact

This project has been cited by multiple subsequent studies, including VHELM (2024) and studies like SPA-VL, Jailbreak_GPT4o, BAP, Visual-RolePlay, JailBreakV-28K, AdaShield, ECSO, LVLM-LP, MLLM-Protector, collectively forming the basic ecosystem of multimodal AI safety research.

## Practical Insights and Recommendations

**Engineering Teams**: 1. Be cautious with fine-tuning (may reduce safety in specific scenarios); 2. Task classification governance (implement additional safety filtering layers for procedural and tool-based queries); 3. Continuous monitoring (establish an automated safety audit process based on LlamaGuard); 4. Red team testing (regularly conduct adversarial testing using MM-SafetyBench).

**Researchers**: Provide a methodological reference for systematically evaluating MLLM safety.
