Reading

Safety Evaluation of Multimodal Large Models: A Study on Qwen2-VL and LLaVA Security Based on MM-SafetyBench

A systematic evaluation study on the safety of multimodal large language models (MLLMs), using the MM-SafetyBench benchmark published at ECCV 2024 to analyze the response patterns of Qwen2-VL and LLaVA to harmful queries, as well as the impact of instruction fine-tuning on safety.

多模态大模型安全性评估MM-SafetyBenchQwen2-VLLLaVAECCV2024LlamaGuard指令微调AI安全红队测试

Published 2026-05-02 02:38Recent activity 2026-05-02 02:53Estimated read 7 min

Safety Evaluation of Multimodal Large Models: A Study on Qwen2-VL and LLaVA Security Based on MM-SafetyBench

Section 01

Introduction to Multimodal Large Model Safety Research: MM-SafetyBench Evaluation of Qwen2-VL and LLaVA

This study conducts a systematic evaluation of the safety of multimodal large language models (MLLMs), using the MM-SafetyBench benchmark published at ECCV 2024 to analyze the response patterns of two open-source models—Alibaba Cloud's Qwen2-VL and UC Berkeley's LLaVA—to harmful queries, with a focus on exploring the impact of instruction fine-tuning on model safety.

Section 02

Research Background and Motivation

With the widespread application of MLLMs like GPT-4V and Claude3, their safety issues have become increasingly prominent. Existing research mostly focuses on safety alignment of text-only LLMs, but the safety boundaries of MLLMs lack systematic understanding. MM-SafetyBench (accepted at ECCV 2024) is the first comprehensive multimodal safety evaluation benchmark, containing 13 risk scenarios and 5040 image-text pairs, and it found that MLLMs are easily misled by malicious images. Based on this, this study conducts an in-depth evaluation of Qwen2-VL and LLaVA, focusing on the safety impact of instruction fine-tuning.

Section 03

Evaluation Framework and Methods

Two-Dimensional Evaluation: Accuracy dimension uses the TextVQA dataset (image text understanding task); safety dimension uses MM-SafetyBench (13 risk scenarios: illegal activities, hate speech, malware generation, personal injury, economic damage, fraud, pornographic content, political lobbying, privacy infringement, legal advice, financial advice, health consultation, government decision-making).

Evolution of Evaluation Methods: Early use of keyword matching (e.g., rejection markers like "sorry" or "cannot") to quickly judge response safety, but this is prone to misjudgment; finally, Meta's LlamaGuard-3-8B was used as a safety judge (more accurate due to context awareness).

Experimental Design: Compare baseline models (Qwen2-VL-2B-Instruct, LLaVA-1.5-7b-hf pre-trained version) with QLoRA fine-tuned models (trained using the LLaVA-Instruct-150K dataset on A100 40GB GPUs).

Section 04

Key Findings

Improved Instruction-Following Ability: Fine-tuned models show significantly improved performance in following complex instructions.
Trade-off Between Accuracy and Response Length: Fine-tuning may lead to a decrease in TextVQA accuracy because models tend to generate verbose responses (over-explaining).
Strong Correlation Between Safety and Task Type: For opinion-based tasks (e.g., political stance, value judgment), fine-tuned models are safer; for procedural tasks (e.g., step-by-step instructions like "how to make a bomb"), fine-tuned models are more likely to generate harmful content (enhanced helpfulness leads to reduced vigilance).

Section 05

Technical Implementation Details

Operating Environment: Narval HPC cluster of the Digital Research Alliance of Canada (DRAC), using A100 40GB GPUs.

Workflow: Login nodes are only used for downloading models and datasets; compute nodes submit training and evaluation jobs via the Slurm scheduling system (separated design to avoid runtime errors).

Dataset Structure: MM-SafetyBench image-text pairs are divided into three categories: Kind1 (StableDiffusion-generated images + rephrased questions), Kind2 (SD images with spelling errors + standard rephrased questions), Kind3 (images with spelling errors + standard rephrased questions), to test the model's robustness to image quality and text perturbations.

Section 06

Research Significance and Impact

This project has been cited by multiple subsequent studies, including VHELM (2024) and studies like SPA-VL, Jailbreak_GPT4o, BAP, Visual-RolePlay, JailBreakV-28K, AdaShield, ECSO, LVLM-LP, MLLM-Protector, collectively forming the basic ecosystem of multimodal AI safety research.

Section 07

Practical Insights and Recommendations

Engineering Teams: 1. Be cautious with fine-tuning (may reduce safety in specific scenarios); 2. Task classification governance (implement additional safety filtering layers for procedural and tool-based queries); 3. Continuous monitoring (establish an automated safety audit process based on LlamaGuard); 4. Red team testing (regularly conduct adversarial testing using MM-SafetyBench).

Researchers: Provide a methodological reference for systematically evaluating MLLM safety.

Safety Evaluation of Multimodal Large Models: A Study on Qwen2-VL and LLaVA Security Based on MM-SafetyBench

Introduction to Multimodal Large Model Safety Research: MM-SafetyBench Evaluation of Qwen2-VL and LLaVA

Research Background and Motivation

Evaluation Framework and Methods

Key Findings

Technical Implementation Details

Research Significance and Impact

Practical Insights and Recommendations

Continue Reading

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Folkering OS: When the Operating System Itself Is AI—A Self-Evolving Bare-Metal Rust System

LLM-assisted-analysis: A New Approach to Detecting Logical Vulnerabilities in Smart Contracts Using Large Language Models

Building Modern LLM from Scratch: A Tutorial-level Implementation of Llama-style Language Model