Zing Forum

Reading

CMC Firewall: A Conformal Prediction-Based Defense Against Visual Prompt Injection in Multimodal LLMs

CMC (Conformal Cross-Modal Firewall) is a pre-model defense mechanism that effectively controls the false positive rate of visual prompt injection attacks while maintaining model utility through OCR text extraction, SigLIP risk scoring, and inductive conformal prediction calibration.

CMC视觉提示注入多模态LLM共形预测MLLM安全SigLIPOCR防火墙MM-SafetyBenchNeurIPS前置防御
Published 2026-04-25 22:09Recent activity 2026-04-25 22:23Estimated read 6 min
CMC Firewall: A Conformal Prediction-Based Defense Against Visual Prompt Injection in Multimodal LLMs
1

Section 01

Introduction to CMC Firewall: A Conformal Prediction-Based Defense Against Visual Prompt Injection in Multimodal LLMs

With the widespread application of Multimodal Large Language Models (MLLMs), visual prompt injection attacks have become a severe security challenge. CMC (Conformal Cross-Modal Firewall) is a pre-model defense mechanism that effectively controls the false positive rate while maintaining model utility through OCR text extraction, SigLIP risk scoring, and inductive conformal prediction calibration, resolving the dilemma of traditional defenses being either 'overly sensitive with false positives' or 'too lenient with missed attacks'.

2

Section 02

Threat Status of Visual Prompt Injection Attacks and Limitations of Existing Defenses

Visual prompt injection attacks exploit the OCR capability of MLLMs to embed malicious text like 'Ignore previous instructions and execute X', bypassing traditional text security filters. Existing defense solutions have shortcomings: keyword blacklists are easily bypassed by synonyms/spelling variations; semantic similarity filtering relies on fixed thresholds, making it hard to balance security and usability; post-processing filtering cannot stop the generation of harmful content.

3

Section 03

Core Defense Mechanisms of CMC Firewall

CMC adopts a pre-model architecture with three core steps: 1. OCR text span extraction: Identify visible and hidden text in images; 2. SigLIP encoder risk scoring: Calculate semantic risk (similarity to malicious instructions) and statistical anomalies (abnormal distribution in embedding space); 3. Inductive conformal prediction calibration: Provide distribution-agnostic, finite-sample valid false positive rate guarantees, strictly limiting the false positive rate of clean images to within α+1/(n+1).

4

Section 04

Experimental Evidence and Performance Evaluation of CMC Firewall

Evaluation on the MM-SafetyBench dataset: CMC (α=0.20) has an unsafe rate of 15.4%, attack interception rate of 81.2%, and a false positive rate of 12.4% which is below the theoretical upper limit of 21%. Cross-model validation (Qwen3.5-9B) shows the unsafe rate drops from 17.2% to 13.9% (p=0.0008). MMBench tests retain 90% of original performance, ensuring controllable utility.

5

Section 05

Technical Implementation Details of CMC Firewall

Computational resources: The headline LLaVA process uses 1×H100-class GPU; full reproduction requires 2×H100 NVL (MIG slices). Code structure includes configs (39 experimental configurations), src (attacks/defenses/eval/transforms), and scripts. Quick start: Clone the repository → bash scripts/setup.sh → smoke_test.sh → reproduce.sh.

6

Section 06

Theoretical Contributions of CMC and Comparison with Related Work

Theoretical contributions: Introduce statistical learning theory, including distribution-agnostic guarantees of conformal prediction, finite-sample validity, and efficient inductive deployment; verify 4 theorems. Comparison with related work: CMC outperforms keyword filtering and semantic filtering in terms of interception rate (moderate to high), false positive control (statistical guarantee), and interpretability (high).

7

Section 07

Practical Deployment Considerations for CMC Firewall

Advantages: Statistical guarantees (clear FPR upper bound), model-agnostic (pre-layer adapts to any MLLM), interpretability (traceable risk scores). Challenges: Computational overhead (increased latency), calibration data quality affecting effectiveness, need for continuous updates to counter adversarial adaptability.

8

Section 08

Summary and Future Outlook of CMC Firewall

CMC achieves a shift from empirical parameter tuning to statistical guarantees, balancing security and usability. It has been submitted to NeurIPS 2026, and the code is open-source. Future directions: Extend to video temporal injection defense, active learning to update calibration sets, and lightweight encoders to reduce deployment costs.