Section 01
[Introduction] ProjLens Reveals Core Mechanisms of Backdoor Attacks in Projection Layers of Multimodal Large Models
ProjLens is an interpretability framework for Multimodal Large Language Models (MLLMs), designed to reveal how backdoor attacks operate in their projection layers. It reports three key findings. First, even ordinary downstream fine-tuning that updates only the projection layer can expose the model to backdoor injection. Second, the backdoor is encoded in a low-rank subspace of the projection layer's parameters, with no dedicated trigger neurons. Third, activation depends on a linear relationship between the magnitude of the semantic shift and the input norm: poisoned samples trigger the backdoor because their norm is large. These findings provide a critical basis for MLLM security defenses.
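The second and third findings can be illustrated with a toy example. The sketch below is not ProjLens code and uses no real model weights; it assumes a hypothetical rank-1 weight update `delta_w = u v^T` standing in for the backdoor component of a projection layer, with `u`, `v`, and `direction` chosen arbitrarily. It shows that such an update has rank 1 and that, for inputs along a fixed direction, the size of the shift it induces grows in proportion to the input norm.

```python
import math


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in x))


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def matrix_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination (adequate for this tiny example)."""
    rows = [row[:] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current rank with a usable pivot.
        pivot = next(
            (r for r in range(rank, len(rows)) if abs(rows[r][col]) > tol),
            None,
        )
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        pivot_value = rows[rank][col]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / pivot_value
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


# Hypothetical vectors, chosen only for illustration.
u = [1.0, 0.0, 2.0]        # output-side direction of the assumed update
v = [0.5, -1.0, 0.0, 2.0]  # input-side direction of the assumed update
delta_w = outer(u, v)      # rank-1 stand-in for the backdoor component


def shift_magnitude(x):
    """Norm of the output shift that delta_w adds for input x."""
    return norm(matvec(delta_w, x))


direction = [0.6, 0.0, 0.0, 0.8]  # unit-norm input direction (assumed)

if __name__ == "__main__":
    print("rank of delta_w:", matrix_rank(delta_w))
    for scale in (1.0, 2.0, 4.0):
        x = [scale * d for d in direction]
        print(f"input norm {norm(x):.2f} -> shift {shift_magnitude(x):.4f}")
```

In this toy setting, doubling the input norm doubles the shift, so a sample with an unusually large norm is displaced much further than a typical one. Whether and how this carries over to a trained projector is what the ProjLens analysis itself addresses; the sketch only makes the linear scaling concrete.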