Section 01
Introduction: Study on the Unified Mechanism of Harmful Content Generation in Large Language Models
This study uses targeted weight pruning to show that harmful content generation in large language models depends on a compact set of weights. This set is shared across types of harmful content and is separate from the weights that support benign capabilities. Safety alignment reshapes the set at the level of internal representations, making it more compact. The study also identifies causal relationships between the ability to generate harmful content and the ability to recognize it, and between weight compression and emergent misalignment. Together, these findings provide a new theoretical basis and a practical direction for AI safety interventions.
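To make the idea of targeted weight pruning concrete, the following is a minimal sketch and not the study's actual procedure. It assumes that each weight already has an importance score on harmful prompts and another on benign prompts, and it zeroes the weights whose harmful score most exceeds their benign score. The scoring rule, the toy numbers, and the names `select_weights_to_prune`, `prune`, and `num_to_prune` are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: the scoring rule, the toy data, and every name
# here are assumptions, not the method used in the study.

def select_weights_to_prune(harm_scores, benign_scores, num_to_prune):
    """Return sorted indices of the weights with the largest harm-minus-benign gap."""
    if len(harm_scores) != len(benign_scores):
        raise ValueError("score lists must have equal length")
    if not 0 <= num_to_prune <= len(harm_scores):
        raise ValueError("num_to_prune is out of range")
    # A large positive gap means the weight matters for harmful outputs
    # but comparatively little for benign ones.
    gaps = [h - b for h, b in zip(harm_scores, benign_scores)]
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return sorted(ranked[:num_to_prune])


def prune(weights, indices):
    """Return a copy of weights with the selected entries set to zero."""
    pruned = list(weights)
    for i in indices:
        pruned[i] = 0.0
    return pruned


# Toy values standing in for one flattened weight vector and its scores.
weights = [0.8, -0.3, 0.5, 0.1, -0.9, 0.2]
harm_scores = [0.9, 0.1, 0.2, 0.05, 0.8, 0.1]
benign_scores = [0.1, 0.7, 0.6, 0.05, 0.1, 0.3]

indices = select_weights_to_prune(harm_scores, benign_scores, num_to_prune=2)
pruned_weights = prune(weights, indices)
print(indices)
print(pruned_weights)
```

In this toy setup only the weights with a high harmful score and a low benign score are removed, which mirrors the claim that the harmful set is compact and separate from benign capabilities. How importance is actually measured in the study is not specified here.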