Section 01
[Introduction] Single-Method Safety Assessment Cannot Fully Detect Risks of Personality-Injected Large Language Models
The study found that two personality injection methods, prompt engineering and activation manipulation, exhibit completely different vulnerability patterns. Testing with only one of them will miss the model's main failure modes, so the study proposes a comprehensive multi-method assessment. This article covers the background, methodology, evidence, and conclusions in turn.