Section 01
[Introduction] Activation Consistency Training: A New Line of Defense for Reasoning Models Against Attacks
The study proposes Activation Consistency Training (ACT), a method that defends reasoning models against adversarial jailbreak and prompt injection attacks by supervising the internal representations of large language models, with minimal impact on benign inputs. The work comes from the arXiv paper 'Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training', published in May 2026. Its core idea is to enforce consistency constraints at the level of the model's internal activations rather than only at its outputs. The paper reports that this outperforms output-level consistency training (BCT) and offers strong interpretability.
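To make the contrast between activation-level and output-level consistency concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the paper's implementation: the function names, the mean-squared-error distance over per-layer activations, and the KL divergence used for the output-level baseline are all hypothetical choices made for this example, and the toy vectors stand in for real hidden states and logits.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def activation_consistency_loss(clean_acts: Sequence[Vector],
                                attacked_acts: Sequence[Vector]) -> float:
    """Mean squared distance between per-layer activations.

    Assumption: clean_acts[i] and attacked_acts[i] are the hidden states of
    layer i for the clean prompt and for the same prompt with an attack
    wrapped around it. The clean activations act as a fixed target.
    """
    if not clean_acts or len(clean_acts) != len(attacked_acts):
        raise ValueError("need the same non-zero number of layers")
    per_layer = []
    for clean, attacked in zip(clean_acts, attacked_acts):
        if not clean or len(clean) != len(attacked):
            raise ValueError("layer widths must match and be non-zero")
        squared = sum((a - c) ** 2 for c, a in zip(clean, attacked))
        per_layer.append(squared / len(clean))
    return sum(per_layer) / len(per_layer)


def softmax(logits: Vector) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_consistency_loss(clean_logits: Vector,
                            attacked_logits: Vector) -> float:
    """KL(clean || attacked) over output distributions.

    Assumption: this stands in for an output-level consistency objective;
    it only sees the final distribution, not the internal states.
    """
    p = softmax(clean_logits)
    q = softmax(attacked_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


# Toy data: two layers of width three. Only the second layer drifts
# when the attack is applied.
clean_acts = [[0.2, -0.1, 0.5], [1.0, 0.0, -0.3]]
attacked_acts = [[0.2, -0.1, 0.5], [0.4, 0.6, -0.3]]

act_loss = activation_consistency_loss(clean_acts, attacked_acts)
out_loss = output_consistency_loss([2.0, 0.5, -1.0], [1.0, 1.5, -1.0])
print(f"activation-level loss: {act_loss:.4f}")
print(f"output-level loss:     {out_loss:.4f}")
```

In a real training setup the activations would be captured from the model itself and the loss would be minimized by gradient descent. The sketch only shows where each objective applies its constraint: on internal states in one case, on the final distribution in the other.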