Section 01
Introduction: RLHF's 'Neutral Mask'—Preservation of Deep Partisan Structure Under Surface Alignment
Core Insights
The study finds that Reinforcement Learning from Human Feedback (RLHF) does not eliminate partisan structure in large language models. Instead, it produces surface-neutral outputs by compressing variance, while the underlying partisan geometric structure remains intact and can be reactivated by specific prompts. In other words, RLHF achieves surface alignment rather than deep alignment, a distinction that bears directly on AI safety and alignment work.
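The idea of compressing variance while leaving geometric structure intact can be illustrated with a toy sketch. This is not the paper's method or data: the vectors are synthetic, and `compress`, `top_direction`, the dimension, and the shrink factor of 0.1 are all hypothetical choices made for illustration. The sketch shrinks every point toward the mean and then checks that the dominant direction separating the two groups is unchanged.

```python
import numpy as np

def top_direction(x):
    """Unit vector of the first principal component of x (rows are samples)."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def compress(x, factor):
    """Shrink each sample toward the mean by `factor`.

    Hypothetical stand-in for variance compression; not the paper's procedure.
    """
    mean = x.mean(axis=0)
    return mean + factor * (x - mean)

# Synthetic data (an assumption): two groups separated along one hidden axis.
rng = np.random.default_rng(0)
dim, n = 16, 200
axis = rng.normal(size=dim)
axis /= np.linalg.norm(axis)
labels = np.repeat([-1.0, 1.0], n // 2)
base = 3.0 * labels[:, None] * axis + 0.3 * rng.normal(size=(n, dim))

masked = compress(base, 0.1)

# Total variance drops by factor**2, so the outputs look far more uniform.
base_var = base.var(axis=0).sum()
masked_var = masked.var(axis=0).sum()

# The dominant direction is the same up to sign, so the structure survives.
alignment = abs(top_direction(base) @ top_direction(masked))

# Projecting onto the hidden axis still separates the two groups.
proj = (masked - masked.mean(axis=0)) @ axis

print("variance ratio:", masked_var / base_var)
print("direction alignment:", alignment)
print("groups still separable:", bool(np.all(np.sign(proj) == labels)))
```

In this toy setting, a probe that rescales the projection recovers the original group split, which is the sense in which a compressed signal can be "reactivated". Whether the real model behaves like a uniform shrink is an assumption of the sketch, not a claim from the study.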
Research Subjects and Source
- Research subjects: the Llama 3.1 8B base model and its RLHF-trained Instruct version
- Source: arXiv paper "The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model" (published June 8, 2026)