Neutral Mask: How RLHF Achieves Surface Alignment While Preserving Deep Partisan Structure

The study reveals that RLHF does not eliminate partisan structural tendencies in large language models; instead, it generates surface-neutral outputs by compressing variance, while the underlying partisan geometric structure remains intact and can be reactivated via specific prompts.

Tags: RLHF, Model Alignment, Causal Analysis, Sparse Autoencoders, AI Safety, Large Language Models, Representation Learning
Published 2026-06-09 01:00 | Recent activity 2026-06-09 13:28 | Estimated read 10 min

Section 01

Introduction: RLHF's 'Neutral Mask'—Preservation of Deep Partisan Structure Under Surface Alignment

Core Insights

The study finds that Reinforcement Learning from Human Feedback (RLHF) does not eliminate partisan structure in a large language model. Instead, it produces surface-neutral outputs by compressing the variance of partisan signals, while the underlying partisan geometry stays intact and can be reactivated by specific prompts. In other words, RLHF achieves surface alignment rather than deep alignment, a distinction with direct consequences for AI safety and alignment research.

Research Subjects and Source

  • Research subjects: the Llama 3.1 8B base model and its RLHF-trained Instruct version
  • Source: arXiv paper "The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model" (published June 8, 2026)

Section 02

Research Background: Ideal vs. Reality of Alignment Training—RLHF's Functional Compliance vs. Deep Alignment

Goals of Alignment Training

Alignment training for large language models aims to make them safe and useful. RLHF is the primary alignment mechanism: it trains the model on human preference feedback so that its behavior aligns with human values.

Core Questions

What values does RLHF actually encode, and does it produce functional compliance or deep alignment? A growing body of evidence suggests that RLHF may achieve only functional compliance (outputs that meet human expectations) rather than deep alignment (genuinely changed internal representations).


Section 03

Research Methods: Causal Analysis Design Focused on Partisan Orientation

Reasons for Choosing Partisan Orientation

  • Clear and measurable: Political orientation has a well-defined dimensional structure
  • Social importance: Political bias is a core concern in AI safety
  • Rich theoretical foundation: Political science provides mature analytical frameworks

Research Subjects

The study compares the Llama 3.1 8B base model with its RLHF-trained Instruct version to isolate the effect of RLHF.


Section 04

Key Findings: RLHF's 'Neutral Mask' Mechanism—Surface Neutrality and Deep Structure Preservation

Finding 1: Partisan Structure Remains Intact

RLHF does not remove the structured partisan directions in the base model; the underlying geometric structure is fully preserved. The model still 'knows' the difference between liberals and conservatives but no longer expresses it directly.
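
One way to make "the geometric structure is preserved" concrete is to estimate a partisan direction in each model and compare the two. The sketch below is only an illustration: it assumes a difference-of-means direction (the summary does not say how the paper defines its directions), and it uses synthetic arrays in place of real hidden states, so `axis`, `synthetic_acts`, and the offsets are hypothetical. It shows the measurement, not the paper's result.

```python
import numpy as np

def partisan_direction(left_acts, right_acts):
    """Unit vector pointing from the mean 'left' activation to the mean 'right' one."""
    diff = right_acts.mean(axis=0) - left_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins for hidden states; real use would extract them from both models.
rng = np.random.default_rng(0)
d_model, n = 64, 500
axis = np.zeros(d_model)
axis[0] = 1.0  # hypothetical ground-truth partisan axis (an assumption)

def synthetic_acts(offset):
    # Gaussian noise plus a shift along the hypothetical partisan axis
    return rng.normal(size=(n, d_model)) + offset * axis

# Both "models" are built with the same separation, mimicking preserved geometry.
base_dir = partisan_direction(synthetic_acts(-2.0), synthetic_acts(2.0))
instruct_dir = partisan_direction(synthetic_acts(-2.0), synthetic_acts(2.0))

similarity = cosine(base_dir, instruct_dir)
print(f"cosine(base, instruct) = {similarity:.3f}")
```

A cosine similarity close to 1 would indicate that the same direction separates the two groups in both models, which is what "structure remains intact" would look like under this assumed definition.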

Finding 2: Variance Compression Generates Neutral Outputs

RLHF achieves surface neutrality by compressing the variance of partisan signals. In the base model, partisan features are occasionally activated and lead to biased outputs. In the Instruct model, those features are suppressed, so outputs tend toward neutrality while the underlying structure remains unchanged.
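
Variance compression can be quantified as the ratio of projection variances along a fixed partisan direction. The following is a minimal sketch under stated assumptions: the direction is random, the activations are synthetic, and the standard deviations 2.0 and 0.5 are arbitrary illustrative values, not figures from the paper.

```python
import numpy as np

def projection_variance(acts, direction):
    """Variance of activations projected onto a (normalized) direction."""
    unit = direction / np.linalg.norm(direction)
    return float(np.var(acts @ unit))

rng = np.random.default_rng(1)
d_model, n = 32, 2000
direction = rng.normal(size=d_model)  # hypothetical partisan direction
unit = direction / np.linalg.norm(direction)

def synthetic_acts(partisan_std):
    # Background noise with its component along `unit` removed,
    # plus a partisan signal whose spread we control directly.
    background = rng.normal(size=(n, d_model))
    background -= np.outer(background @ unit, unit)
    signal = rng.normal(0.0, partisan_std, size=n)
    return background + np.outer(signal, unit)

base_acts = synthetic_acts(2.0)      # wide spread: partisan signal often expressed
instruct_acts = synthetic_acts(0.5)  # narrow spread: same axis, compressed variance

compression_ratio = (
    projection_variance(instruct_acts, direction)
    / projection_variance(base_acts, direction)
)
print(f"variance ratio (instruct / base) = {compression_ratio:.3f}")
```

A ratio well below 1 with an unchanged direction is the pattern the finding describes: the axis is still there, but activations spread far less along it.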

Finding 3: Causal Disconnection of Strategy-Encoded Features

Sparse autoencoder decomposition shows that strategy-encoded features in the Instruct model are completely deactivated, which cuts the causal path from partisan geometry to output generation. Feature-steering experiments confirm this: activating partisan features in the Instruct model does not produce biased outputs.
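
For readers unfamiliar with the technique, the sketch below shows what a sparse autoencoder (SAE) decomposition and a feature-level intervention look like mechanically. It is not the paper's implementation: `ToySAE` is a hypothetical class with random, untrained weights, and the clamp-and-add-back form of steering is one common convention assumed here.

```python
import numpy as np

class ToySAE:
    """Untrained stand-in for a sparse autoencoder over hidden states (hypothetical)."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # ReLU keeps feature activations non-negative
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

def steer(sae, x, feature_idx, value):
    """Clamp one feature to `value` and add the change in reconstruction back to x."""
    f = sae.encode(x)
    f_steered = f.copy()
    f_steered[..., feature_idx] = value
    return x + sae.decode(f_steered) - sae.decode(f)

sae = ToySAE(d_model=16, n_features=64)
x = np.random.default_rng(2).normal(size=16)  # one synthetic hidden state
features = sae.encode(x)
steered = steer(sae, x, feature_idx=3, value=5.0)

# The intervention moves x only along the chosen feature's decoder row.
shift = steered - x
print(np.allclose(shift, (5.0 - features[3]) * sae.W_dec[3]))
```

In a real experiment the steered hidden state would be written back into the model's forward pass and the generated text scored for bias; the finding above corresponds to that score staying flat in the Instruct model.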

Surface Neutrality vs. Structural Neutrality

RLHF achieves surface neutrality (cutting off the causal path, resulting in neutral outputs while keeping the underlying structure intact) rather than structural neutrality (eliminating partisan representations).

Reactivation Mechanism

Prompting the model to infer the user's identity and amplify it reactivates its partisan generation ability, and outputs again show bias.


Section 05

Theoretical Significance and Methodological Contributions: Limitations of RLHF and New Perspectives on Alignment Evaluation

Theoretical Significance

  • Disconnection rather than elimination: RLHF isolates value-laden structures instead of deleting them; the model still 'knows' relevant information, and the structure can be reactivated
  • Generalizability: This mechanism may apply to other value domains such as harmful content and biases
  • Vulnerability: Aligned model behavior may be more fragile than its outputs suggest, and may be easily bypassed by jailbreak attacks or prompt engineering

Methodological Contributions

  • Causal representation analysis: Compare models before and after RLHF, decompose features, and use causal interventions to verify causal relationships
  • New dimension for alignment evaluation: Evaluate not only output neutrality but also changes in internal representations and robustness
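
The difference between a structure that exists and a structure that causally drives outputs can be illustrated with a directional ablation. This is a toy sketch, assuming a linear readout and synthetic activations; `readout_connected` and `readout_disconnected` are hypothetical stand-ins for downstream computation and are not taken from the paper.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

def causal_effect(acts, direction, readout):
    """Mean absolute change in a scalar readout when `direction` is ablated."""
    before = acts @ readout
    after = ablate_direction(acts, direction) @ readout
    return float(np.mean(np.abs(before - after)))

rng = np.random.default_rng(3)
acts = rng.normal(size=(200, 16))  # synthetic hidden states

direction = np.zeros(16)
direction[0] = 1.0  # hypothetical partisan direction

readout_connected = np.zeros(16)
readout_connected[[0, 1]] = [1.0, 0.5]  # reads from the partisan direction

readout_disconnected = np.zeros(16)
readout_disconnected[1] = 1.0  # ignores the partisan direction entirely

effect_connected = causal_effect(acts, direction, readout_connected)
effect_disconnected = causal_effect(acts, direction, readout_disconnected)
print(f"connected: {effect_connected:.3f}, disconnected: {effect_disconnected:.3f}")
```

In both cases the activations carry the same variation along the direction. Only the connected readout changes when it is ablated, which is the sense in which a representation can be present yet causally disconnected from outputs.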

Section 06

Practical Implications: Recommendations for Model Developers, Policymakers, and Users

For Developers

  • RLHF is a starting point, not an endpoint; explore deep alignment methods that directly intervene in internal representations
  • Strengthen robustness testing to verify alignment effects

For Policymakers

  • Current safety assessments may underestimate potential risks; require proof of deep alignment rather than just functional compliance
  • Improve transparency of model internal structures

For Users

  • Be alert to potential biases behind the model's surface neutrality
  • Understand the fragility of alignment mechanisms and use AI responsibly

Section 07

Limitations and Future Research: Constraints of Single Model and Domain, and Directions for Subsequent Exploration

Current Limitations

  • Single model: Only Llama 3.1 8B is used; behavior of other models is unknown
  • Single domain: Only partisan politics is studied; other value domains are not covered
  • Static analysis: Dynamic effects of continuous learning are not considered

Future Directions

  • Cross-model validation of research findings
  • Multi-domain analysis of RLHF's behavior in areas like harmful content and biases
  • Study the formation of alignment mechanisms during training
  • Explore new training methods for deep alignment

Section 08

Conclusion: Warning from RLHF's 'Neutral Mask'—Need to Go Beyond Surface Alignment to Explore Deep Structural Changes

Core conclusion of the study: RLHF produces surface neutrality rather than deep alignment. It achieves functional compliance by disconnecting partisan structures rather than eliminating them, so the underlying geometric structure remains intact and can be reactivated.

The lesson is that behavioral change is not representational change, and functional compliance is not value internalization. Building truly safe and trustworthy AI systems requires going beyond RLHF to training techniques that change the internal structure of models. If RLHF relies on the same 'disconnection' mechanism in every value domain, confidence in AI safety needs to be recalibrated: model outputs may seem safe while the model internally still 'knows' how to be unsafe, covered only by a thin mask.