# Steering to Safety: Inference-Time Safety Alignment with Linear Probing and Gated Sparse Autoencoders

> This project explores inference-time safety alignment methods for large language models without retraining. By combining supervised linear probing and unsupervised gated sparse autoencoders, it identifies and manipulates interpretable hidden layer atoms related to safety on a frozen RoBERTa backbone network.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-05T13:39:54.000Z
- Last activity: 2026-04-05T13:49:58.639Z
- Popularity: 150.8
- Keywords: safety alignment, large language models, inference-time steering, sparse autoencoders, linear probing, jailbreak protection, explainable AI, activation engineering
- Page link: https://www.zingnex.cn/en/forum/thread/steering-to-safety
- Canonical: https://www.zingnex.cn/forum/thread/steering-to-safety
- Markdown source: floors_fallback

---

## [Introduction] Steering to Safety: A New Method for Inference-Time Safety Alignment

This project explores inference-time safety alignment for large language models without retraining. It combines supervised linear probing with unsupervised Gated Sparse Autoencoders (GSAE) to identify and manipulate interpretable, safety-related hidden-layer features ("atoms") on a frozen RoBERTa backbone. The core advantage is that safety policies can be adjusted dynamically after deployment, without the cost of retraining the model.

## Research Background: Challenges and New Ideas for LLM Safety Alignment

Safety problems in large language models, such as generating harmful content or being "jailbroken", hinder their use in high-stakes settings. Traditional approaches rely on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), which require significant resources and leave the model's behavior fixed once training ends. This project proposes **inference-time safety alignment**: instead of retraining, it guides model behavior in real time by manipulating internal activations, which enables post-deployment safety updates and per-deployment policies.

## Core Technologies: Synergy Between Linear Probing and GSAE

The project combines two complementary techniques:
1. **Gated Sparse Autoencoder (GSAE)**: Decouples gating from magnitude estimation (π(x) decides which features are active, r(x) sets their intensity), which avoids the shrinkage bias of an L1-penalized sparse autoencoder. On RoBERTa-base it learns 49152 hidden features (64 × 768, where 768 is the RoBERTa-base hidden size), from which interpretable semantic atoms are identified.
2. **Linear Probing**: Trains a logistic regression classifier on frozen RoBERTa activations and derives a steering vector v from it. During inference, the hidden state is shifted as h' = h ± λ·v, where the sign selects whether the safety-related tendency is enhanced or suppressed and λ sets the strength.
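
The two components above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's implementation: the encoder form f(x) = 1[π(x) > 0] · ReLU(r(x)) follows the usual gated-SAE formulation, and the names `gated_sae_encode`, `steer`, `W_gate`, `W_mag`, as well as all toy weights and the direction `v`, are hypothetical.

```python
def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_sae_encode(x, W_gate, b_gate, W_mag, b_mag):
    """Gated encoder sketch: f(x) = 1[pi(x) > 0] * ReLU(r(x)).

    pi(x) decides WHICH atoms fire; r(x) decides HOW STRONGLY they fire.
    """
    pi = [p + b for p, b in zip(matvec(W_gate, x), b_gate)]
    r = [m + b for m, b in zip(matvec(W_mag, x), b_mag)]
    return [max(rv, 0.0) if pv > 0 else 0.0 for pv, rv in zip(pi, r)]


def steer(h, v, lam):
    """Apply h' = h + lam * v; pass a negative lam for the 'minus' case."""
    return [hi + lam * vi for hi, vi in zip(h, v)]


# Toy sizes (2-d activation, 3 atoms); every value here is hypothetical.
W_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_gate = [-0.5, -0.5, -0.5]
W_mag = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
b_mag = [0.0, 0.0, 0.0]

h = [1.0, 0.2]
atoms = gated_sae_encode(h, W_gate, b_gate, W_mag, b_mag)
# Atom 1 has a positive magnitude (0.4) but its gate is closed, so it stays 0.

v = [0.6, 0.8]  # stand-in for a unit-norm direction taken from a probe
h_more = steer(h, v, 0.5)   # push the activation along v
h_less = steer(h, v, -0.5)  # push the activation against v
```

In this toy run only atom 0 is active, which shows why the gate and the magnitude are kept separate: sparsity is decided by π(x) alone, so the magnitude path does not have to be shrunk toward zero to keep the code sparse.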

## Datasets and Experimental Design

Seven datasets cover the different dimensions of the study:

| Dataset | Scale | Purpose |
|--------|------|------|
| BeaverTails | 300k+ Q&A pairs | Harmfulness probe training |
| CivilComments | 1.8M comments | Toxicity probe training |
| GoEmotions | 58k Reddit comments | Emotional atom discovery |
| EmpatheticDialogues | 25k dialogues | Synergy effect of empathy manipulation |
| CrowS-Pairs | 1508 pairs | Out-of-distribution bias evaluation |
| StereoSet | 2106 samples | Stereotype evaluation |
| Wikipedia | 2M articles | GSAE pre-training corpus |

Data loading follows a "download once and cache" strategy, with custom handling for the EmpatheticDialogues tar archive.
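
A "download once and cache" loader can be sketched as follows. This is a minimal illustration, not the project's loader: the function name `cached_download`, the hashed file-name scheme, and the injectable `fetch` argument are all assumptions made for the example, and the URL is a placeholder.

```python
import hashlib
import tempfile
from pathlib import Path
from urllib.request import urlopen


def cached_download(url, cache_dir, fetch=None):
    """Return a local path for `url`, downloading only if it is not cached."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, stable file name.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if target.exists():
        return target  # cache hit: no network access
    if fetch is None:
        def fetch(u):
            with urlopen(u) as resp:
                return resp.read()
    data = fetch(url)
    # Write to a temporary name first, then rename, so an interrupted
    # download never leaves a partial file under the final name.
    tmp = target.with_suffix(".part")
    tmp.write_bytes(data)
    tmp.replace(target)
    return target


# Demo with a fake fetcher so the example runs offline.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"payload"


with tempfile.TemporaryDirectory() as tmp_dir:
    url = "https://example.com/data.tar.gz"  # placeholder URL
    first = cached_download(url, tmp_dir, fetch=fake_fetch)
    second = cached_download(url, tmp_dir, fetch=fake_fetch)
    same_path = first == second
    cached_bytes = second.read_bytes()
```

The second call returns the cached file without invoking the fetcher again. An archive such as the EmpatheticDialogues tar file would need an extra extraction step after download, which is omitted here.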

## Key Findings: Synergy Effects and Safety Trade-offs

1. **51 Safety Atoms**: Out of the 49152 GSAE features, 51 are selected as safety-related atoms. Each atom's association with the safety labels is quantified by its point-biserial correlation and effect size.
2. **Strategy Comparison**: Linear probing alone achieves the best overall toxicity reduction, while the probe + SAE combination performs best on jailbreak compliance rate. The two are complementary: the probe supplies a global direction and the SAE atoms supply local, feature-level adjustments.
3. **Risk Warning**: Unfiltered SAE atoms may increase the probability of unsafe responses, so atoms must be screened and validated before they are used for steering.
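
The screening step in the first finding can be illustrated with a small sketch. It assumes that "effect size" means Cohen's d and that an atom is kept when both statistics exceed a threshold; the thresholds, the atom names, and the toy activations below are hypothetical and are not taken from the project.

```python
import math


def split_by_label(values, labels):
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg


def point_biserial(values, labels):
    """Correlation between a binary label and a continuous activation."""
    pos, neg = split_by_label(values, labels)
    n = len(values)
    if not pos or not neg:
        return 0.0
    mean_all = sum(values) / n
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if std_all == 0.0:
        return 0.0  # a constant feature carries no label information
    gap = sum(pos) / len(pos) - sum(neg) / len(neg)
    return gap / std_all * math.sqrt(len(pos) * len(neg) / n ** 2)


def cohens_d(values, labels):
    """Standardised mean difference using the pooled sample variance."""
    pos, neg = split_by_label(values, labels)
    if len(pos) < 2 or len(neg) < 2:
        return 0.0

    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    pooled = math.sqrt(
        ((len(pos) - 1) * sample_var(pos) + (len(neg) - 1) * sample_var(neg))
        / (len(pos) + len(neg) - 2)
    )
    if pooled == 0.0:
        return 0.0
    return (sum(pos) / len(pos) - sum(neg) / len(neg)) / pooled


# Toy activations for two hypothetical atoms over six labelled examples.
labels = [1, 1, 1, 0, 0, 0]  # 1 = unsafe, 0 = safe (assumed convention)
features = {
    "atom_a": [0.9, 0.8, 1.0, 0.1, 0.0, 0.2],  # tracks the label
    "atom_b": [0.5, 0.1, 0.3, 0.4, 0.2, 0.3],  # unrelated to the label
}
R_MIN, D_MIN = 0.3, 0.5  # hypothetical screening thresholds
selected = [
    name for name, acts in features.items()
    if abs(point_biserial(acts, labels)) >= R_MIN
    and abs(cohens_d(acts, labels)) >= D_MIN
]
```

Here only `atom_a` survives the screen. Requiring both statistics guards against features whose correlation is driven by a few extreme activations, which fits the risk warning above about unfiltered atoms.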

## Evaluation Dimensions and Engineering Practices

- **Evaluation Dimensions**: Fluency (Pseudo Log-Likelihood, PLL), Effectiveness (ΔP), Safety (Jailbreak Compliance Rate), and Generalization (Out-of-distribution Bias).
- **Engineering Optimizations**: Memory-mapped shard validation, streaming statistics, Float16 compression, robust checkpointing, and an I/O strategy that computes locally and defers data transfers.
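
The fluency and effectiveness metrics named above can be sketched as follows. Pseudo log-likelihood is the standard sum of log P(token | all other tokens), computed by masking one position at a time. Reading ΔP as the change in a probe's predicted probability after steering is an assumption, since the thread does not define it. The scorer `toy_masked_log_prob` is a hypothetical stand-in for a real masked language model.

```python
import math


def pseudo_log_likelihood(tokens, masked_log_prob):
    """Sum of log P(tokens[i] | all other tokens), one mask at a time."""
    return sum(masked_log_prob(tokens, i) for i in range(len(tokens)))


def delta_p(prob_before, prob_after):
    """Change in probe probability caused by steering (assumed meaning)."""
    return prob_after - prob_before


# Hypothetical stand-in for a masked LM: a fixed unigram table that
# ignores context. A real scorer would mask position i and query the model.
UNIGRAM = {"the": 0.5, "cat": 0.25}


def toy_masked_log_prob(tokens, i):
    return math.log(UNIGRAM[tokens[i]])


pll = pseudo_log_likelihood(["the", "cat"], toy_masked_log_prob)
effect = delta_p(prob_before=0.80, prob_after=0.35)
```

A steering method is useful when ΔP moves in the intended direction while PLL stays close to its unsteered value, meaning the text remains fluent.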

## Research Significance and Future Directions

- **Significance**: Demonstrates the feasibility of inference-time safety alignment, which offers flexibility (dynamic adjustment), interpretability (SAE atoms), composability, and cost-effectiveness.
- **Challenges**: The risks of unfiltered atoms, trade-offs between steering strategies, and room for improvement in generalization.
- **Future Directions**: Extending the approach to GPT-scale models, automating atom screening, covering multilingual scenarios, and exploring how steering vectors relate to model architecture.
