Section 01
[Introduction] Steering to Safety: A New Method for Inference-Time Safety Alignment
This project explores inference-time safety alignment for large language models, requiring no retraining. By combining supervised linear probing with unsupervised Gated Sparse Autoencoders (GSAEs), it identifies and manipulates interpretable, safety-related atoms in the hidden representations of a frozen RoBERTa backbone. The core advantage is that safety policies can be adjusted dynamically after deployment, avoiding costly retraining and offering a new path toward LLM safety.
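As a concrete illustration, the sketch below shows what steering a frozen RoBERTa backbone along a learned safety direction at inference time might look like. This is a minimal sketch, not the project's implementation: the layer index, steering strength, and the `safety_dir` vector (random here) are illustrative assumptions; in the actual method the direction would come from the trained linear probe or a selected GSAE atom.

```python
# Minimal sketch: inference-time activation steering on a frozen RoBERTa.
# LAYER, ALPHA, and safety_dir are hypothetical placeholders.
import torch
from transformers import RobertaModel, RobertaTokenizer

LAYER = 8    # hypothetical layer whose hidden states we steer
ALPHA = 4.0  # hypothetical steering strength

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()  # frozen backbone: no parameter updates
for p in model.parameters():
    p.requires_grad_(False)

# Stand-in for a safety direction; in practice this would be loaded from
# the trained linear probe's weights or a GSAE decoder atom.
safety_dir = torch.randn(model.config.hidden_size)
safety_dir = safety_dir / safety_dir.norm()

def steering_hook(module, inputs, output):
    # Shift every token's hidden state along the safety direction.
    hidden = output[0]
    return (hidden + ALPHA * safety_dir.to(hidden.dtype),) + output[1:]

handle = model.encoder.layer[LAYER].register_forward_hook(steering_hook)

inputs = tokenizer("How do I pick a strong password?", return_tensors="pt")
with torch.no_grad():
    steered = model(**inputs).last_hidden_state

handle.remove()  # detach the hook to restore the unsteered model
```

Because the intervention lives in a forward hook rather than in the weights, the steering strength ALPHA can be changed per request, which is what allows safety policy to be adjusted after deployment without touching the backbone.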