Steering to Safety: Inference-Time Safety Alignment with Linear Probing and Gated Sparse Autoencoders

This project explores inference-time safety alignment methods for large language models without retraining. By combining supervised linear probing and unsupervised gated sparse autoencoders, it identifies and manipulates interpretable hidden layer atoms related to safety on a frozen RoBERTa backbone network.

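safety alignment, large language models, inference-time steering, sparse autoencoders, linear probes, jailbreak defense, interpretable AI, activation engineering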

Published 2026-04-05 21:39 | Recent activity 2026-04-05 21:49 | Estimated read 6 min

Section 01

[Introduction] Steering to Safety: A New Method for Inference-Time Safety Alignment

This project explores inference-time safety alignment for large language models without retraining. It combines supervised linear probing with unsupervised Gated Sparse Autoencoders (GSAE) to identify and steer interpretable, safety-related hidden-layer atoms on a frozen RoBERTa backbone. The core advantage is that safety policies can be adjusted dynamically after deployment, without costly retraining, which offers an alternative route to LLM safety.

Section 02

Research Background: Challenges and New Ideas for LLM Safety Alignment

Safety problems in large language models, such as generating harmful content or being "jailbroken", hinder their use in high-stakes settings. Traditional methods rely on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), which require significant resources and fix the model's behavior once training ends. This project proposes inference-time safety alignment instead: without retraining, it guides model behavior in real time by manipulating internal activations, which enables post-deployment safety updates and personalized strategies.

Section 03

Core Technologies: Synergy Between Linear Probing and GSAE

The project uses two complementary technologies:

Gated Sparse Autoencoder (GSAE): Decouples gating from magnitude estimation (π(x) decides which features are active, r(x) sets their strength), which mitigates the shrinkage bias of L1-penalized sparse autoencoders. On RoBERTa-base it learns 49,152 hidden features, from which interpretable semantic atoms are identified.
Linear Probing: Trains a logistic regression classifier on frozen RoBERTa activations and extracts a steering vector v from it. During inference, the hidden state is shifted as h' = h ± λ·v to enhance or suppress the safety-related tendency, with λ setting the strength.
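
The two mechanisms above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names, the use of separate weight matrices for the gate and magnitude paths, and the toy dimensions are all assumptions, and plain Python lists stand in for the tensors a real implementation would use. The steering vector v is assumed to be given, for example as the probe's weight vector.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def gated_encode(x: Vector, w_gate: Matrix, b_gate: Vector,
                 w_mag: Matrix, b_mag: Vector) -> Vector:
    """Gated encoder: pi(x) decides WHICH features fire, r(x) decides HOW strongly.

    Assumption: the two paths use independent weights (hypothetical layout).
    """
    pi = [p + b for p, b in zip(matvec(w_gate, x), b_gate)]  # gate pre-activations
    r = [m + b for m, b in zip(matvec(w_mag, x), b_mag)]     # magnitude pre-activations
    # A feature is active only if its gate is open; its value is ReLU(r).
    return [max(r_i, 0.0) if pi_i > 0.0 else 0.0 for pi_i, r_i in zip(pi, r)]


def steer(h: Vector, v: Vector, lam: float) -> Vector:
    """Apply h' = h + lam * v. A negative lam suppresses the direction."""
    return [h_i + lam * v_i for h_i, v_i in zip(h, v)]


if __name__ == "__main__":
    x = [1.0, -2.0]
    w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    w_mag = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    features = gated_encode(x, w_gate, [0.0] * 3, w_mag, [0.5] * 3)
    print(features)                                # only the first gate is open
    print(steer([1.0, 2.0], [0.5, -0.5], lam=2.0))   # enhance
    print(steer([1.0, 2.0], [0.5, -0.5], lam=-2.0))  # suppress
```

Keeping the gate and the magnitude separate means the sparsity decision does not have to push active feature values toward zero, which is the property the GSAE description above relies on.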

Section 04

Datasets and Experimental Design

Seven datasets are used to cover multiple dimensions:

Dataset	Scale	Purpose
BeaverTails	300k+ Q&A pairs	Harmfulness probe training
CivilComments	1.8M comments	Toxicity probe training
GoEmotions	58k Reddit comments	Emotional atom discovery
EmpatheticDialogues	25k dialogues	Synergy effect of empathy manipulation
CrowS-Pairs	1508 pairs	Out-of-distribution bias evaluation
StereoSet	2106 samples	Stereotype evaluation
Wikipedia	2M articles	GSAE pre-training corpus

Data loading follows a "download once and cache" strategy, with custom handling for the EmpatheticDialogues tar archive.
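
A "download once and cache" loader might look roughly like the sketch below. The function name `cached_fetch`, the `.part` suffix, and the idea of passing the download step in as a callable are illustrative assumptions; the project's actual loader and its tar handling are not shown in the source.

```python
from pathlib import Path
from typing import Callable


def cached_fetch(name: str, cache_dir: Path, fetch: Callable[[], bytes]) -> Path:
    """Return the cached file for `name`, calling `fetch` only on a cache miss.

    `fetch` is a hypothetical zero-argument callable that downloads the raw bytes.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Write to a temporary name first, then rename, so an interrupted
        # download is never mistaken for a complete cached file.
        tmp = target.with_name(target.name + ".part")
        tmp.write_bytes(fetch())
        tmp.replace(target)
    return target
```

Later calls with the same `name` return the existing path without invoking `fetch` again, so repeated experiment runs do not re-download the corpora.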

Section 05

Key Findings: Synergy Effects and Safety Trade-offs

51 Safety Atoms: Selected from 49152 features, these safety-related atoms are quantified via point-biserial correlation and effect size.
Strategy Comparison: Linear probing alone achieves the best overall toxicity reduction, while the probe + SAE combination performs best on jailbreak compliance rate. This suggests the two are complementary: the probe supplies a global direction and the SAE atoms supply local fine-tuning.
Risk Warning: Steering with unfiltered SAE atoms may increase the probability of unsafe responses, so atoms must be screened and validated before use.
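
The screening step mentioned above can be illustrated as follows. This is a sketch under stated assumptions: the source says atoms are scored by point-biserial correlation and effect size but does not give the exact effect-size measure or cutoffs, so Cohen's d and the thresholds `min_abs_r` and `min_abs_d` are hypothetical choices.

```python
import math
from statistics import mean, pstdev, variance
from typing import List, Sequence


def _split(acts: Sequence[float], labels: Sequence[int]):
    """Split activations into the label-1 group and the label-0 group."""
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    return pos, neg


def point_biserial(acts: Sequence[float], labels: Sequence[int]) -> float:
    """Point-biserial correlation between one feature's activations and 0/1 labels."""
    pos, neg = _split(acts, labels)
    n, n1, n0 = len(acts), len(pos), len(neg)
    s = pstdev(acts) if n > 0 else 0.0  # population std over all samples
    if s == 0.0 or n1 == 0 or n0 == 0:
        return 0.0
    return (mean(pos) - mean(neg)) / s * math.sqrt(n1 * n0 / (n * n))


def cohens_d(acts: Sequence[float], labels: Sequence[int]) -> float:
    """Cohen's d with a pooled sample standard deviation (assumed effect size)."""
    pos, neg = _split(acts, labels)
    n1, n0 = len(pos), len(neg)
    if n1 < 2 or n0 < 2:
        return 0.0
    diff = mean(pos) - mean(neg)
    pooled = math.sqrt(((n1 - 1) * variance(pos) + (n0 - 1) * variance(neg))
                       / (n1 + n0 - 2))
    if pooled == 0.0:
        # No within-group spread: identical groups give 0, separated groups give +/-inf.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / pooled


def select_atoms(features: List[Sequence[float]], labels: Sequence[int],
                 min_abs_r: float = 0.3, min_abs_d: float = 0.5) -> List[int]:
    """Return indices of features that pass both (hypothetical) thresholds."""
    return [i for i, acts in enumerate(features)
            if abs(point_biserial(acts, labels)) >= min_abs_r
            and abs(cohens_d(acts, labels)) >= min_abs_d]
```

Requiring both statistics to clear a threshold is one way to avoid keeping atoms that correlate with the label only weakly or only because of a few outliers.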

Section 06

Evaluation Dimensions and Engineering Practices

Evaluation Dimensions: Fluency (Pseudo Log-Likelihood, PLL), Effectiveness (ΔP), Safety (Jailbreak Compliance Rate), Generalization (Out-of-distribution Bias).
Engineering Optimizations: Validation of memory-mapped shards, streaming statistics, Float16 compression, industrial-grade checkpointing, and an I/O strategy that computes locally and defers data transfers.
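
As an illustration of the streaming statistics mentioned above, the sketch below keeps a running mean and variance with Welford's algorithm, so activation shards can be processed one at a time without holding the full set in memory. The class name `RunningStats` and the per-shard update interface are assumptions; the source does not say which streaming algorithm the project uses.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Running mean and variance via Welford's online algorithm."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, values: Iterable[float]) -> None:
        """Fold one shard of values into the running statistics."""
        for x in values:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance (n - 1 denominator); 0.0 until two values are seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for shard in ([1.0, 2.0], [3.0, 4.0]):  # toy stand-ins for activation shards
        stats.update(shard)
    print(stats.n, stats.mean, stats.variance)
```

Accumulating in Python floats (double precision) also sidesteps the precision loss that would come from summing Float16-compressed values directly.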

Section 07

Research Significance and Future Directions

Significance: Demonstrates the feasibility of inference-time safety alignment, which offers flexibility (dynamic adjustment), interpretability (SAE atoms), composability, and cost-effectiveness.
Challenges: Risks from unfiltered atoms, trade-offs between strategies, and room for improvement in generalization.
Future Directions: Extending to GPT-level models, automating atom screening, covering multilingual scenarios, and exploring how steering vectors relate to model architecture.
