Zing Forum


A New Method for LLM Activation Steering Based on Linear Optimal Control

Researchers found that large language models exhibit local linearity in inter-layer dynamics, and based on this, proposed a closed-loop activation steering method using Linear Quadratic Regulators (LQR), which outperforms existing baselines in tasks such as toxicity control and factuality adjustment.

Tags: activation steering, Linear Quadratic Regulator, LLM alignment, closed-loop control, Transformer, model safety, inference-time intervention
Published 2026-04-21 11:09 · Recent activity 2026-04-22 12:35 · Estimated read 4 min

Section 01

A New Method for LLM Activation Steering Based on Linear Optimal Control

Researchers found that large language models (LLMs) exhibit local linearity in their inter-layer dynamics. Based on this, they proposed a closed-loop activation steering method using Linear Quadratic Regulators (LQR). This method can intervene in model behavior during inference without fine-tuning, outperforms existing baselines in tasks like toxicity control and factuality adjustment, and offers both theoretical guarantees and practical deployment value.


Section 02

Background: Challenges in LLM Alignment and Limitations of Activation Steering

Traditional LLM alignment relies on fine-tuning methods such as RLHF, which are costly and hard to adjust flexibly. Activation steering emerged as an inference-time intervention technique, but existing methods are mostly open-loop: they apply a fixed intervention with no feedback mechanism, so intervention errors accumulate unchecked and effectiveness is limited.


Section 03

Key Finding: Local Linearity in Transformer Inter-Layer Dynamics

Empirical studies found that although Transformers are nonlinear systems overall, the dynamic changes between layers can be well approximated by local linear models. This property allows the use of classical control theory tools to manipulate the internal dynamics of the model.
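This local-linearity claim can be probed numerically: if a layer map f is locally linear around an activation h, then f(h + δ) − f(h) should closely match Jδ for its Jacobian J and a small perturbation δ. The sketch below checks this on a toy residual block standing in for a Transformer layer; the block, its dimensions, and the perturbation scale are all illustrative assumptions, not the paper's setup.

```python
# Sketch: testing local linearity of a layer map f around an activation h.
# A toy residual MLP block stands in for a real Transformer layer (assumption).
import numpy as np

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)

def layer(h):
    # toy nonlinear block with a residual connection
    return h + W2 @ np.tanh(W1 @ h)

def jacobian(f, h, eps=1e-6):
    # central finite-difference Jacobian of f at h
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (f(h + e) - f(h - e)) / (2 * eps)
    return J

h = rng.normal(size=d)
J = jacobian(layer, h)
delta = 0.01 * rng.normal(size=d)           # small activation perturbation
true_change = layer(h + delta) - layer(h)   # actual inter-layer change
linear_pred = J @ delta                     # first-order (linear) prediction
rel_err = np.linalg.norm(true_change - linear_pred) / np.linalg.norm(true_change)
print(f"relative linearization error: {rel_err:.4f}")
```

A small relative error means the linear model J captures the layer's local behavior, which is exactly the property that makes classical control tools applicable.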


Section 04

Method: LQR Closed-Loop Activation Steering and Adaptive Setpoints

The LLM inference process is modeled as a linear time-varying system, and the LQR framework is introduced: the state is the layer activation vector, the control input is the activation intervention, and the target is the desired semantic direction. A feedback controller computed from per-layer Jacobian matrices achieves closed-loop adjustment. Additionally, an adaptive semantic setpoint is proposed that dynamically adjusts the target state based on context.
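The core mechanism can be sketched as a standard finite-horizon, time-varying LQR: treat activations as the state of h_{t+1} = A_t h_t + B_t u_t (A_t the layer Jacobian), run the backward Riccati recursion to get per-layer gains K_t, then apply feedback u_t = −K_t(h_t − h_ref) against a semantic setpoint. Everything below is a minimal illustration with synthetic matrices; the cost weights Q, R, the identity B_t, and the feedforward term are assumptions of this sketch, not the paper's exact formulation.

```python
# Minimal LQR closed-loop steering sketch on a synthetic linear time-varying
# system. A[t] plays the role of a per-layer Jacobian; all values are placeholders.
import numpy as np

rng = np.random.default_rng(1)
d, T = 8, 6                                   # activation dim, number of layers
A = [np.eye(d) + 0.1 * rng.normal(size=(d, d)) for _ in range(T)]  # layer Jacobians
B = [np.eye(d) for _ in range(T)]             # intervention added directly to activations
Q, R = np.eye(d), np.eye(d)                   # penalize tracking error vs. intervention size

def lqr_gains(A, B, Q, R):
    # backward Riccati recursion for finite-horizon, time-varying LQR
    P = Q.copy()
    K = [None] * len(A)
    for t in reversed(range(len(A))):
        K[t] = np.linalg.solve(R + B[t].T @ P @ B[t], B[t].T @ P @ A[t])
        P = Q + A[t].T @ P @ (A[t] - B[t] @ K[t])
    return K

K = lqr_gains(A, B, Q, R)
h_ref = rng.normal(size=d)                    # desired semantic setpoint (placeholder)
h = rng.normal(size=d)                        # initial activation
err0 = np.linalg.norm(h - h_ref)
for t in range(T):
    # feedforward holds h_ref fixed; feedback contracts the tracking error
    u_ff = np.linalg.solve(B[t], (np.eye(d) - A[t]) @ h_ref)
    u = u_ff - K[t] @ (h - h_ref)
    h = A[t] @ h + B[t] @ u                   # propagate through the linearized layer
err_T = np.linalg.norm(h - h_ref)
print(f"tracking error: {err0:.3f} -> {err_T:.3f}")
```

Because the controller reacts to the current state at every layer, perturbations are corrected as they arise instead of being amplified, which is the advantage of closed-loop over open-loop steering.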


Section 05

Experimental Evidence: Outperforming Baselines Across Multiple Tasks

In tasks such as toxicity control (reducing harm while maintaining fluency), factuality adjustment (reducing hallucinations), refusal behavior regulation (balancing safety and usefulness), and arbitrary concept manipulation, the LQR method consistently outperforms existing activation steering baselines.


Section 06

Theoretical Guarantees and Practical Deployment Advantages

The LQR method provides theoretical bounds on the setpoint tracking error. Computationally, it requires no offline training, adds minimal overhead, and can be integrated into existing inference pipelines in a plug-and-play fashion.
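The plug-and-play property amounts to wrapping each layer call with a lightweight hook that adds the feedback correction at inference time, leaving the model weights untouched. The sketch below illustrates this with toy layers; the layer functions, gain matrices, and setpoint are placeholders invented for illustration, and the per-layer cost is a single matrix-vector product.

```python
# Sketch: plug-and-play steering hook around an existing inference loop.
# Layers, gains K, and setpoint h_ref are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(2)
d, T = 8, 4
layers = [(lambda h, W=0.05 * rng.normal(size=(d, d)): h + W @ h) for _ in range(T)]
K = [0.8 * np.eye(d) for _ in range(T)]   # precomputed per-layer gains (placeholder)
h_ref = np.ones(d)                        # semantic setpoint (placeholder)

def forward(h, steer=False):
    # unchanged inference loop; the hook is the only addition when steer=True
    for t, layer in enumerate(layers):
        if steer:
            h = h - K[t] @ (h - h_ref)    # closed-loop correction, O(d^2) per layer
        h = layer(h)
    return h

h0 = rng.normal(size=d)
base = forward(h0)
steered = forward(h0, steer=True)
print(np.linalg.norm(base - h_ref), np.linalg.norm(steered - h_ref))
```

Since no gradients or weight updates are involved, the hook can be switched on or off per request, which is what makes deployment in existing serving stacks straightforward.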


Section 07

Implications and Future Outlook

This study bridges control theory and deep learning, revealing concise mathematical structure inside complex AI systems. Future work can extend the approach to multimodal models, explore more sophisticated adaptive mechanisms, and broaden the theoretical guarantees to more scenarios.