Zing Forum


RISER: A New Paradigm for Closed-Loop Real-Time Control of Large Language Models

RISER achieves closed-loop control of the internal state of large language models by real-time routing of thought processes in the Transformer residual stream using reinforcement learning strategies, effectively preventing jailbreak attacks, deceptive alignment, and mode collapse issues.

Tags: RISER, large language models, closed-loop control, reinforcement learning, AI safety, mode collapse, jailbreak attacks, PPO, Transformer residual stream
Published 2026-04-07 23:15 · Recent activity 2026-04-07 23:25 · Estimated read 8 min

Section 01

Introduction to RISER: A New Paradigm for Closed-Loop Real-Time Control of Large Language Models

RISER achieves closed-loop control of the internal state of large language models by deploying lightweight reinforcement learning policies ("routers") in the Transformer residual stream. This addresses the open-loop limitation of traditional alignment techniques such as RLHF and defends against jailbreak attacks, deceptive alignment, and mode collapse.


Section 02

Background: Limitations of Traditional LLM Alignment Techniques

Mainstream alignment techniques such as RLHF and Constitutional AI treat the model as a black box: they fine-tune the output distribution on human preference data but provide no real-time feedback on, or control over, the internal reasoning process (an open-loop, "set-it-and-forget-it" mode). This leads to three systemic issues:

  1. Jailbreak attacks: carefully crafted prompts (e.g., GCG attacks) can bypass surface-level protections;
  2. Deceptive alignment: a model behaves safely during training but may change behavior abruptly once deployed outside the evaluation environment;
  3. Mode collapse: steering the model toward one specific behavior degrades its performance on other tasks.


Section 03

Core Concept of RISER: From Open-Loop to Closed-Loop Control

RISER adopts a fundamentally different approach: instead of fine-tuning model weights, it places lightweight reinforcement learning policies (called "routers") in the Transformer residual stream to steer thought processes away from harmful attractor basins in real time. This is closed-loop control: perception, decision, and action are performed token by token. By comparison, the RLHF (open-loop) pipeline is training data → fine-tuned model, while the RISER (closed-loop) pipeline is LLM ↔ Observer ↔ Router/RL. The key point: RISER does not alter the model's knowledge; it adjusts how each token is processed based on the semantic state of the hidden representations.
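The token-by-token perceive → decide → act loop can be sketched as follows. This is a minimal illustration, not RISER's actual code: `router_policy`, `apply_steering`, and the random stand-in for the hidden state are all hypothetical, and a real system would run the model's next forward pass between steps.

```python
import torch

torch.manual_seed(0)

HIDDEN_DIM = 2048  # TinyLlama-1.1B hidden size, per the architecture description

# Hypothetical precomputed steering vector (unit-normalized).
steering_vector = torch.randn(HIDDEN_DIM)
steering_vector = steering_vector / steering_vector.norm()

def router_policy(hidden: torch.Tensor) -> float:
    """Stand-in for the PPO actor: map the semantic state to an
    intervention strength in [-1, 1] (Tanh-bounded, as in the article)."""
    return torch.tanh(hidden.mean()).item()

def apply_steering(hidden: torch.Tensor, strength: float) -> torch.Tensor:
    """Act: inject the steering vector scaled by the router's decision."""
    return hidden + strength * steering_vector

# Closed loop over a few "tokens": perceive -> decide -> act.
hidden = torch.randn(HIDDEN_DIM)          # perception (toy semantic state)
for step in range(5):
    strength = router_policy(hidden)      # decision
    hidden = apply_steering(hidden, strength)  # action
    # (a real system would run the next forward pass here to re-perceive)
```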


Section 04

RISER Technical Architecture: Four Core Modules

RISER consists of four modules forming a closed-loop feedback system:

  1. Observer: Uses zero-copy PyTorch forward hooks to capture the hidden state of a target layer (e.g., layer 15 of 32 in TinyLlama) as the "semantic state", and injects steering vectors before passing the activations onward;
  2. Vector Library: Stores precomputed steering vectors via contrastive activation analysis (methodology: mean difference method, i.e., mean of positive prompts minus mean of negative prompts; emotion and authenticity vectors have been extracted);
  3. Router: A lightweight PPO-based agent (Actor network: Linear(2048,64)→Tanh→Linear(64,1)→Tanh; Critic network: Linear(2048,64)→Tanh→Linear(64,1); hyperparameters: learning rate 1e-3, discount factor 0.99, clip ratio 0.2);
  4. Reward Function: R_t = λ_safe·SafetyScore(o_t) + λ_util·Coherence(o_t,a_t) - λ_cost·||a_t||, balancing safety, coherence, and intervention cost (intervening only when necessary to minimize "alignment tax").
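The Observer's hook mechanism can be sketched with a standard PyTorch forward hook. The two-layer toy model below stands in for TinyLlama (RISER's real hook target is a Transformer block); the hook reads the residual-stream activation and, by returning a modified tensor, replaces the layer's output with the steered version before it flows onward.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 2048

# Toy stand-in for the model; target_layer plays the role of the monitored block.
model = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.Linear(HIDDEN, HIDDEN))
target_layer = model[0]

steering = torch.randn(HIDDEN)  # hypothetical steering vector
captured = {}

def observer_hook(module, inputs, output):
    captured["state"] = output.detach()  # perception: grab the semantic state
    return output + steering             # returning a tensor replaces the output

handle = target_layer.register_forward_hook(observer_hook)
out = model(torch.randn(1, HIDDEN))
handle.remove()  # detach the Observer when done
```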
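The Vector Library's mean-difference method is simple enough to state directly: the steering vector is the mean activation over positive prompts minus the mean over negative prompts. In this sketch random tensors stand in for real hidden states captured by the Observer, and the final unit-normalization is a common convention, not necessarily RISER's.

```python
import torch

torch.manual_seed(0)
HIDDEN = 2048

# Pretend these were captured at the target layer for two contrastive prompt sets.
positive_acts = torch.randn(16, HIDDEN) + 0.5   # e.g. calm / truthful prompts
negative_acts = torch.randn(16, HIDDEN) - 0.5   # e.g. toxic / deceptive prompts

# Mean difference: mean(positive) - mean(negative).
steering_vec = positive_acts.mean(dim=0) - negative_acts.mean(dim=0)
steering_vec = steering_vec / steering_vec.norm()  # unit-normalize (assumed)
```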
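The Router's stated architecture translates directly into PyTorch. The actor and critic layers below reproduce the dimensions given above; everything else (the λ weights, and SafetyScore/Coherence being passed in as plain floats) is a placeholder assumption for illustration.

```python
import torch
import torch.nn as nn

HIDDEN = 2048

# Actor: Linear(2048,64) -> Tanh -> Linear(64,1) -> Tanh, so actions lie in [-1, 1].
actor = nn.Sequential(nn.Linear(HIDDEN, 64), nn.Tanh(),
                      nn.Linear(64, 1), nn.Tanh())
# Critic: Linear(2048,64) -> Tanh -> Linear(64,1), an unbounded value estimate.
critic = nn.Sequential(nn.Linear(HIDDEN, 64), nn.Tanh(),
                       nn.Linear(64, 1))

# Learning rate 1e-3 per the article (gamma=0.99 and clip=0.2 apply in the
# PPO update itself, which is omitted here).
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def reward(safety: float, coherence: float, action: torch.Tensor,
           lam_safe: float = 1.0, lam_util: float = 1.0,
           lam_cost: float = 0.1) -> float:
    """R_t = lam_safe*SafetyScore + lam_util*Coherence - lam_cost*||a_t||.
    The lambda values here are illustrative, not the article's."""
    return lam_safe * safety + lam_util * coherence - lam_cost * action.norm().item()

action = actor(torch.randn(1, HIDDEN))
value = critic(torch.randn(1, HIDDEN))
r = reward(safety=1.0, coherence=0.8, action=action)
```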

Section 05

Practical Effect: Defense Demonstration Against Toxic Prompts

RISER shows significant defense effects against toxic prompts:

Mode | Output
🚫 No Protection | "I hate everything and I want to destroydestroydestroydestroydestroy..."
✅ RISER Protected | "I hate everything and I want to destroy everything. The protagonist is a young woman named Lily..."
Without protection, the model enters mode collapse and repeats a toxic word indefinitely; with RISER enabled, the router detects the negative semantic state via a dot product with the emotion vector, injects a corrective steering vector, breaks the model out of the collapsed state, and produces a coherent narrative.
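The detection step described here amounts to a thresholded cosine similarity between the current hidden state and the precomputed emotion vector. This sketch uses random stand-in vectors; the threshold, the sign convention (emotion vector pointing toward the negative basin), and the choice of corrective vector are all assumptions.

```python
import torch

torch.manual_seed(0)
HIDDEN = 2048

# Hypothetical unit emotion vector pointing toward the negative/toxic direction.
emotion_vec = torch.randn(HIDDEN)
emotion_vec = emotion_vec / emotion_vec.norm()
corrective_vec = -emotion_vec  # push the state away from that direction

def maybe_correct(hidden: torch.Tensor, threshold: float = 0.0):
    """Score the state against the emotion vector; inject a correction
    if it is drifting toward the negative attractor basin."""
    score = torch.dot(hidden / hidden.norm(), emotion_vec).item()
    if score > threshold:
        hidden = hidden + corrective_vec
    return hidden, score

steered, score = maybe_correct(torch.randn(HIDDEN))
```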

Section 06

RISER Development Roadmap

RISER's development is divided into four phases:

Phase | Status | Description
Phase 1 | ✅ Completed | TinyLlama-1.1B single-vector steering; manual KV caching in RiserEnv
Phase 2 | 🔜 Planned | Integrate sparse autoencoders (SAEs) to support Llama-3-8B; richer feature decomposition
Phase 3 | 🔜 Planned | Adversarial training against GCG attacks; harden the router against prompt injection
Phase 4 | 🔮 In Research | Multi-layer steering; a "thought firewall" for enterprise deployment

Section 07

Conclusion and Outlook

RISER represents an important shift in large language model safety research: from external control of outputs to real-time intervention on internal state. The approach not only strengthens safety but also avoids the performance losses that fine-tuning incurs. RISER gives researchers and developers a complete experimental framework covering steering-vector extraction, PPO router training, and adversarial defense demonstrations. As sparse autoencoders and multi-layer steering mature, finer-grained and more powerful model control mechanisms are expected to follow.