# RISER: A New Paradigm for Closed-Loop Real-Time Control of Large Language Models

> RISER aims at closed-loop control of a large language model's internal state: reinforcement learning policies steer activations in the Transformer residual stream in real time, with the goal of defending against jailbreak attacks, deceptive alignment, and mode collapse.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-07T15:15:45.000Z
- Last activity: 2026-04-07T15:25:07.673Z
- Popularity: 154.8
- Keywords: RISER, large language models, closed-loop control, reinforcement learning, AI safety, mode collapse, jailbreak attacks, PPO, Transformer, residual stream
- Page link: https://www.zingnex.cn/en/forum/thread/riser
- Canonical: https://www.zingnex.cn/forum/thread/riser

---

## Introduction to RISER: A New Paradigm for Closed-Loop Real-Time Control of Large Language Models

RISER deploys reinforcement learning policies, called routers, in the Transformer residual stream to gain closed-loop control over a large language model's internal state. It targets the open-loop limitation of traditional alignment techniques such as RLHF and is designed to defend against jailbreak attacks, deceptive alignment, and mode collapse.

## Background: Limitations of Traditional LLM Alignment Techniques

Mainstream alignment techniques such as RLHF and Constitutional AI treat the model as a black box: they fine-tune the output distribution on preference data but provide no real-time feedback on, or control over, the internal reasoning process. This is an open-loop, "set-it-and-forget-it" mode, and it leads to three systemic issues:
1. Jailbreak attacks: Well-crafted prompts (e.g., GCG attacks) can bypass surface-level protections;
2. Deceptive alignment: A model behaves safely during training but may change behavior abruptly once deployed outside the evaluation environment;
3. Mode collapse: Steering the model toward specific behaviors impairs its performance on other tasks.

## Core Concept of RISER: From Open-Loop to Closed-Loop Control

RISER adopts a fundamentally different approach: instead of fine-tuning model weights, it places lightweight reinforcement learning policies (called "routers") in the Transformer residual stream to steer the model's computation away from harmful attractor basins in real time. This is closed-loop control: perception, decision-making, and action are performed token by token. By comparison, the RLHF (open-loop) process is training data → fine-tuned model, whereas the RISER (closed-loop) process is LLM ↔ Observer ↔ Router/RL. The key point is that RISER does not alter what the model knows; it adjusts how each token is processed based on the semantic state of the hidden representations.
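The perceive, decide, act cycle can be sketched with a toy example. This is a minimal illustration, not RISER's code: the scalar "state", the drift of +1.0 per token, the threshold of 2.0, and the correction of -1.5 are all hypothetical stand-ins for a hidden state, a harmful attractor, and a router policy.

```python
from typing import Callable, List, Tuple


def generate_closed_loop(
    step: Callable[[float, float], float],
    observe: Callable[[float], float],
    route: Callable[[float], float],
    state: float,
    n_tokens: int,
) -> Tuple[float, List[float]]:
    """Run perceive -> decide -> act once per generated token."""
    actions: List[float] = []
    for _ in range(n_tokens):
        observation = observe(state)   # perception: read the internal state
        action = route(observation)    # decision: the router picks an intervention
        state = step(state, action)    # action: the intervention shapes the next state
        actions.append(action)
    return state, actions


# Hypothetical toy dynamics: the "model" drifts +1.0 per token toward a harmful region.
def toy_step(state: float, action: float) -> float:
    return state + 1.0 + action


def toy_observe(state: float) -> float:
    return state


def no_router(observation: float) -> float:
    return 0.0  # open loop: never intervenes


def threshold_router(observation: float) -> float:
    # Assumed rule: push back only once the state has crossed the threshold.
    return -1.5 if observation > 2.0 else 0.0


open_final, open_actions = generate_closed_loop(toy_step, toy_observe, no_router, 0.0, 5)
closed_final, closed_actions = generate_closed_loop(
    toy_step, toy_observe, threshold_router, 0.0, 5
)
print(open_final, closed_final)  # prints: 5.0 2.0
```

Without a router the state drifts freely; with one, the correction is applied only on the tokens where the observation crosses the threshold, which mirrors the "intervene only when necessary" idea.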

## RISER Technical Architecture: Four Core Modules

RISER consists of four modules that together form a closed-loop feedback system:
1. Observer: Uses zero-copy PyTorch forward hooks to capture the hidden state of a target layer (e.g., layer 15 out of 32 in TinyLlama) as the "semantic state", and injects steering vectors into the activations before they are passed on to the next layer;
2. Vector Library: Stores steering vectors precomputed by contrastive activation analysis (mean-difference method: the mean activation over positive prompts minus the mean activation over negative prompts); emotion and authenticity vectors have been extracted so far;
3. Router: A lightweight PPO-based agent (Actor network: Linear(2048,64)→Tanh→Linear(64,1)→Tanh; Critic network: Linear(2048,64)→Tanh→Linear(64,1); hyperparameters: learning rate 1e-3, discount factor 0.99, clip ratio 0.2);
4. Reward Function: R_t = λ_safe·SafetyScore(o_t) + λ_util·Coherence(o_t,a_t) - λ_cost·||a_t||, which balances safety, coherence, and intervention cost; the cost term encourages the router to intervene only when necessary, minimizing the "alignment tax".
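The following PyTorch sketch shows how the Observer, Router, and Reward Function could fit together. It is an illustration under stated assumptions rather than the project's implementation: the `Observer` class, the `make_actor`/`make_critic`/`reward` helpers, the λ weights, and the single `nn.Linear` standing in for a Transformer layer are all hypothetical. Only the layer shapes of the Actor and Critic and the form of the reward come from the description above. The PPO update itself is omitted.

```python
import torch
import torch.nn as nn

HIDDEN = 2048  # router input width quoted in the text


def make_actor(hidden: int = HIDDEN) -> nn.Module:
    # Linear(2048,64) -> Tanh -> Linear(64,1) -> Tanh: steering strength in [-1, 1]
    return nn.Sequential(nn.Linear(hidden, 64), nn.Tanh(), nn.Linear(64, 1), nn.Tanh())


def make_critic(hidden: int = HIDDEN) -> nn.Module:
    # Linear(2048,64) -> Tanh -> Linear(64,1): unbounded value estimate
    return nn.Sequential(nn.Linear(hidden, 64), nn.Tanh(), nn.Linear(64, 1))


class Observer:
    """Forward hook: records the last-token hidden state and adds a scaled steering vector."""

    def __init__(self, actor: nn.Module, steering_vector: torch.Tensor):
        self.actor = actor
        self.steering_vector = steering_vector
        self.last_state = None
        self.last_action = None

    def __call__(self, module, inputs, output):
        # Assumes `output` is a (batch, seq, hidden) tensor. Decoder layers in some
        # libraries return tuples, which would need unpacking here.
        state = output[:, -1, :].detach()
        with torch.no_grad():
            action = self.actor(state)  # shape (batch, 1)
        self.last_state, self.last_action = state, action
        # Returning a value from a forward hook replaces the module's output.
        return output + action.unsqueeze(1) * self.steering_vector


def reward(safety, coherence, action, lam_safe=1.0, lam_util=1.0, lam_cost=0.1):
    """R_t = lam_safe*Safety + lam_util*Coherence - lam_cost*||a_t|| (weights are assumed)."""
    return lam_safe * safety + lam_util * coherence - lam_cost * action.norm(dim=-1)


torch.manual_seed(0)
layer = nn.Linear(HIDDEN, HIDDEN)  # toy stand-in for the hooked Transformer layer
actor, critic = make_actor(), make_critic()
vector = torch.randn(HIDDEN)
vector = vector / vector.norm()
observer = Observer(actor, vector)

x = torch.randn(1, 3, HIDDEN)
with torch.no_grad():
    handle = layer.register_forward_hook(observer)
    steered = layer(x)   # hook captures the state and injects the steering vector
    handle.remove()
    plain = layer(x)     # same input without the hook
    value = critic(observer.last_state)

r_idle = reward(torch.tensor([1.0]), torch.tensor([0.5]), torch.zeros(1, 1))
r_full = reward(torch.tensor([1.0]), torch.tensor([0.5]), torch.ones(1, 1))
```

In this sketch the safety and coherence scores are passed in as plain tensors because the text does not specify how they are computed. The difference between `r_idle` and `r_full` shows the cost term: a larger intervention lowers the reward when safety and coherence are unchanged.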

## Practical Effect: Defense Demonstration Against Toxic Prompts

In the demonstration, RISER changes how the model responds to a toxic prompt:

| Mode | Output |
|------|------|
| 🚫 No Protection | "I hate everything and I want to destroydestroydestroydestroydestroy..." |
| ✅ RISER Protected | "I hate everything and I want to destroy everything. The protagonist is a young woman named Lily..." |

Without protection, the model enters mode collapse and repeats a toxic word indefinitely. With RISER enabled, the router detects the negative semantic state through the dot product of the hidden state with the emotion vector, injects a corrective steering vector, pushes the model out of the collapsed state, and the model generates a coherent narrative.
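The detection and correction step can be illustrated in a few lines of plain Python. This is a simplified sketch with made-up 3-dimensional vectors: the function names, the threshold of 0.0, and the strength `alpha=0.5` are assumptions, and in RISER the steering strength would come from the learned router rather than a fixed rule.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a list of equally sized vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Mean-difference method: mean(positive activations) - mean(negative activations)."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def steer_if_negative(
    hidden: Vector, vector: Vector, threshold: float = 0.0, alpha: float = 0.5
) -> Vector:
    """Add alpha * vector only when the hidden state projects below the threshold."""
    if dot(hidden, vector) >= threshold:
        return list(hidden)  # state already on the positive side: no intervention
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Made-up activations for "positive" and "negative" prompts.
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]
emotion = steering_vector(positive, negative)   # [4.0, 0.0, 0.0]

collapsed = [-1.0, 5.0, 1.0]                    # projects negatively onto `emotion`
healthy = [2.0, 1.0, 1.0]                       # projects positively onto `emotion`

print(steer_if_negative(collapsed, emotion))    # [1.0, 5.0, 1.0]
print(steer_if_negative(healthy, emotion))      # [2.0, 1.0, 1.0]
```

Only the collapsed state is modified; the healthy state passes through unchanged.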

## RISER Development Roadmap

RISER's development is divided into four phases:

| Phase | Status | Description |
|------|------|------|
| Phase 1 | ✅ Completed | Single-vector steering on TinyLlama-1.1B; manual KV caching in RiserEnv |
| Phase 2 | 🔜 Planned | Integrate Sparse Autoencoders (SAE) for richer feature decomposition; support Llama-3-8B |
| Phase 3 | 🔜 Planned | Adversarial training against GCG attacks to strengthen the router's defense against prompt injection |
| Phase 4 | 🔮 In Research | Multi-layer steering; a "thought firewall" for enterprise deployment |

## Conclusion and Outlook

RISER represents an important shift in large language model safety research: from external control of outputs to real-time intervention on internal state. Because it leaves model weights untouched, the approach aims to improve safety while avoiding the performance losses that traditional fine-tuning can cause. RISER provides researchers and developers with a complete experimental framework covering steering vector extraction, PPO router training, and an adversarial defense demonstration. As sparse autoencoder and multi-layer steering techniques mature, they are expected to enable finer-grained and more powerful model control mechanisms.
