RISER: A New Paradigm for Closed-Loop Real-Time Control of Large Language Models

RISER achieves closed-loop control over the internal state of large language models: a reinforcement learning policy steers the model's intermediate computation in the Transformer residual stream in real time, with the goal of defending against jailbreak attacks, deceptive alignment, and mode collapse.

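Tags: RISER, large language models, closed-loop control, reinforcement learning, AI safety, mode collapse, jailbreak attacks, PPO, Transformer, residual stream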

Published 2026-04-07 23:15 | Recent activity 2026-04-07 23:25 | Estimated read 8 min

Section 01

Introduction to RISER: A New Paradigm for Closed-Loop Real-Time Control of Large Language Models

RISER achieves closed-loop control over the internal state of large language models by deploying reinforcement learning policies ("routers") in the Transformer residual stream. This addresses the open-loop limitation of traditional alignment techniques such as RLHF and is intended to defend against jailbreak attacks, deceptive alignment, and mode collapse.

Section 02

Background: Limitations of Traditional LLM Alignment Techniques

Mainstream alignment techniques such as RLHF and Constitutional AI treat the model as a black box: they fine-tune the output distribution on preference data but provide no real-time feedback on, or control over, the model's internal reasoning. Alignment is applied once, in an open-loop "set-it-and-forget-it" fashion. This leads to three systemic issues:

1. Jailbreak attacks: Carefully crafted prompts (e.g., GCG attacks) can bypass surface-level protections;
2. Deceptive alignment: A model behaves safely during training but may change its behavior abruptly once deployed outside the evaluation environment;
3. Mode collapse: Steering the model toward specific behaviors impairs its performance on other tasks.

Section 03

Core Concept of RISER: From Open-Loop to Closed-Loop Control

RISER adopts a fundamentally different approach: instead of fine-tuning model weights, it places lightweight reinforcement learning policies (called "routers") in the Transformer residual stream to steer the model's computation away from harmful attractor basins in real time. This is closed-loop control: perception, decision-making, and action are performed token by token. By comparison, the RLHF (open-loop) process is training data → fine-tuned model, while the RISER (closed-loop) process is LLM ↔ Observer ↔ Router/RL. The key point is that RISER does not alter what the model knows; it adjusts how each token is processed, based on the semantic state of the hidden representations.
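The perceive, decide, act cycle described above can be sketched as a plain Python loop. This is only an illustration under stated assumptions: `run_closed_loop` and its `step`, `observe`, and `decide` callables are hypothetical names, and the hidden state is a short list of floats rather than a real residual-stream tensor.

```python
from typing import Callable, List

Vector = List[float]


def run_closed_loop(
    step: Callable[[Vector], Vector],    # hypothetical: one decoding step of the frozen model
    observe: Callable[[Vector], float],  # Observer: reduce the hidden state to a scalar signal
    decide: Callable[[float], float],    # Router: map the signal to a steering strength
    steering: Vector,                    # a precomputed steering direction
    state: Vector,
    num_tokens: int,
) -> List[float]:
    """Run perceive -> decide -> act once per token; return the strengths applied."""
    strengths = []
    for _ in range(num_tokens):
        state = step(state)      # the model advances by one token (weights untouched)
        signal = observe(state)  # perceive
        alpha = decide(signal)   # decide
        # act: add the scaled steering vector to the hidden state
        state = [h + alpha * v for h, v in zip(state, steering)]
        strengths.append(alpha)
    return strengths


# Toy usage: the "model" drifts along dimension 0, and the router pushes back
# only once the observed signal turns negative.
strengths = run_closed_loop(
    step=lambda s: [s[0] - 1.0, s[1]],
    observe=lambda s: s[0],
    decide=lambda x: 0.0 if x >= 0 else -x,
    steering=[1.0, 0.0],
    state=[1.5, 0.0],
    num_tokens=3,
)
print(strengths)  # [0.0, 0.5, 1.0]
```

In the toy run no intervention happens on the first token, because the signal is still non-negative. An open-loop pipeline has no equivalent of the `observe` and `decide` calls at inference time.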

Section 04

RISER Technical Architecture: Four Core Modules

RISER consists of four modules forming a closed-loop feedback system:

Observer: Uses zero-copy PyTorch forward hooks to capture the hidden state of a target layer (e.g., layer 15 out of 32 in TinyLlama) as the "semantic state", and injects steering vectors before the activations are passed on;
Vector Library: Stores steering vectors precomputed via contrastive activation analysis (mean difference method: the mean activation over positive prompts minus the mean activation over negative prompts; emotion and authenticity vectors have been extracted so far);
Router: A lightweight PPO-based agent (Actor network: Linear(2048,64)→Tanh→Linear(64,1)→Tanh; Critic network: Linear(2048,64)→Tanh→Linear(64,1); hyperparameters: learning rate 1e-3, discount factor 0.99, clip ratio 0.2);
Reward Function: R_t = λ_safe·SafetyScore(o_t) + λ_util·Coherence(o_t,a_t) - λ_cost·||a_t||, balancing safety, coherence, and intervention cost (intervening only when necessary to minimize "alignment tax").
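Three of the quantities named in the module list are simple enough to write out directly. The sketch below is an assumption-laden illustration, not the project's code: the function names are hypothetical, activations are plain lists of floats, the norm is taken to be the L2 norm, and the λ weights are placeholder defaults because the source does not state them. The clipped term is the standard PPO surrogate with the clip ratio of 0.2 listed above.

```python
import math
from typing import List, Sequence


def mean_difference_vector(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> List[float]:
    """Steering vector = mean(positive activations) - mean(negative activations)."""
    dim = len(positive[0])
    pos_mean = [sum(row[i] for row in positive) / len(positive) for i in range(dim)]
    neg_mean = [sum(row[i] for row in negative) / len(negative) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def step_reward(
    safety: float,
    coherence: float,
    action: Sequence[float],
    lam_safe: float = 1.0,  # assumed weight, not given in the source
    lam_util: float = 1.0,  # assumed weight, not given in the source
    lam_cost: float = 0.1,  # assumed weight, not given in the source
) -> float:
    """R_t = lam_safe * safety + lam_util * coherence - lam_cost * ||action||_2."""
    cost = math.sqrt(sum(a * a for a in action))
    return lam_safe * safety + lam_util * coherence - lam_cost * cost


def ppo_clipped_term(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - clip, 1 + clip) * A)."""
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)


# Toy usage with two-dimensional "activations".
vec = mean_difference_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
print(vec)  # [1.0, 2.0]
print(step_reward(safety=1.0, coherence=0.5, action=[3.0, 4.0]))
```

Because the cost term grows with the size of the action, a router trained on this reward is pushed toward small or zero interventions whenever the safety and coherence scores are already high. Since the Actor described above ends in `Linear(64,1)`, the action is a single scalar in practice, which can be passed here as a one-element list.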

Section 05

Practical Effect: Defense Demonstration Against Toxic Prompts

The following demonstration compares the model's output on a toxic prompt with and without RISER:

Mode	Output
🚫 No Protection	"I hate everything and I want to destroydestroydestroydestroydestroy..."
✅ RISER Protected	"I hate everything and I want to destroy everything. The protagonist is a young woman named Lily..."

Without protection, the model enters mode collapse and repeats the same toxic token indefinitely. With RISER enabled, the router detects the negative semantic state by taking the dot product of the hidden state with the emotion vector, then injects a corrective steering vector that pushes the model out of the collapsed state so that it produces a coherent narrative.
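The detection and correction step can be illustrated with a few lines of Python. This is a simplified sketch under assumptions: `emotion_score` and `steer_if_negative` are hypothetical names, and the fixed `threshold` and `strength` stand in for the decision that RISER's learned router would make.

```python
from typing import List, Sequence, Tuple


def emotion_score(hidden: Sequence[float], emotion_vec: Sequence[float]) -> float:
    """Dot product of the hidden state with the emotion steering vector."""
    return sum(h * e for h, e in zip(hidden, emotion_vec))


def steer_if_negative(
    hidden: Sequence[float],
    emotion_vec: Sequence[float],
    threshold: float = 0.0,  # assumed fixed cutoff; the real router is learned
    strength: float = 1.0,   # assumed fixed strength; the real router outputs it
) -> Tuple[List[float], bool]:
    """Return the (possibly corrected) hidden state and whether it was steered."""
    if emotion_score(hidden, emotion_vec) >= threshold:
        return list(hidden), False  # state looks fine: do not intervene
    corrected = [h + strength * e for h, e in zip(hidden, emotion_vec)]
    return corrected, True


# Toy usage: the emotion direction is the first axis.
print(steer_if_negative([-2.0, 1.0], [1.0, 0.0]))  # ([-1.0, 1.0], True)
print(steer_if_negative([0.5, 1.0], [1.0, 0.0]))   # ([0.5, 1.0], False)
```

The second call shows the no-intervention path: when the projection onto the emotion direction is already non-negative, the hidden state is returned unchanged.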

Section 06

RISER Development Roadmap

RISER's development is divided into four phases:

Phase	Status	Description
Phase 1	✅ Completed	Single-vector steering on TinyLlama-1.1B, with manual KV caching in RiserEnv
Phase 2	🔜 Planned	Integrate Sparse Autoencoders (SAE) and support Llama-3-8B for richer feature decomposition
Phase 3	🔜 Planned	Adversarial training against GCG attacks to strengthen the router's defense against prompt injection
Phase 4	🔮 In Research	Multi-layer steering and a "thought firewall" for enterprise deployment

Section 07

Conclusion and Outlook

RISER represents an important shift in large language model safety research: from controlling outputs externally to intervening in the model's internal state in real time. Because the model weights are left unchanged, the method aims to strengthen safety while avoiding the performance losses that fine-tuning can cause. RISER provides researchers and developers with a complete experimental framework covering steering vector extraction, PPO router training, and an adversarial defense demonstration. As sparse autoencoder and multi-layer steering techniques mature, the approach is expected to enable finer-grained and more powerful control over model behavior.
