
RDT: A Training-Free Safety Alignment Method for Multimodal Agents

RDT achieves safety alignment without retraining by transferring the safety refusal direction of LLMs to vision-language-action (VLA) models, providing a new approach for the safe control of robotic agents.

Tags: safety alignment, vision-language-action models, RLHF, refusal direction, agent safety, OpenVLA, inference-time intervention, embodied intelligence
Published 2026-04-22 21:30 · Recent activity 2026-04-22 22:00 · Estimated read 6 min

Section 01

[Introduction] RDT: A New Training-Free Safety Alignment Method for Multimodal Agents

This article introduces a safety alignment method for multimodal agents called Refusal Direction Transfer (RDT). By transferring the refusal direction from safety-aligned LLMs (e.g., Llama-2-7b-chat) to vision-language-action (VLA) models (e.g., OpenVLA), this method achieves safety alignment without retraining, addressing the safety blind spot in the action space of VLA models and providing a new approach for the safe control of robotic agents.


Section 02

Problem Background: Structural Safety Risks of VLA Models

As LLMs are integrated with visual perception and robot control, VLA models have become the core of embodied intelligence. OpenVLA, for example, is built on Llama-2-7b-base, which has not been aligned via RLHF, and its action tokens are encoded in a subspace orthogonal to natural language. As a result, the "harmful/harmless" discrimination axis learned through safety alignment fails at action token positions, and the model executes any instruction, including harmful ones.


Section 03

Core Idea of RDT: Cross-Model Geometric Transfer and Two Variants

The core of RDT is to extract the refusal direction from a safety-aligned LLM and inject it into the VLA model's hidden states at the action token positions during inference. Two key insights: 1) pre-training initialization gives the two models shared geometric structure (RLHF does not completely reshape internal representations); 2) the action token positions are a safety blind spot (linear probe AUC ≈ 0.5, i.e., chance level). Two variants: RDT injects only at action token positions during decoding; RDT+ additionally injects at text token positions during pre-filling. Both are training-free, require minimal code, and add less than 5% inference latency.
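The two variants can be stated compactly. The notation here is an assumption: h_t is the hidden state at position t, r̂ the unit-normalized refusal direction, and the α coefficients are those named in the implementation section:

```latex
% Refusal direction (mean-difference protocol):
\hat r = \frac{\mu_{\text{harmful}} - \mu_{\text{benign}}}
              {\lVert \mu_{\text{harmful}} - \mu_{\text{benign}} \rVert}
% RDT: inject at action-token positions t \in A during decoding:
h_t' = h_t + \alpha_{\text{act}}\,\hat r, \qquad t \in A
% RDT+: additionally inject at text-token positions t \in T during pre-filling:
h_t' = h_t + \alpha_{\text{text}}\,\hat r, \qquad t \in T
```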


Section 04

Technical Implementation: Refusal Direction Extraction and Injection Mechanism

Refusal direction extraction: using the mean-difference protocol, collect hidden states for harmful and benign prompts from Llama-2-7b-chat and compute the difference of their means (optionally extracting a rank-k subspace via SVD). Injection mechanism: implemented with PyTorch forward hooks; the direction is added at text token positions during pre-filling (coefficient α_text) and at action token positions during decoding (coefficient α_act), with position masks distinguishing text from action tokens.
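The two mechanisms can be sketched in a few lines of PyTorch. This is a minimal illustration on a toy module, not the repository's actual API: the function names (`extract_refusal_direction`, `make_injection_hook`) and the α value are assumptions.

```python
# Sketch: mean-difference extraction + forward-hook injection on a toy module.
# Names and the alpha value are illustrative assumptions, not the repo's API.
import torch
import torch.nn as nn

def extract_refusal_direction(h_harmful: torch.Tensor,
                              h_benign: torch.Tensor) -> torch.Tensor:
    """Mean-difference protocol: r = mean(harmful) - mean(benign), unit-normalized."""
    r = h_harmful.mean(dim=0) - h_benign.mean(dim=0)
    return r / r.norm()

def make_injection_hook(r: torch.Tensor, alpha: float, mask: torch.Tensor):
    """Forward hook adding alpha * r at sequence positions where mask is True."""
    def hook(module, inputs, output):
        out = output.clone()
        out[:, mask, :] += alpha * r   # position mask picks out action tokens
        return out                      # returned tensor replaces the output
    return hook

# Toy demo: nn.Identity stands in for a transformer block's hidden states.
torch.manual_seed(0)
d = 8
layer = nn.Identity()
r = extract_refusal_direction(torch.randn(32, d) + 1.0, torch.randn(32, d))
action_mask = torch.tensor([False, False, True, True])  # last 2 tokens = "actions"
handle = layer.register_forward_hook(
    make_injection_hook(r, alpha=4.0, mask=action_mask))

h = torch.zeros(1, 4, d)       # (batch, seq, hidden)
h_injected = layer(h)          # text positions untouched, action positions shifted
handle.remove()
```

Returning a tensor from a forward hook replaces the module's output, which is what makes the intervention training-free: no weights change, only activations at masked positions.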


Section 05

Experimental Validation: Effectiveness and Specificity of RDT

Key experimental findings: 1) confirmation of the safety gap (text token AUC > 0.85, action token AUC ≈ 0.5); 2) cross-model transfer is effective (the compliance rate on harmful actions drops by over 80%); 3) RDT+ achieves semantic refusal (action logits concentrate in the zero-motion bin); 4) directional specificity is significant (the real refusal direction clearly outperforms random vectors).
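Finding (1) rests on a linear-probe diagnostic: fit a linear classifier on hidden states labelled harmful/benign and read off the held-out ROC AUC; chance-level AUC at a position means no harmfulness signal is linearly decodable there. A minimal sketch with synthetic activations standing in for the real ones (all names and the data are illustrative):

```python
# Sketch of the linear-probe AUC diagnostic; synthetic data stands in for
# real text-token / action-token activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d, n = 16, 400

def probe_auc(X: np.ndarray, y: np.ndarray) -> float:
    """Fit a linear probe on one half, report ROC AUC on the held-out half."""
    clf = LogisticRegression(max_iter=1000).fit(X[: n // 2], y[: n // 2])
    return roc_auc_score(y[n // 2:], clf.decision_function(X[n // 2:]))

y = rng.integers(0, 2, n)
# "Text-token" activations: harmful/benign classes are linearly separated.
X_text = rng.normal(size=(n, d)) + 2.0 * y[:, None]
# "Action-token" activations: no harmfulness signal at all.
X_action = rng.normal(size=(n, d))

auc_text, auc_action = probe_auc(X_text, y), probe_auc(X_action, y)
# Expect auc_text near 1.0 and auc_action near 0.5 (chance level).
```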


Section 06

Code Structure and Quick Usage Guide

Code structure: core implementation (rdt_intervention.py, etc.), baseline comparisons (baseline_adashield.py, etc.), and execution scripts (05_sanity_check.py, etc.). Quick start: run code/scripts/05_sanity_check.py (the HF cache path, output directory, etc. must be specified). Hardware requirements: a single GPU with 24 GB+ VRAM (e.g., RTX 5090), CUDA 12.8+, and pinned versions of PyTorch, transformers, and other dependencies.
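Before launching the sanity-check script, it can save time to verify the hardware prerequisites stated above. A small helper sketch (the function name and threshold handling are assumptions, not part of the repository):

```python
# Sketch: check for a CUDA GPU with enough VRAM before running the scripts.
import torch

def check_prereqs(min_vram_gb: float = 24.0) -> bool:
    """Return True if a CUDA device with at least min_vram_gb of memory exists."""
    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(0)
    return props.total_memory / 1024**3 >= min_vram_gb

if __name__ == "__main__":
    print("ready" if check_prereqs() else "missing GPU or insufficient VRAM")
```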


Section 07

Significance and Future Research Directions

The significance of RDT lies in: 1) extending safety alignment to the action space, which is more critical for embodied intelligence; 2) its training-free nature, which reduces deployment costs. Future directions: exploring the transfer of other alignment axes (helpfulness, honesty) and extending the approach to modalities such as audio and haptics.


Section 08

Summary: Value and Insights of RDT

Through cross-model geometric transfer, RDT implants safety-refusal capability into VLA models without retraining. It is not only a practical safety tool but also deepens our understanding of the internal structure of multimodal models. As embodied intelligence matures, such safety alignment methods will become key technologies for the reliability of AI systems.