
RDT: A Training-Free Safety Alignment Method for Multimodal Agents

RDT achieves safety alignment without retraining by transferring the safety refusal direction of LLMs to vision-language-action (VLA) models, providing a new approach for the safe control of robotic agents.

Tags: safety alignment, vision-language-action models, RLHF, refusal direction, agent safety, OpenVLA, inference-time intervention, embodied intelligence
Published 2026-04-22 21:30 · Recent activity 2026-04-22 22:00 · Estimated read 6 min

Section 01

[Introduction] RDT: A New Training-Free Safety Alignment Method for Multimodal Agents

This article introduces a safety alignment method for multimodal agents called Refusal Direction Transfer (RDT). By transferring the refusal direction from safety-aligned LLMs (e.g., Llama-2-7b-chat) to vision-language-action (VLA) models (e.g., OpenVLA), this method achieves safety alignment without retraining, addressing the safety blind spot in the action space of VLA models and providing a new approach for the safe control of robotic agents.


Section 02

Problem Background: Structural Safety Risks of VLA Models

As LLMs are integrated with visual perception and robot control, VLA models have become the core of embodied intelligence. OpenVLA, for example, is built on Llama-2-7b-base, which has not been aligned via RLHF, and its action tokens are encoded in a subspace orthogonal to natural language. As a result, the "harmful/harmless" discrimination axis learned through safety alignment fails at action token positions, and the model executes any instruction, including harmful ones.


Section 03

Core Idea of RDT: Cross-Model Geometric Transfer and Two Variants

The core of RDT is to extract the refusal direction from a safety-aligned LLM and inject it into the VLA model's hidden states at the action token positions during inference. Two key insights: 1) pre-training initialization gives the two models shared geometric structure (RLHF does not completely reshape internal representations); 2) the action token positions are a safety blind spot (linear probe AUC ≈ 0.5, i.e., chance level). Two variants: RDT injects only at action token positions during decoding; RDT+ additionally injects at text token positions during pre-filling. Both are training-free, require minimal code, and add less than 5% inference latency.
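The two variants can be stated compactly. The notation here is an assumption: h_t is the hidden state at position t, r̂ the unit-normalized refusal direction, and the α coefficients are those named in the implementation section:

```latex
% Refusal direction (mean-difference protocol):
\hat r = \frac{\mu_{\text{harmful}} - \mu_{\text{benign}}}
              {\lVert \mu_{\text{harmful}} - \mu_{\text{benign}} \rVert}
% RDT: inject at action-token positions t \in A during decoding:
h_t' = h_t + \alpha_{\text{act}}\,\hat r, \qquad t \in A
% RDT+: additionally inject at text-token positions t \in T during pre-filling:
h_t' = h_t + \alpha_{\text{text}}\,\hat r, \qquad t \in T
```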


Section 04

Technical Implementation: Refusal Direction Extraction and Injection Mechanism

Refusal direction extraction: using the mean-difference protocol, collect hidden states for harmful and benign prompts from Llama-2-7b-chat and compute the difference of their means (optionally extracting a rank-k subspace via SVD). Injection mechanism: implemented with PyTorch forward hooks; the direction is added at text token positions during pre-filling (coefficient α_text) and at action token positions during decoding (coefficient α_act), with position masks distinguishing text from action tokens.
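The two mechanisms can be sketched in a few lines of PyTorch. This is a minimal illustration on a toy module, not the repository's actual API: the function names (`extract_refusal_direction`, `make_injection_hook`) and the α value are assumptions.

```python
# Sketch: mean-difference extraction + forward-hook injection on a toy module.
# Names and the alpha value are illustrative assumptions, not the repo's API.
import torch
import torch.nn as nn

def extract_refusal_direction(h_harmful: torch.Tensor,
                              h_benign: torch.Tensor) -> torch.Tensor:
    """Mean-difference protocol: r = mean(harmful) - mean(benign), unit-normalized."""
    r = h_harmful.mean(dim=0) - h_benign.mean(dim=0)
    return r / r.norm()

def make_injection_hook(r: torch.Tensor, alpha: float, mask: torch.Tensor):
    """Forward hook adding alpha * r at sequence positions where mask is True."""
    def hook(module, inputs, output):
        out = output.clone()
        out[:, mask, :] += alpha * r   # position mask picks out action tokens
        return out                      # returned tensor replaces the output
    return hook

# Toy demo: nn.Identity stands in for a transformer block's hidden states.
torch.manual_seed(0)
d = 8
layer = nn.Identity()
r = extract_refusal_direction(torch.randn(32, d) + 1.0, torch.randn(32, d))
action_mask = torch.tensor([False, False, True, True])  # last 2 tokens = "actions"
handle = layer.register_forward_hook(
    make_injection_hook(r, alpha=4.0, mask=action_mask))

h = torch.zeros(1, 4, d)       # (batch, seq, hidden)
h_injected = layer(h)          # text positions untouched, action positions shifted
handle.remove()
```

Returning a tensor from a forward hook replaces the module's output, which is what makes the intervention training-free: no weights change, only activations at masked positions.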


Section 05

Experimental Validation: Effectiveness and Specificity of RDT

Key experimental findings: 1) confirmation of the safety gap (text token AUC > 0.85, action token AUC ≈ 0.5); 2) cross-model transfer is effective (the compliance rate on harmful actions drops by over 80%); 3) RDT+ achieves semantic refusal (action logits concentrate in the zero-motion bin); 4) directional specificity is significant (the real refusal direction clearly outperforms random vectors).
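Finding (1) rests on a linear-probe diagnostic: fit a linear classifier on hidden states labelled harmful/benign and read off the held-out ROC AUC; chance-level AUC at a position means no harmfulness signal is linearly decodable there. A minimal sketch with synthetic activations standing in for the real ones (all names and the data are illustrative):

```python
# Sketch of the linear-probe AUC diagnostic; synthetic data stands in for
# real text-token / action-token activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d, n = 16, 400

def probe_auc(X: np.ndarray, y: np.ndarray) -> float:
    """Fit a linear probe on one half, report ROC AUC on the held-out half."""
    clf = LogisticRegression(max_iter=1000).fit(X[: n // 2], y[: n // 2])
    return roc_auc_score(y[n // 2:], clf.decision_function(X[n // 2:]))

y = rng.integers(0, 2, n)
# "Text-token" activations: harmful/benign classes are linearly separated.
X_text = rng.normal(size=(n, d)) + 2.0 * y[:, None]
# "Action-token" activations: no harmfulness signal at all.
X_action = rng.normal(size=(n, d))

auc_text, auc_action = probe_auc(X_text, y), probe_auc(X_action, y)
# Expect auc_text near 1.0 and auc_action near 0.5 (chance level).
```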


Section 06

Code Structure and Quick Usage Guide

Code structure: core implementation (rdt_intervention.py, etc.), baseline comparisons (baseline_adashield.py, etc.), and execution scripts (05_sanity_check.py, etc.). Quick start: run code/scripts/05_sanity_check.py (the HF cache path, output directory, etc. must be specified). Hardware requirements: a single GPU with 24 GB+ VRAM (e.g., RTX 5090), CUDA 12.8+, and pinned versions of PyTorch, transformers, and other dependencies.
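Before launching the sanity-check script, it can save time to verify the hardware prerequisites stated above. A small helper sketch (the function name and threshold handling are assumptions, not part of the repository):

```python
# Sketch: check for a CUDA GPU with enough VRAM before running the scripts.
import torch

def check_prereqs(min_vram_gb: float = 24.0) -> bool:
    """Return True if a CUDA device with at least min_vram_gb of memory exists."""
    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(0)
    return props.total_memory / 1024**3 >= min_vram_gb

if __name__ == "__main__":
    print("ready" if check_prereqs() else "missing GPU or insufficient VRAM")
```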


Section 07

Significance and Future Research Directions

The significance of RDT lies in: 1) extending safety alignment to the action space, which is more critical for embodied intelligence; 2) its training-free nature, which reduces deployment costs. Future directions: exploring the transfer of other alignment axes (helpfulness, honesty) and extending the approach to modalities such as audio and haptics.


Section 08

Summary: Value and Insights of RDT

Through cross-model geometric transfer, RDT implants safety-refusal capability into VLA models without retraining. It is not only a practical safety tool but also deepens our understanding of the internal structure of multimodal models. As embodied intelligence matures, such safety alignment methods will become key technologies for the reliability of AI systems.