# DELMAN: A Novel Approach to Dynamically Defend Large Language Models Against Jailbreaking Attacks Using Model Editing Techniques

> The DELMAN method proposed by Tsinghua University's team uses model editing techniques to dynamically defend against LLM jailbreaking attacks. Published in ACL 2025 Findings, it can effectively resist various jailbreaking attacks while maintaining the model's normal performance.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-12T02:55:19.000Z
- Last activity: 2026-05-12T02:59:19.547Z
- Popularity: 143.9
- Keywords: large language models, jailbreaking attacks, model editing, AI safety, ACL 2025, LLM defense, Model Editing, Jailbreaking, alignment techniques
- Page URL: https://www.zingnex.cn/en/forum/thread/delman
- Canonical: https://www.zingnex.cn/forum/thread/delman
- Markdown source: floors_fallback

---

## Introduction

The Tsinghua University team proposed the DELMAN method, which uses model editing technology to dynamically defend large language models (LLMs) against jailbreaking attacks. This work has been accepted by ACL 2025 Findings. DELMAN can effectively resist various jailbreaking attacks while maintaining the model's normal performance, providing a new path for LLM security defense.

## Research Background and Limitations of Traditional Defenses

As LLM capabilities improve, jailbreaking attacks (inducing harmful content generation through carefully designed prompts) have become an increasingly prominent problem. Traditional defenses have clear limitations: inference-time input filtering is easily bypassed by adversarial prompts; output detection only catches harmful content after it is generated; and safety-alignment retraining is costly.

## Overview and Technical Principles of the DELMAN Method

DELMAN (Dynamic Defense Against Large Language Model Jailbreaking with Model Editing) is a dynamic defense mechanism built on model editing, i.e., modifying specific knowledge-storage weights without retraining the entire model. Its key mechanisms are:

1. **Attack-pattern feature representation**: analyze the activation patterns of malicious inputs and compute the covariance statistics that distinguish them from benign inputs, forming a covariance ("cov") matrix.
2. **Dynamic knowledge editing**: apply reversible, context-dependent edits, optimized by building on the ROME/MEMIT algorithms, that inject correction vectors to redirect dangerous responses.
3. **Preservation of original capabilities**: constrain edits to specific weight subspaces so that general performance is unaffected.
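To make the editing step above concrete, here is a minimal sketch of the kind of covariance-whitened rank-one weight update that ROME-style editors (which DELMAN builds on) perform. This is an illustrative toy, not DELMAN's actual implementation: `W` stands in for an MLP down-projection, `k` for the key activation of a malicious prompt, `v` for a target activation that yields a safe response, and `C` for the key-covariance ("cov") matrix; all names and dimensions are hypothetical.

```python
import torch

def rank_one_edit(W, k_star, v_star, C):
    """ROME-style rank-one update: after the edit, W' @ k_star == v_star,
    while directions weakly correlated with k_star (under C) are minimally disturbed."""
    C_inv = torch.linalg.inv(C)
    u = C_inv @ k_star                      # edit direction, whitened by key statistics
    residual = v_star - W @ k_star          # gap between current and desired output
    return W + torch.outer(residual, u) / (k_star @ u)

torch.manual_seed(0)
d = 8
W = torch.randn(d, d)                       # toy stand-in for an MLP weight matrix
C = torch.eye(d) * 2.0                      # toy stand-in for the key covariance matrix
k = torch.randn(d)                          # key: activation pattern of a malicious input
v = torch.randn(d)                          # value: activation leading to a safe response

W_edited = rank_one_edit(W, k, v, C)
print(torch.allclose(W_edited @ k, v, atol=1e-4))
```

Because the update is a single outer product, it can also be subtracted back out, which matches the "reversible" property claimed for DELMAN's dynamic edits.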

## Experimental Evaluation and Effect Verification

DELMAN was verified on models such as Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct using the HarmBench benchmark, which covers several attack families (optimization-based GCG and AutoDAN, manual templates, code obfuscation). The results show that DELMAN significantly reduces the probability of harmful content generation while maintaining over 95% performance on benign tasks, balancing safety and usability.

## Implementation and Deployment Instructions

DELMAN has been open-sourced on GitHub. It relies on libraries such as PyTorch and Transformers and extends the MEMIT/BadEdit framework. Precomputed cov matrices are provided, but users are advised to recompute them on their own hardware for best results; some models, such as Llama 3.1, require configuration adjustments (e.g., modifying the offset parameter in repr_tools.py).
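The "recompute the cov matrix" step amounts to estimating second-moment statistics of key activations sampled from the model. The sketch below shows the generic MEMIT-style computation under stated assumptions: `key_covariance` is a hypothetical helper (not from the DELMAN repo), and the random `samples` tensor stands in for key activations collected by running reference text through the model.

```python
import torch

def key_covariance(activations):
    """Estimate the uncentered second-moment matrix E[k k^T] from sampled
    key activations; MEMIT-style editors use it to whiten weight updates."""
    # activations: (n_samples, hidden_dim)
    return activations.T @ activations / activations.shape[0]

torch.manual_seed(0)
samples = torch.randn(1000, 16)   # stand-in for keys harvested from reference text
C = key_covariance(samples)
print(C.shape)                     # torch.Size([16, 16])
```

In practice the samples would come from forward hooks on the edited MLP layer; since the statistics depend on the exact model weights and tokenizer, recomputing them for your own checkpoint (as the authors advise) is what keeps the whitening accurate.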

## Research Significance and Future Outlook

DELMAN provides a new paradigm for LLM security defense: security capabilities internalized in the model weights, faster response to new attacks, and stronger adversarial robustness. The follow-up work EVA has been accepted to IEEE TPAMI 2026, extending the application of model editing to safety alignment. This research demonstrates the value of turning model-interpretability insights into security applications and points to an important direction for LLM security architecture.
