Zing Forum

DELMAN: A Novel Approach to Dynamically Defend Large Language Models Against Jailbreaking Attacks Using Model Editing Techniques

The DELMAN method proposed by Tsinghua University's team uses model editing techniques to dynamically defend against LLM jailbreaking attacks. Published in ACL 2025 Findings, it can effectively resist various jailbreaking attacks while maintaining the model's normal performance.

Large Language Models · Jailbreaking Attacks · Model Editing · AI Security · ACL 2025 · LLM Defense · Alignment Techniques
Published 2026-05-12 10:55 · Recent activity 2026-05-12 10:59 · Estimated read 5 min

Section 01

DELMAN: A Novel Approach to Dynamically Defend LLMs Against Jailbreaking Attacks (Introduction)

The Tsinghua University team proposed the DELMAN method, which uses model editing technology to dynamically defend large language models (LLMs) against jailbreaking attacks. This work has been accepted by ACL 2025 Findings. DELMAN can effectively resist various jailbreaking attacks while maintaining the model's normal performance, providing a new path for LLM security defense.

Section 02

Research Background and Limitations of Traditional Defenses

As LLM capabilities improve, jailbreaking attacks (carefully crafted prompts that induce the model to generate harmful content) have become an increasingly prominent problem. Traditional defense methods have limitations: input filtering at inference time is easily bypassed by adversarial prompts, output detection only catches harmful content after it has been generated, and safety-alignment training is costly.

Section 03

Overview and Technical Principles of the DELMAN Method

DELMAN (Dynamic Defense Against Large Language Model Jailbreaking with Model Editing) is a dynamic defense mechanism built on model editing, i.e., modifying specific knowledge storage points without retraining the entire model. Its key mechanisms are:

1. Attack pattern feature representation: analyze the activation patterns of malicious inputs and compute covariance statistics that distinguish normal from malicious inputs, forming a covariance matrix.
2. Dynamic knowledge editing: reversible, context-dependent edits that build on the ROME/MEMIT algorithms, injecting correction vectors to alter dangerous responses.
3. Preservation of original capabilities: edits are constrained to specific subspaces so that general performance is not affected.
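To make the second mechanism concrete, ROME/MEMIT-style methods use a closed-form rank-one weight update: a key vector `k` (the activation for a target prompt) is remapped to a new value `v`, while the covariance matrix `C` of typical keys keeps the edit from disturbing unrelated inputs. The sketch below is an illustration of that update rule, not the authors' exact implementation; all variable names are illustrative.

```python
import numpy as np

def rank_one_edit(W, k, v, C):
    """ROME-style closed-form rank-one update.

    Returns W' = W + residual * d^T / (k . d) with d = C^{-1} k, so that
    W' @ k == v exactly, while the covariance-weighted direction d keeps
    the change concentrated on the target key.
    """
    d = np.linalg.solve(C, k)        # d = C^{-1} k
    residual = v - W @ k             # gap between current and desired output
    return W + np.outer(residual, d) / (k @ d)

rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))   # one MLP projection matrix (toy scale)
k = rng.normal(size=d_in)            # key: activation for the harmful prompt
v = rng.normal(size=d_out)           # value: desired safe output direction
K = rng.normal(size=(100, d_in))     # sample of typical keys
C = K.T @ K / len(K) + 1e-3 * np.eye(d_in)  # regularized second moment

W_new = rank_one_edit(W, k, v, C)
print(np.allclose(W_new @ k, v))     # the edited matrix maps k to v
```

Because the update is a stored rank-one delta, it can be subtracted back out, which is one way the "reversible" property of such edits can be realized.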

Section 04

Experimental Evaluation and Effect Verification

DELMAN was verified on models such as Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct using the HarmBench benchmark, which covers a range of attacks (optimization-based GCG/AutoDAN, manual templates, code obfuscation). The results show that DELMAN significantly reduces the probability of harmful content generation while retaining over 95% of performance on benign tasks, striking a balance between security and usability.
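The two headline metrics in such evaluations are attack success rate (ASR) and benign-task retention. A minimal sketch of how they are computed follows; the numbers are made up for illustration and are not results from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts that elicited harmful output."""
    return sum(outcomes) / len(outcomes)

def benign_retention(score_after, score_before):
    """Share of benign-task performance the defended model retains."""
    return score_after / score_before

# Hypothetical per-attack outcomes (True = jailbreak succeeded).
before = [True] * 62 + [False] * 38   # undefended model
after  = [True] * 3 + [False] * 97    # model with DELMAN-style edits

print(f"ASR: {attack_success_rate(before):.0%} -> "
      f"{attack_success_rate(after):.0%}")
print(f"Benign retention: {benign_retention(78.3, 80.1):.1%}")
```

A retention value above 0.95 corresponds to the "over 95% performance on benign tasks" claim above.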

Section 05

Implementation and Deployment Instructions

DELMAN has been open-sourced on GitHub. It depends on libraries such as PyTorch and Transformers and extends the MEMIT/BadEdit framework. Precomputed covariance matrices are provided, though users are advised to recompute them on their own hardware for best results; some models, such as Llama 3.1, require configuration adjustments (e.g., modifying the offset parameter in repr_tools.py).
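The precomputed covariance matrices are second-moment statistics of layer activations accumulated over many samples. As a rough sketch of what "recomputing" such a statistic involves (the batching scheme and damping value here are assumptions, and real usage would stream hidden states from the model rather than random data):

```python
import numpy as np

def estimate_cov(key_batches, d, damping=1e-2):
    """Accumulate a second-moment matrix C = E[k k^T] over streamed keys.

    key_batches yields (batch, d) arrays of layer key activations; the
    damping term adds a small ridge so C stays invertible when used in
    a closed-form edit.
    """
    C = np.zeros((d, d))
    n = 0
    for K in key_batches:
        C += K.T @ K
        n += len(K)
    return C / n + damping * np.eye(d)

# Stand-in for hidden states sampled from a corpus.
rng = np.random.default_rng(1)
batches = [rng.normal(size=(32, 16)) for _ in range(10)]
C = estimate_cov(batches, d=16)
print(np.allclose(C, C.T))  # a valid covariance estimate is symmetric
```

Recomputing on local hardware mainly matters because the statistic should reflect the exact model weights and precision used at edit time.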

Section 06

Research Significance and Future Outlook

DELMAN provides a new paradigm for LLM security defense: safety capabilities are internalized in the model itself, enabling faster response and stronger adversarial robustness. The follow-up work EVA has been accepted by IEEE TPAMI 2026, extending the application of model editing in safety alignment. This research demonstrates the value of turning model interpretability into security applications and points to an important direction for LLM security architecture.