Zing Forum

DELMAN: A Novel Approach to Dynamically Defend Large Language Models Against Jailbreaking Attacks Using Model Editing Techniques

The DELMAN method proposed by Tsinghua University's team uses model editing techniques to dynamically defend against LLM jailbreaking attacks. Published in ACL 2025 Findings, it can effectively resist various jailbreaking attacks while maintaining the model's normal performance.

Large Language Models · Jailbreaking Attacks · Model Editing · AI Security · ACL 2025 · LLM Defense · Alignment Techniques
Published 2026-05-12 10:55 · Recent activity 2026-05-12 10:59 · Estimated read 5 min

Section 01

DELMAN: A Novel Approach to Dynamically Defend LLMs Against Jailbreaking Attacks (Introduction)

The Tsinghua University team proposed the DELMAN method, which uses model editing technology to dynamically defend large language models (LLMs) against jailbreaking attacks. This work has been accepted by ACL 2025 Findings. DELMAN can effectively resist various jailbreaking attacks while maintaining the model's normal performance, providing a new path for LLM security defense.

Section 02

Research Background and Limitations of Traditional Defenses

As LLM capabilities improve, jailbreaking attacks (carefully crafted prompts that induce the model to generate harmful content) have become an increasingly prominent problem. Traditional defense methods have limitations: input filtering at inference time is easily bypassed by adversarial prompts, output detection only catches harmful content after it has been generated, and safety-alignment training is costly.

Section 03

Overview and Technical Principles of the DELMAN Method

DELMAN (Dynamic Defense Against Large Language Model Jailbreaking with Model Editing) is a dynamic defense mechanism built on model editing, i.e., modifying specific knowledge storage points without retraining the entire model. Its key mechanisms are:

1. Attack pattern feature representation: analyze the activation patterns of malicious inputs and compute covariance statistics that distinguish normal from malicious inputs, forming a covariance matrix.
2. Dynamic knowledge editing: reversible, context-dependent edits that build on the ROME/MEMIT algorithms, injecting correction vectors to alter dangerous responses.
3. Preservation of original capabilities: edits are constrained to specific subspaces so that general performance is not affected.
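To make the second mechanism concrete, ROME/MEMIT-style methods use a closed-form rank-one weight update: a key vector `k` (the activation for a target prompt) is remapped to a new value `v`, while the covariance matrix `C` of typical keys keeps the edit from disturbing unrelated inputs. The sketch below is an illustration of that update rule, not the authors' exact implementation; all variable names are illustrative.

```python
import numpy as np

def rank_one_edit(W, k, v, C):
    """ROME-style closed-form rank-one update.

    Returns W' = W + residual * d^T / (k . d) with d = C^{-1} k, so that
    W' @ k == v exactly, while the covariance-weighted direction d keeps
    the change concentrated on the target key.
    """
    d = np.linalg.solve(C, k)        # d = C^{-1} k
    residual = v - W @ k             # gap between current and desired output
    return W + np.outer(residual, d) / (k @ d)

rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))   # one MLP projection matrix (toy scale)
k = rng.normal(size=d_in)            # key: activation for the harmful prompt
v = rng.normal(size=d_out)           # value: desired safe output direction
K = rng.normal(size=(100, d_in))     # sample of typical keys
C = K.T @ K / len(K) + 1e-3 * np.eye(d_in)  # regularized second moment

W_new = rank_one_edit(W, k, v, C)
print(np.allclose(W_new @ k, v))     # the edited matrix maps k to v
```

Because the update is a stored rank-one delta, it can be subtracted back out, which is one way the "reversible" property of such edits can be realized.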

Section 04

Experimental Evaluation and Effect Verification

DELMAN was verified on models such as Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct using the HarmBench benchmark, which covers a range of attacks (optimization-based GCG/AutoDAN, manual templates, code obfuscation). The results show that DELMAN significantly reduces the probability of harmful content generation while retaining over 95% of performance on benign tasks, striking a balance between security and usability.
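The two headline metrics in such evaluations are attack success rate (ASR) and benign-task retention. A minimal sketch of how they are computed follows; the numbers are made up for illustration and are not results from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts that elicited harmful output."""
    return sum(outcomes) / len(outcomes)

def benign_retention(score_after, score_before):
    """Share of benign-task performance the defended model retains."""
    return score_after / score_before

# Hypothetical per-attack outcomes (True = jailbreak succeeded).
before = [True] * 62 + [False] * 38   # undefended model
after  = [True] * 3 + [False] * 97    # model with DELMAN-style edits

print(f"ASR: {attack_success_rate(before):.0%} -> "
      f"{attack_success_rate(after):.0%}")
print(f"Benign retention: {benign_retention(78.3, 80.1):.1%}")
```

A retention value above 0.95 corresponds to the "over 95% performance on benign tasks" claim above.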

Section 05

Implementation and Deployment Instructions

DELMAN has been open-sourced on GitHub. It depends on libraries such as PyTorch and Transformers and extends the MEMIT/BadEdit framework. Precomputed covariance matrices are provided, though users are advised to recompute them on their own hardware for best results; some models, such as Llama 3.1, require configuration adjustments (e.g., modifying the offset parameter in repr_tools.py).
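The precomputed covariance matrices are second-moment statistics of layer activations accumulated over many samples. As a rough sketch of what "recomputing" such a statistic involves (the batching scheme and damping value here are assumptions, and real usage would stream hidden states from the model rather than random data):

```python
import numpy as np

def estimate_cov(key_batches, d, damping=1e-2):
    """Accumulate a second-moment matrix C = E[k k^T] over streamed keys.

    key_batches yields (batch, d) arrays of layer key activations; the
    damping term adds a small ridge so C stays invertible when used in
    a closed-form edit.
    """
    C = np.zeros((d, d))
    n = 0
    for K in key_batches:
        C += K.T @ K
        n += len(K)
    return C / n + damping * np.eye(d)

# Stand-in for hidden states sampled from a corpus.
rng = np.random.default_rng(1)
batches = [rng.normal(size=(32, 16)) for _ in range(10)]
C = estimate_cov(batches, d=16)
print(np.allclose(C, C.T))  # a valid covariance estimate is symmetric
```

Recomputing on local hardware mainly matters because the statistic should reflect the exact model weights and precision used at edit time.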

Section 06

Research Significance and Future Outlook

DELMAN provides a new paradigm for LLM security defense: safety capabilities are internalized in the model itself, enabling faster response and stronger adversarial robustness. The follow-up work EVA has been accepted by IEEE TPAMI 2026, extending the application of model editing in safety alignment. This research demonstrates the value of turning model interpretability into security applications and points to an important direction for LLM security architecture.