RuleSHAP: Auditing Injection Behaviors in Large Language Models Using Global Rule Extraction Technology

RuleSHAP is a novel explainable AI method that combines SHAP values with rule extraction. It can detect and explain intentionally injected misleading behaviors in large language models, providing a practical tool for AI security auditing.

Tags: RuleSHAP, XAI, explainable AI, large language models, LLM auditing, SHAP, rule extraction, AI security, cognitive bias detection, KDD 2026
Published 2026-05-23 06:45 | Recent activity 2026-05-23 06:50 | Estimated read 7 min

Section 01

Introduction: RuleSHAP—An Explainable AI Tool for Auditing LLM Injection Behaviors

RuleSHAP is a novel explainable AI (XAI) method that combines SHAP values with rule extraction to detect and explain misleading behaviors intentionally injected into large language models (LLMs), giving AI security auditors a practical tool. The project accompanies a paper at the 2026 ACM SIGKDD conference. Its core innovation is to use SHAP feature attribution to guide rule extraction, so that feature interaction effects are captured and expressed as human-understandable rules.

Section 02

Background: Interpretability Challenges of Large Language Models

As LLMs are deployed across more scenarios, the reliability and security of their generated content have become pressing concerns. Traditional global explainability (global XAI) methods are designed for structured numerical data and are hard to apply directly to natural-language inputs and outputs. As a result, auditors lack effective means of understanding a model's decision logic when checking LLMs for injected behavior patterns. The gap matters most in areas such as the United Nations Sustainable Development Goals (SDGs), where identifying and mitigating cognitive biases is crucial.

Section 03

Technical Approach of RuleSHAP

Project Overview

RuleSHAP was developed by Francesco Sovrano. The project provides a complete experimental workflow and toolchain for evaluating how well global XAI methods detect LLM injection behaviors.

Technical Implementation Path

The workflow maps text to ordinal features in three steps:

  1. Build a set of topics around the SDGs and score each topic along several dimensions (prevalence, positivity, etc.);
  2. Inject controlled behaviors at different difficulty levels;
  3. Extract output metrics (explanation length, subjectivity, etc.).
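
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the three-level scale, the `Topic` class, the injected instruction, and the tiny subjectivity lexicon are all hypothetical, and a real run would send the prompt to an LLM and measure its answer.

```python
from dataclasses import dataclass

# Assumed three-level ordinal scale; the project's real scoring may differ.
LEVELS = {"low": 0, "medium": 1, "high": 2}

@dataclass(frozen=True)
class Topic:
    """Hypothetical topic record scored on two of the dimensions named above."""
    name: str
    prevalence: str
    positivity: str

def to_ordinal(topic: Topic) -> dict[str, int]:
    """Step 1: turn the textual scores of a topic into ordinal features."""
    return {
        "prevalence": LEVELS[topic.prevalence],
        "positivity": LEVELS[topic.positivity],
    }

def build_prompt(topic: Topic, inject: bool) -> str:
    """Step 2: optionally inject a behavior that depends on the features."""
    prompt = f"Explain the topic: {topic.name}."
    features = to_ordinal(topic)
    # Hypothetical injection that fires only on a two-feature combination.
    if inject and features["prevalence"] == 2 and features["positivity"] == 0:
        prompt += " Use a strongly opinionated tone."
    return prompt

# Hypothetical lexicon used as a crude stand-in for a subjectivity score.
SUBJECTIVE_WORDS = {"clearly", "obviously", "terrible", "wonderful", "should"}

def output_metrics(answer: str) -> dict[str, float]:
    """Step 3: reduce a free-text answer to numeric output metrics."""
    words = [w.strip(".,;:!?").lower() for w in answer.split()]
    subjective = sum(w in SUBJECTIVE_WORDS for w in words)
    return {
        "explanation_length": len(words),
        "subjectivity": subjective / len(words) if words else 0.0,
    }
```

Pairing each topic's ordinal features with the metrics of the model's answer yields the kind of tabular data that global XAI methods expect.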

Core Mechanism

RuleSHAP combines SHAP-guided feature attribution with rule extraction: it first computes SHAP values for each feature, then uses that attribution information as weights when extracting global rules, which lets the rules capture feature interaction effects. Compared with baselines such as pure SHAP, decision trees, RuleFit, and GELPE, its advantages include interpretable rules, handling of complex interactions, and reduced overfitting.
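
To make the two-stage idea concrete, here is a toy sketch in plain Python. It is not the RuleSHAP algorithm itself: the `model` function is a hypothetical stand-in for a measured LLM output with an injected two-feature interaction, Shapley values are computed exactly by enumeration rather than with the SHAP library, and the rule search is a simple brute-force pass over threshold conditions on the most important features.

```python
from itertools import combinations, product
from math import factorial

def model(x):
    # Hypothetical stand-in for an output metric with an injected interaction:
    # it fires only when feature 0 is high AND feature 1 is low.
    return 1.0 if x[0] >= 2 and x[1] <= 0 else 0.0

def shapley_values(f, x, background):
    """Exact Shapley values of f at x, averaging absent features over background."""
    n = len(x)

    def value(subset):
        total = 0.0
        for b in background:
            z = tuple(x[j] if j in subset else b[j] for j in range(n))
            total += f(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                contrib += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(contrib)
    return phi

def global_importance(f, data):
    """Mean absolute Shapley value per feature over the whole dataset."""
    sums = [0.0] * len(data[0])
    for x in data:
        for i, p in enumerate(shapley_values(f, x, data)):
            sums[i] += abs(p)
    return [s / len(data) for s in sums]

def holds(cond, x):
    i, op, t = cond
    return x[i] >= t if op == ">=" else x[i] <= t

def extract_rule(f, data, importance, top_k=2):
    """Search conjunctions over the top_k features; keep the most faithful one."""
    top = sorted(range(len(importance)), key=lambda i: -importance[i])[:top_k]
    levels = sorted({v for x in data for v in x})
    conditions = [(i, op, t) for i in top for op in (">=", "<=") for t in levels]
    best_rule, best_fidelity = None, -1.0
    for size in range(1, top_k + 1):
        for rule in combinations(conditions, size):
            agree = sum(all(holds(c, x) for c in rule) == (f(x) > 0.5) for x in data)
            fidelity = agree / len(data)
            if fidelity > best_fidelity:
                best_rule, best_fidelity = rule, fidelity
    return best_rule, best_fidelity

# All combinations of three ordinal features with levels 0, 1, 2.
data = list(product(range(3), repeat=3))
importance = global_importance(model, data)
rule, fidelity = extract_rule(model, data, importance)
```

In this toy setting the attribution step gives the unused third feature zero importance, so the rule search only has to consider the two features that actually interact.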

Section 04

Experimental Evaluation and Comparison

The project's evaluation framework uses metrics that include the reciprocal rank of rule matches, rule fidelity, and statistical significance tests. The experimental results show that RuleSHAP consistently outperforms traditional global XAI methods. The advantage is most pronounced for non-univariate injection behaviors, i.e. complex patterns that can only be identified from a combination of several features.
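
As a rough illustration of the first two metrics, the sketch below uses common textbook definitions; the project's exact formulas may differ. It assumes that a rule "matches" the injected behavior when it contains exactly the same conditions, and that fidelity is the share of samples on which a rule agrees with the observed model behavior. The function names and the rule encoding are hypothetical.

```python
def reciprocal_rank(ranked_rules, injected_rule):
    """Return 1/rank of the first extracted rule that matches the injected one.

    Rules are lists of (feature, operator, threshold) conditions. A match is
    assumed to mean the same set of conditions, regardless of order.
    """
    target = frozenset(injected_rule)
    for rank, rule in enumerate(ranked_rules, start=1):
        if frozenset(rule) == target:
            return 1.0 / rank
    return 0.0  # the injected rule was not recovered at all

def rule_fidelity(rule_fires, samples, model_flags):
    """Fraction of samples where the rule agrees with the observed behavior.

    rule_fires is a predicate over a sample; model_flags holds one boolean
    per sample saying whether the model showed the behavior.
    """
    agree = sum(rule_fires(x) == flag for x, flag in zip(samples, model_flags))
    return agree / len(samples)
```

A method that ranks the injected rule first scores 1.0, one that ranks it second scores 0.5, and one that never recovers it scores 0.0, which makes methods easy to compare across many injected behaviors.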

Section 05

Practical Application Scenarios

RuleSHAP has application value in multiple scenarios:

  • Model Security Auditing: Before deployment, detect whether an LLM carries injected biases or misleading behaviors; suited to high-risk fields such as finance and healthcare;
  • Red Team Testing: Security personnel test model robustness and identify attack vectors;
  • Model Improvement: Guide the optimization of training data or fine-tuning strategies through extracted rules;
  • Regulatory Compliance: Provide auditable and explainable methods to prove that models comply with regulations.

Section 06

Limitations and Future Directions

Limitations

The current implementation mainly targets the SDG domain, so its generalization to other domains remains to be verified. The experiments are also computationally expensive and require substantial resources.

Future Directions

Future work includes expanding topic coverage, improving computational efficiency, developing real-time detection capabilities, and applying the method to multi-modal models.

Section 07

Conclusion

RuleSHAP is an important advance in explainable AI and provides a powerful tool for understanding and auditing LLM behaviors. As AI systems grow more complex and more widely deployed, its ability to reveal the internal mechanisms of models has practical value, and it merits attention from researchers and practitioners in AI security, interpretability, and responsible AI development.