Zing Forum

XAI-driven Voice Deepfake Detection: Interpretable Analysis via Multimodal Large Language Model Generation

This article introduces a voice deepfake detection method that combines XAI (Explainable Artificial Intelligence) with a multimodal large language model. The method visualizes what a detection model attends to and converts those visualizations into natural language explanations, making the detection results interpretable. It requires no additional training and can be applied directly to existing voice deepfake detection models.

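Tags: voice deepfake detection, explainable artificial intelligence, multimodal large language models, XAI, voice security, model interpretability, deep learning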
Published 2026-06-12 12:27 | Recent activity 2026-06-12 12:51 | Estimated read 6 min

Section 01

[Introduction] XAI-driven Voice Deepfake Detection: Interpretable Analysis via Multimodal Large Language Model Generation

This project, developed by the GLAM Lab at Imperial College London, proposes a voice deepfake detection method based on XAI (Explainable Artificial Intelligence) and multimodal large language models. Its core innovation is that it wraps existing detection models without any additional training. Feature visualization and natural language explanations address the "black box" problem of traditional detectors, making their results more credible and more useful in practice.

Section 02

Background: Threats of Voice Deepfakes and Black Box Dilemma of Existing Detection Models

The rapid development of generative AI has made voice deepfakes a serious security threat (e.g., fraudulent calls, fabricated speeches). Existing detection systems output a verdict but cannot explain what it is based on, which creates three problems: security analysts struggle to understand decisions, ordinary users distrust the results, and researchers find it hard to detect model biases. An interpretable detection system is therefore crucial.

Section 03

Technical Architecture: Three-Stage Seamless Interpretable Pipeline

The system adopts a "Prediction-Visualization-Explanation" pipeline:

  1. Prediction and Feature Extraction: Works with detectors built on mainstream speech front ends such as Wav2Vec 2.0, HuBERT, and WavLM, as well as OpenSMILE acoustic features, and extracts the key features;
  2. XAI Visualization: Generates heatmaps and attribution maps with Integrated Gradients, Saliency Maps, LIME, and SHAP, marking the time spans and frequency bands that look suspicious;
  3. Multimodal Explanation Generation: Uses Qwen2.5-VL-7B-Instruct to read the visualizations and generate a natural language explanation that cites the specific anomalies.
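
To make the visualization stage more concrete, here is a minimal sketch of Integrated Gradients, one of the attribution methods listed in step 2. It is not the project's code. The "detector" is a hypothetical four-input logistic model (`WEIGHTS`, `BIAS`, and `spoof_score` are invented for illustration) standing in for a real network, and the gradient is written by hand so the example needs only the standard library. A real pipeline would obtain gradients from the model through automatic differentiation.

```python
import math

# Hypothetical stand-in for a detector: a logistic score over four
# spectrogram bins. These weights are made up for illustration only.
WEIGHTS = [0.2, -0.1, 1.5, 0.05]
BIAS = -0.3


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def spoof_score(x):
    """Toy probability that the input is spoofed."""
    return sigmoid(sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS)


def spoof_score_grad(x):
    """Analytic gradient of spoof_score with respect to each input."""
    s = spoof_score(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x,
    then scale by (x - baseline). Uses a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        grad = spoof_score_grad(point)
        total = [t + g for t, g in zip(total, grad)]
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, total)]


features = [1.0, 0.5, 2.0, 1.0]   # toy input, not real audio features
baseline = [0.0, 0.0, 0.0, 0.0]   # all-zero reference input
attributions = integrated_gradients(features, baseline)

# The bin with the largest absolute attribution is the one a heatmap
# would highlight most strongly.
top_bin = max(range(len(attributions)), key=lambda i: abs(attributions[i]))
print(top_bin, attributions)
```

A useful sanity check for any Integrated Gradients implementation is the completeness property: the attributions should sum to approximately `spoof_score(features) - spoof_score(baseline)`.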

Section 04

Experiments: Single-Model and Multi-Model Fusion Effects on the PartialSpoof Dataset

The experiments use the PartialSpoof anti-spoofing dataset, with dependencies managed in a Conda environment. The results show:

  • Single-Model XAI: Locates anomalies but carries method-specific biases (e.g., Saliency over-focuses on high-frequency noise);
  • Three-Model Fusion: Combining XAI evidence from Wav2Vec, HuBERT, and WavLM yields more comprehensive and accurate explanations, because the models contribute complementary features, cross-check one another, and suppress individual errors. The fused explanations can name specific anomalies (e.g., high-frequency artifacts at 1.5 seconds, pitch anomalies between 4.5 and 5.0 seconds).
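
The article does not describe how the three models' evidence is actually combined, so the following is only one plausible sketch of the "consistent verification, error suppression" idea: normalize each model's per-frame attribution scores, flag a frame only when at least two models agree, and convert the flagged frames into time spans. The scores, the 0.5-second frame hop, the threshold, and the function names are all assumptions chosen for illustration.

```python
def normalize(scores):
    """Min-max scale one model's per-frame scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(maps, threshold=0.6, min_votes=2):
    """Flag a frame when at least min_votes models score it above threshold."""
    normed = [normalize(m) for m in maps.values()]
    n_frames = len(normed[0])
    flagged = []
    for t in range(n_frames):
        votes = sum(1 for m in normed if m[t] >= threshold)
        flagged.append(votes >= min_votes)
    return flagged


def to_spans(flagged, hop_seconds):
    """Merge consecutive flagged frames into (start, end) spans in seconds."""
    spans, start = [], None
    for t, flag in enumerate(flagged):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            spans.append((start * hop_seconds, t * hop_seconds))
            start = None
    if start is not None:
        spans.append((start * hop_seconds, len(flagged) * hop_seconds))
    return spans


# Made-up per-frame attribution magnitudes for a 6-second clip
# (12 frames, assumed hop of 0.5 s). Frame 6 is a HuBERT-only false alarm.
toy_maps = {
    "wav2vec2": [0.1, 0.1, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1],
    "hubert":   [0.0, 0.1, 0.1, 0.7, 0.1, 0.1, 0.9, 0.1, 0.1, 1.0, 0.2, 0.1],
    "wavlm":    [0.2, 0.2, 0.2, 1.0, 0.2, 0.2, 0.2, 0.2, 0.2, 0.9, 0.2, 0.2],
}

flagged = fuse(toy_maps)
spans = to_spans(flagged, hop_seconds=0.5)
print(spans)
```

In this toy data the spike that only one model reports is discarded, while the two regions all three models agree on survive. Spans like these are the kind of evidence a language model could then be asked to describe in words.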

Section 05

Application Value: Empowering Multiple Scenarios from Security Auditing to Forensic Investigation

The practical value of the project is reflected in:

  • Security Auditing: Helps teams understand decisions and identify attack patterns;
  • User Trust: Shows detection basis to users, enhancing trust;
  • Model Improvement: Discovers misjudgment patterns through explanations to optimize detection models;
  • Forensic Investigation: Structured explanations can supplement expert testimony.

Section 06

Limitations and Outlook: Computational Overhead, Language Restrictions, and Optimization Directions

Current limitations:

  • High computational resource requirements (multiple XAI methods + multimodal models);
  • Mainly supports English voice;
  • The pipeline is not yet fast enough for real-time use.

Future directions: Optimize explanation speed, expand multilingual support, and explore lightweight multimodal models.

Section 07

Conclusion: The AI Explains AI Paradigm Provides New Ideas for Trustworthy AI Systems

This project demonstrates that multimodal large language models can be used to explain the decisions of other AI systems, and that the "AI explains AI" paradigm offers a new path toward interpretable black-box models. For developers and security practitioners deploying detection systems, the tool supplies a credible basis for each decision, which is a key element of building trustworthy AI in the era of deepfakes.