Zing Forum

Hybrid Malware Detection System Integrating Traditional Machine Learning and Large Language Models

This article presents a hybrid malware detection system that combines multi-dimensional feature engineering (TF-IDF, statistical features, and BERT embeddings) with explainable AI, aiming for both high accuracy and interpretability.

Tags: Malware detection, Machine learning, BERT, TF-IDF, SHAP, Explainable AI, Cybersecurity, API call sequences, Feature engineering
Published 2026-05-12 16:19 | Recent activity 2026-05-12 17:23 | Estimated read 7 min

Section 01

[Introduction] Hybrid Malware Detection System Integrating Traditional ML and Large Language Models

This article introduces a hybrid malware detection system. Its core is multi-dimensional feature engineering that combines traditional machine learning features (TF-IDF and statistical features) with embeddings from a pretrained language model (BERT), paired with SHAP for explainability. The goal is to address the weaknesses of signature-based detection against zero-day attacks and polymorphic malware while making detection results easier to interpret.

Section 02

Background: New Threats and Pain Points in Cybersecurity

As digital transformation accelerates, cybersecurity threats have grown far more complex. Traditional signature-based detection struggles with zero-day attacks and polymorphic malware, while enterprise security operations centers (SOCs) face a flood of alerts and a shortage of security talent. Using AI to improve detection accuracy, efficiency, and interpretability has therefore become an important industry concern.

Section 03

Methodology: System Architecture Design with Multi-Dimensional Feature Fusion

Core Design Philosophy

The system adopts a "multi-dimensional feature fusion" strategy that uses three complementary feature types at once:

  1. Lexical-level features (TF-IDF): capture statistical patterns in API call sequences, using unigrams and bigrams together;
  2. Behavioral statistical features: quantify a program's overall behavior (API sequence length, number of unique APIs, frequency of file/registry/network operations, etc.);
  3. Semantic embedding features (BERT): capture the contextual semantics of API calls through a SentenceTransformer model.
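
The first two feature types can be sketched with scikit-learn and NumPy. This is a minimal illustration under stated assumptions, not the project's actual code: the toy sequences, the API groupings (FILE_CALLS, REGISTRY_PREFIXES, NETWORK_CALLS), and the helper behaviour_stats are all hypothetical. The BERT embeddings are left out here because they require downloading a model.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy API call sequences written as space-separated text (illustrative only).
docs = [
    "NtCreateFile NtWriteFile RegSetValueExW connect send",
    "LdrLoadDll NtCreateFile NtReadFile NtClose",
    "RegOpenKeyExW RegQueryValueExW NtClose",
]

# 1. Lexical features: TF-IDF over unigrams and bigrams of API names.
#    token_pattern keeps each API name as one token; lowercase=False keeps case.
tfidf = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"\S+", lowercase=False)
X_tfidf = tfidf.fit_transform(docs)

# 2. Behavioural statistics. These groupings are hypothetical examples.
FILE_CALLS = ("NtCreateFile", "NtWriteFile", "NtReadFile")
REGISTRY_PREFIXES = ("Reg",)
NETWORK_CALLS = {"connect", "send", "recv"}


def behaviour_stats(doc: str) -> list:
    """Return [length, unique APIs, file ops, registry ops, network ops]."""
    calls = doc.split()
    return [
        len(calls),
        len(set(calls)),
        sum(c.startswith(FILE_CALLS) for c in calls),
        sum(c.startswith(REGISTRY_PREFIXES) for c in calls),
        sum(c in NETWORK_CALLS for c in calls),
    ]


X_stats = np.array([behaviour_stats(d) for d in docs], dtype=float)

# Fuse the two feature blocks by column-wise concatenation.
X = hstack([X_tfidf, csr_matrix(X_stats)]).tocsr()
print(X_tfidf.shape, X_stats.shape, X.shape)
```

Because the statistical columns are raw counts while TF-IDF values are normalized, scaling the statistics before fusion may be worth considering for scale-sensitive models such as logistic regression.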

Data Processing Flow

API call sequence → Data preprocessing → Feature engineering → Hybrid feature fusion → Machine learning model → Evaluation metrics → SHAP interpretability analysis

The preprocessing stage cleans the data, handles missing values, and converts API sequences into text form.
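
A minimal sketch of that preprocessing step, using only the standard library. The function names and the rule that the literal string "nan" counts as a missing value are assumptions made for illustration; the original project may clean its data differently.

```python
import re
from typing import Iterable, List, Optional


def clean_api_sequence(calls: Iterable[Optional[str]]) -> List[str]:
    """Drop missing or empty entries and strip whitespace from API names."""
    cleaned = []
    for call in calls:
        if call is None:
            continue
        name = re.sub(r"\s+", "", str(call))
        # Empty strings and the literal "nan" (a common CSV artefact) are
        # treated as missing values here; this rule is an assumption.
        if not name or name.lower() == "nan":
            continue
        cleaned.append(name)
    return cleaned


def sequence_to_text(calls: Iterable[Optional[str]]) -> str:
    """Join a cleaned API sequence into one space-separated 'document'."""
    return " ".join(clean_api_sequence(calls))


raw = ["LdrLoadDll", None, " RegOpenKeyExW ", "nan", "", "NtCreateFile"]
text = sequence_to_text(raw)
print(text)  # LdrLoadDll RegOpenKeyExW NtCreateFile
```

Turning each sequence into a single string lets text-oriented tools (a TF-IDF vectorizer or a sentence encoder) consume API traces without further adaptation.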

Section 04

Evidence: Dataset, Experimental Setup, and Performance Results

Dataset

The system uses the MalBehavD-V1 dataset (2,570 samples: 1,285 malicious and 1,285 benign), which records Windows API call sequences covering a range of behavior types.

Experimental Setup

Five classifiers are evaluated: Random Forest, XGBoost, LightGBM, Logistic Regression, and Naive Bayes.
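
A comparison of this kind might be set up roughly as follows. The sketch makes several assumptions: it runs on synthetic data from make_classification rather than MalBehavD-V1, it uses 5-fold cross-validation (the source does not state the evaluation protocol), it picks GaussianNB as the Naive Bayes variant, and it treats XGBoost and LightGBM as optional so the script still runs without them. The printed scores say nothing about the results reported in the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the fused feature matrix (NOT MalBehavD-V1 data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),  # variant chosen for illustration
}

# XGBoost and LightGBM are optional here; skip them if they cannot be loaded
# (missing package or missing native library).
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(n_estimators=100, random_state=0)
except Exception:
    pass
try:
    from lightgbm import LGBMClassifier
    models["LightGBM"] = LGBMClassifier(n_estimators=100, random_state=0, verbose=-1)
except Exception:
    pass

# Mean accuracy over 5 stratified folds for each classifier.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {acc:.3f}")
```

Evaluating every model on the same folds and the same feature matrix keeps the comparison fair; swapping X for a TF-IDF-only matrix would give the baseline against which the fused features are judged.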

Performance Results

Models that combine all three feature types outperform the baseline that uses only TF-IDF, with higher accuracy, a lower false positive rate, and better generalization. The ensemble methods (Random Forest, XGBoost, LightGBM) perform best: LightGBM balances training speed and performance, while XGBoost achieves the highest accuracy.
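
For reference, the metrics named above can be computed from a confusion matrix as in the sketch below. The labels are made up for illustration (1 = malicious, 0 = benign) and are not results from the article; the function name detection_metrics is hypothetical.

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, false positive rate, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # malware caught
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign passed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # malware missed
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 miss.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.75, 'false_positive_rate': 0.25, 'recall': 0.75}
```

The false positive rate matters in practice because every benign program flagged as malicious becomes an alert that an analyst has to triage.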

Section 05

Interpretability: SHAP Technology Enhances Model Trust

Necessity

Security work requires interpretable models so that analysts can understand the basis for each judgment and reason about possible false positives and false negatives with confidence.

SHAP Application

SHAP builds on the Shapley value from cooperative game theory and computes each feature's marginal contribution to a prediction:

  • Global feature importance: Identify key features of the model as a whole;
  • Local prediction explanation: Explain the judgment basis for individual samples, allowing analysts to know "which API patterns lead to malicious determination".
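
To make the idea of a marginal contribution concrete, the sketch below computes exact Shapley values by brute force for a tiny, hypothetical scoring function. It does not use the shap library, which relies on faster model-specific algorithms; the function toy_score, its weights, and the all-zero baseline are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of predict at x relative to a baseline input."""
    n = len(x)

    def value(subset: frozenset) -> float:
        # Features in `subset` take the sample's value; the rest stay at baseline.
        return predict([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of feature i when added to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_score(z: List[float]) -> float:
    """Hypothetical 'maliciousness' score over three behaviour counts."""
    file_ops, registry_ops, network_ops = z
    return 0.2 * file_ops + 0.5 * registry_ops + 0.1 * file_ops * network_ops


x = [4.0, 2.0, 3.0]          # one sample: file, registry, network op counts
baseline = [0.0, 0.0, 0.0]   # reference input
phi = shapley_values(toy_score, x, baseline)
print([round(p, 3) for p in phi])  # [1.4, 1.0, 0.6]
```

The three contributions sum to the difference between the score for the sample and the score for the baseline (3.0 here), and the interaction between file and network operations is split evenly between those two features. This is a local explanation for a single sample; averaging absolute contributions over many samples gives a global importance ranking.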

Section 06

Deployment Recommendations and Future Research Directions

Technology Stack

The system is built on the Python ecosystem: pandas and numpy (data processing), scikit-learn/xgboost/lightgbm (ML), sentence-transformers (BERT), and shap (interpretability).

Deployment Optimization

  • Model persistence (joblib);
  • Batch extraction of BERT embeddings;
  • Incremental learning to adapt to new variants.
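
The first two points might look roughly like this. The sketch is written under assumptions: embed_in_batches and HashingEncoder are hypothetical names, the stand-in encoder only exists so the example runs offline and is not a BERT model, and the model name in the comment is an example rather than the one the project uses. Incremental learning is not shown.

```python
import os
import tempfile
from typing import Sequence

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression


def embed_in_batches(encoder, texts: Sequence[str], batch_size: int = 32) -> np.ndarray:
    """Encode texts chunk by chunk; `encoder` needs an encode(list_of_str) method."""
    if not texts:
        raise ValueError("texts must not be empty")
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        chunks.append(np.asarray(encoder.encode(batch)))
    return np.vstack(chunks)


class HashingEncoder:
    """Offline stand-in with the same encode() shape contract; NOT a BERT model."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def encode(self, sentences):
        out = np.zeros((len(sentences), self.dim))
        for row, sentence in enumerate(sentences):
            for token in sentence.split():
                out[row, sum(map(ord, token)) % self.dim] += 1.0
        return out


# With sentence-transformers installed, a real encoder could be swapped in:
#   from sentence_transformers import SentenceTransformer
#   encoder = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
encoder = HashingEncoder()

texts = [
    "NtCreateFile NtWriteFile connect send",
    "RegSetValueExW NtWriteFile connect",
    "LdrLoadDll NtReadFile NtClose",
    "RegOpenKeyExW RegQueryValueExW NtClose",
]
labels = [1, 1, 0, 0]  # made-up labels: 1 = malicious, 0 = benign

X = embed_in_batches(encoder, texts, batch_size=2)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Persist the fitted model with joblib and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "detector.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

same = np.array_equal(clf.predict(X), restored.predict(X))
print(X.shape, same)
```

Batching bounds memory use when embedding many sequences. SentenceTransformer.encode also accepts its own batch_size argument, so an outer loop like this is mainly useful when embeddings are streamed or cached chunk by chunk. Files written by joblib should only be loaded from trusted sources, since loading them can execute arbitrary code.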

Limitations and Future

Limitations: static analysis is of limited use against API obfuscation and dynamic loading; the dataset is small; and BERT's computational overhead hurts real-time performance.

Future directions: try LSTM/Transformer sequence models, integrate sandbox-based dynamic analysis, improve adversarial robustness, and explore multi-modal fusion (static file features plus network traffic features).

Section 07

Conclusion: A Feasible Path for AI-Driven Security Detection

This open-source project shows one way to combine traditional ML with deep learning. Using multi-dimensional feature fusion and explainable AI, it improves on a TF-IDF-only baseline for malware detection. Its core ideas (complementary feature fusion and an emphasis on interpretability) can extend to other security tasks such as intrusion detection and phishing identification. Accurate, interpretable AI systems are likely to become a key part of defensive tooling because they can earn the trust of security practitioners and deliver practical value.