Zing Forum


A PDF Malware Detection Framework Integrating Graph Neural Networks and Large Language Models

This article introduces the GNN-LLM-PDF-Malware project, which combines graph neural networks (GNN) and large language models (LLM) to classify PDF malware by family, identify subfamilies, and analyze behavior, giving security teams a multi-layered threat detection approach.

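Tags: graph neural networks, GNN, large language models, LLM, PDF malware, cybersecurity, malware detection, deep learning, threat intelligence
Published 2026-05-09 16:52 | Recent activity 2026-05-09 17:00 | Estimated read 7 min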

Section 01

Introduction: Core Overview of the PDF Malware Detection Framework Integrating GNN and LLM

The GNN-LLM-PDF-Malware project on GitHub combines graph neural networks (GNN) and large language models (LLM) to go beyond binary malicious/benign detection. It classifies PDF malware by family, identifies subfamilies, and analyzes behavior, which gives security teams a multi-layered view of each threat.


Section 02

Challenges and Requirements for PDF Malware Detection

The PDF format has a complex structure (JS code, embedded files, action scripts, and so on), which has made it a common carrier for malware. Traditional signature-based detection struggles to keep up with evolving attacks, and simple machine learning models cannot capture the structured information inside a document. A malicious/benign verdict alone is also insufficient: security analysts need deeper intelligence, such as the malware family, its behavior, and the vulnerabilities it exploits, to formulate defense strategies.


Section 03

Core Innovation of the Framework: Organic Integration of GNN and LLM

Application of GNN

The internal structure of a PDF lends itself to graph representation: elements such as objects and JS functions can be modeled as nodes, and the object references and call chains between them as edges. GNNs are designed for this kind of non-Euclidean data. They capture local structural features, propagate information between neighboring nodes, and can identify abnormal subgraphs. The corresponding code file is Feature_Extraction_GNN.py.
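
The graph modeling described above can be illustrated with a minimal sketch. It is not the project's Feature_Extraction_GNN.py; it assumes uncompressed indirect objects and uses plain regular expressions, so object streams, encrypted files, and binary stream data that happens to contain matching byte sequences are out of scope. The function name `build_object_graph` and the `SUSPICIOUS_KEYS` list are illustrative choices, not names from the project.

```python
import re

# Indirect object definition, e.g. b"12 0 obj ... endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
# Indirect reference, e.g. b"7 0 R".
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")
# Dictionary keys often associated with active content (illustrative list).
SUSPICIOUS_KEYS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch", b"/EmbeddedFile")


def build_object_graph(pdf_bytes: bytes):
    """Return (nodes, edges) for the indirect objects found in pdf_bytes.

    nodes maps an object number to the flagged keys present in its body;
    edges is a sorted list of (source object, referenced object) pairs.
    """
    nodes = {}
    edges = set()
    for match in OBJ_RE.finditer(pdf_bytes):
        obj_num = int(match.group(1))
        body = match.group(3)
        nodes[obj_num] = [key.decode() for key in SUSPICIOUS_KEYS if key in body]
        for ref in REF_RE.finditer(body):
            edges.add((obj_num, int(ref.group(1))))
    return nodes, sorted(edges)


# A tiny hand-written PDF fragment, for demonstration only.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 3 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"3 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
nodes, edges = build_object_graph(sample)
print(nodes)  # {1: ['/OpenAction'], 2: [], 3: ['/JavaScript', '/JS']}
print(edges)  # [(1, 2), (1, 3)]
```

The flagged keys act as simple node attributes, and the edge list is the structure a GNN would propagate information over. A production parser would need to decode streams and follow the cross-reference table instead of scanning raw bytes.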

Enhancement by LLM

LLMs are good at understanding text content such as JS code and metadata. They can analyze code semantics, reason over context, and generate behavior descriptions. The corresponding code includes the Finetune_LLM directory and LLM_evaluate.py.
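
To make the LLM side more concrete, here is a hedged sketch of how extracted text might be packed into a prompt. The template wording, the function names `truncate_middle` and `build_prompt`, and the character budget (a rough stand-in for a token budget) are all assumptions for illustration. The project's actual prompts are not shown in this article.

```python
# Hypothetical prompt template, not the project's actual prompt.
PROMPT_TEMPLATE = (
    "You are a malware analyst. Given features extracted from a PDF file,\n"
    "describe the likely behavior of the document.\n"
    "\n"
    "Metadata:\n"
    "{metadata}\n"
    "\n"
    "Embedded JavaScript (possibly truncated):\n"
    "{javascript}\n"
)

TRUNCATION_MARKER = "\n/* ... truncated ... */\n"


def truncate_middle(text: str, max_chars: int) -> str:
    """Keep the head and tail of long text and drop the middle.

    If max_chars is smaller than the marker itself, only the marker is returned.
    """
    if len(text) <= max_chars:
        return text
    keep = max(max_chars - len(TRUNCATION_MARKER), 0)
    head = keep - keep // 2  # head gets the extra character when keep is odd
    tail = keep // 2
    return text[:head] + TRUNCATION_MARKER + (text[-tail:] if tail else "")


def build_prompt(metadata: dict, javascript: str, max_js_chars: int = 2000) -> str:
    """Fill the template with sorted metadata lines and size-limited JS."""
    meta_lines = "\n".join(f"- {key}: {value}" for key, value in sorted(metadata.items()))
    return PROMPT_TEMPLATE.format(
        metadata=meta_lines or "- (none)",
        javascript=truncate_middle(javascript, max_js_chars),
    )


print(build_prompt({"Producer": "unknown"}, "app.alert(1);"))
```

Keeping both the beginning and the end of a script is one simple way to handle long input, since setup code and the final payload call often sit at opposite ends. Chunking or summarizing are alternatives.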


Section 04

Analysis of the Three-Stage Detection Process

The framework adopts a phased strategy:

  1. Stage 1 & 2: GNN Feature Extraction: Feature_Extraction_GNN (Stage1 & 2).py parses the PDF into a graph structure, and the GNN learns node representations that encode both structural and contextual information.
  2. Stage 3: LLM Evaluation and Analysis: LLM_evaluate (Stage 3).py outputs the classification results and generates natural language descriptions of the malware's behavior to help analysts understand the threat.
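
The phased flow can be sketched as a chain of three calls. This sketch assumes that Stage 1 predicts the family, Stage 2 predicts the subfamily within that family, and Stage 3 produces the behavior description. The article does not spell out this mapping, and `Verdict`, `run_pipeline`, and the stand-in functions and labels below are hypothetical placeholders, not the project's code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Verdict:
    family: str
    subfamily: str
    behavior_summary: str


def run_pipeline(
    graph_embedding: Sequence[float],
    classify_family: Callable[[Sequence[float]], str],
    classify_subfamily: Callable[[Sequence[float], str], str],
    describe_behavior: Callable[[str, str], str],
) -> Verdict:
    """Chain the stages so each one can use the output of the previous one."""
    family = classify_family(graph_embedding)                # Stage 1 (assumed)
    subfamily = classify_subfamily(graph_embedding, family)  # Stage 2 (assumed)
    summary = describe_behavior(family, subfamily)           # Stage 3 (assumed)
    return Verdict(family, subfamily, summary)


# Stand-in stage functions with made-up labels, for demonstration only.
verdict = run_pipeline(
    graph_embedding=[0.1, 0.9],
    classify_family=lambda emb: "family_A" if emb[1] > 0.5 else "family_B",
    classify_subfamily=lambda emb, fam: f"{fam}/variant_1",
    describe_behavior=lambda fam, sub: f"Sample resembles {sub} of {fam}.",
)
print(verdict)
```

Passing the family into the subfamily step lets the second classifier restrict itself to subfamilies that belong to the predicted family, which is one common way to structure hierarchical classification.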

Section 05

Dataset and Key Technical Implementation Details

The project includes a Dataset directory that provides the labeled data required for training; high-quality data is key to model robustness. Key implementation concerns include:

  • PDF parsing and graph construction: Parsing object structures, reference relationships, and active content requires in-depth knowledge of the PDF format.
  • GNN model design: Candidate architectures include GCN, GAT, and GraphSAGE. The model must handle graphs of varying size and learn discriminative representations.
  • LLM fine-tuning strategy: This involves selecting a base model, designing task prompt templates, and handling long text input, among other decisions.
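
The GNN design point about graphs of varying size can be illustrated with a dependency-free sketch of one message-passing round followed by a mean readout. This is a simplification in the spirit of GraphSAGE-style mean aggregation: a trained layer would also apply a learned weight matrix and a nonlinearity, which are omitted here. The function names are illustrative and do not come from the project.

```python
def mean_aggregate_layer(features, edges):
    """One message-passing round: each node averages itself with its neighbors.

    features: list of per-node feature vectors (lists of floats).
    edges: list of (i, j) index pairs, treated as undirected.
    """
    n = len(features)
    neighbors = {i: [i] for i in range(n)}  # include a self-loop
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in neighbors[i]) / len(neighbors[i]) for d in range(dim)]
        for i in range(n)
    ]


def mean_readout(features):
    """Pool node vectors into one fixed-size vector for the whole graph."""
    dim = len(features[0])
    return [sum(row[d] for row in features) / len(features) for d in range(dim)]


# Path graph 0 - 1 - 2 where only node 0 carries a signal.
node_features = [[2.0], [0.0], [0.0]]
hidden = mean_aggregate_layer(node_features, [(0, 1), (1, 2)])
print(hidden)                # node 0 -> 1.0, node 1 -> about 0.667, node 2 -> 0.0
print(mean_readout(hidden))  # one vector regardless of node count
```

Because the readout averages over nodes, a 3-node graph and a 300-node graph both yield a vector of the same length, which is what lets a downstream classifier accept documents of any size.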

Section 06

Application Scenarios and Practical Value

This framework has value in multiple scenarios:

  • Enterprise SOC: Detect malicious PDFs automatically and supply context so analysts can quickly gauge the severity of a threat.
  • Threat Intelligence Analysis: Use family classification to track threat actor activity and understand how malware evolves.
  • Sandbox Enhancement: Complement dynamic sandbox results with static analysis to form a more complete threat profile.

Section 07

Limitations and Future Development Directions

The framework faces challenges:

  • Adversarial Sample Attacks: Malware authors may modify PDF structures to evade detection, so model robustness needs to be strengthened.
  • Computational Efficiency: Both GNNs and LLMs are computationally intensive, so inference speed needs to be optimized for large-scale scanning.

Future directions: Explore multi-modal fusion, for example adding visual information such as rendered PDF page images, to improve detection capabilities.

Section 08

Conclusion: A New Direction for Security Defense via Multi-Technology Integration

The GNN-LLM-PDF-Malware project points to an important direction for malware detection. By combining the structural modeling of GNNs with the semantic understanding of LLMs, it moves from simple classification to in-depth analysis. For security practitioners and researchers, it is both a practical tool and a useful reference for applying AI in the security field, and approaches like it could become a key component of next-generation defense systems.