Zing Forum

Open-Source Vulnerability Auto-Detection Based on BERT: From DiverseVul Dataset to Practical Applications

An open-source project by Sameera Ali that fine-tunes BERT on the DiverseVul dataset in three progressive stages to detect vulnerabilities in C/C++ code automatically, offering a practical AI approach to open-source software security.

Tags: vulnerability detection, BERT, DiverseVul, code security, static analysis, open-source security, PyTorch, HuggingFace, C/C++, machine learning
Published 2026-05-24 17:12 | Recent activity 2026-05-24 17:22 | Estimated read 7 min

Section 01

Introduction to the BERT-Based Open-Source Vulnerability Auto-Detection Project

This open-source project, developed by Sameera Ali, fine-tunes the BERT model on the DiverseVul dataset in three progressive stages to automate vulnerability detection in C/C++ code, offering a practical AI approach to open-source software security.

Section 02

Project Background and Problem Awareness

In the open-source software ecosystem, vulnerability detection has always been time-consuming work that demands deep security expertise. Traditional static analysis tools can detect common vulnerability patterns, but they fall short when facing complex program logic and new attack methods. As large language model (LLM) technology matures, using AI for automated vulnerability detection has become a promising research direction. This project uses the DiverseVul dataset as its training foundation. The dataset contains real-world, function-level vulnerability annotations for C/C++ code and covers a range of common security vulnerability types.

Section 03

Core Technical Architecture and Training Strategy

Model Selection

The project adopts bert-base-uncased as the base model for three reasons: it is general-purpose and easy to obtain; it tests the core hypothesis that natural language processing techniques transfer to code analysis; and it leaves room for later comparison experiments with CodeBERT.

Three-Stage Progressive Training

  1. Pilot (pilot verification): 200 data samples, 1 epoch, maximum sequence length of 64 tokens. The goal is to verify quickly that the pipeline runs end to end.
  2. Scaled (scale expansion): 5,000 data samples, 2 epochs, maximum sequence length of 128 tokens. This stage is used to tune hyperparameters and to start capturing more complex vulnerability patterns.
  3. Full (full training): the complete DiverseVul dataset, 3 epochs, maximum sequence length of 512 tokens, aiming for the best detection performance.

The advantage of this strategy is that problems surface in the cheap early stages, which reduces the risk of a failed experiment and of wasted computing resources.
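
The three stages can be written down as plain configuration, which makes it easy to run the same pipeline with different settings. The sketch below is illustrative only: the class name `StageConfig`, the list `STAGES`, and the helper `select_rows` are hypothetical and not taken from the project; only the sample counts, epoch counts, and sequence lengths come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Settings for one training stage (hypothetical structure)."""
    name: str
    num_samples: Optional[int]  # None means "use the full dataset"
    epochs: int
    max_length: int  # maximum sequence length in tokens


# Values taken from the three stages described in the article.
STAGES = [
    StageConfig("pilot", 200, 1, 64),
    StageConfig("scaled", 5000, 2, 128),
    StageConfig("full", None, 3, 512),
]


def select_rows(rows: list, stage: StageConfig) -> list:
    """Return the subset of rows that a given stage trains on."""
    if stage.num_samples is None:
        return rows
    return rows[: stage.num_samples]


if __name__ == "__main__":
    fake_rows = list(range(10_000))  # stand-in for real dataset rows
    for stage in STAGES:
        subset = select_rows(fake_rows, stage)
        print(f"{stage.name}: {len(subset)} rows, "
              f"{stage.epochs} epoch(s), max_length={stage.max_length}")
```

Keeping the stages in one list means a failure in the pilot run can be fixed before any time is spent on the larger runs.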

Section 04

Data Processing Workflow and Tech Stack

Data Processing Workflow

  1. Raw data loading: read C/C++ function code and vulnerability labels from the DiverseVul CSV.
  2. Preprocessing: remove null values, tidy string formatting, and encode labels (0 = safe, 1 = vulnerable).
  3. Tokenization: split each function into subword tokens with the bert-base-uncased tokenizer, with padding and truncation.
  4. Model fine-tuning: train a binary classifier with AutoModelForSequenceClassification.
  5. Evaluation: 70/30 train-test split, with accuracy and weighted F1 score as metrics.
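
Steps 2 and 5 can be sketched with Pandas and plain Python. This is a minimal sketch under stated assumptions, not the project's actual code: the column names `func` and `target`, the function names, and the fixed random seed are all assumptions made for illustration. The weighted F1 is computed by hand here so that the definition is visible: each class's F1 is weighted by how many true examples of that class there are.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, strip whitespace, and encode labels as 0/1.

    Assumes (hypothetically) a 'func' column with source code and a
    'target' column with 0 = safe, 1 = vulnerable.
    """
    clean = df.dropna(subset=["func", "target"]).copy()
    clean["func"] = clean["func"].astype(str).str.strip()
    clean["label"] = clean["target"].astype(int)
    clean = clean[clean["func"] != ""]  # discard functions that were only whitespace
    return clean[["func", "label"]].reset_index(drop=True)


def split_70_30(df: pd.DataFrame, seed: int = 42):
    """Shuffle the rows and return (train, test) with a 70/30 split."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = (len(shuffled) * 7) // 10
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


def weighted_f1(y_true, y_pred) -> float:
    """Per-class F1, averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * (tp + fn) / total  # tp + fn is the class support
    return score


if __name__ == "__main__":
    raw = pd.DataFrame({
        "func": ["int a(void) { return 0; }", None, "   ",
                 "void b(char *s) { char buf[8]; strcpy(buf, s); }"],
        "target": [0, 1, 1, 1],
    })
    print(preprocess(raw))
    print(weighted_f1([0, 0, 0, 1], [0, 0, 1, 1]))
```

Weighted F1 is a reasonable companion to accuracy when the two classes are not the same size, because a model that labels everything as the majority class can still score high accuracy.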

Tech Stack

  • Deep learning framework: PyTorch
  • NLP tools: HuggingFace Transformers, Datasets, Evaluate
  • Data processing: Pandas
  • Development environment: Jupyter Notebook, Python 3.x
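
One way the listed libraries could fit together for tokenization and fine-tuning is sketched below. It is an assumption-laden outline, not the project's notebook: the function names, the `func` and `label` field names, the batch size, and the output directory are hypothetical. The heavy libraries are imported inside `fine_tune` so the file can be loaded without them, and calling `fine_tune` downloads the bert-base-uncased checkpoint. To keep the sketch short, `compute_metrics` reports accuracy only.

```python
MODEL_NAME = "bert-base-uncased"


def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), in the form the Trainer passes them."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}


def fine_tune(train_rows, eval_rows, max_length=128, epochs=2,
              output_dir="bert-vuln"):
    """Fine-tune a binary classifier on rows shaped like
    {"func": "<C/C++ source>", "label": 0 or 1} (hypothetical field names).
    """
    # Imported lazily: these packages are large and need to be installed.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    def tokenize(batch):
        # Pad short functions and truncate long ones to a fixed length.
        return tokenizer(batch["func"], padding="max_length",
                         truncation=True, max_length=max_length)

    train_ds = Dataset.from_list(train_rows).map(tokenize, batched=True)
    eval_ds = Dataset.from_list(eval_rows).map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2)  # 0 = safe, 1 = vulnerable
    args = TrainingArguments(output_dir=output_dir,
                             num_train_epochs=epochs,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                      eval_dataset=eval_ds, compute_metrics=compute_metrics)
    trainer.train()
    return trainer.evaluate()
```

Passing a different `max_length` and `epochs` per run is enough to reproduce the pilot, scaled, and full settings with the same function.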

Section 05

Vulnerability Detection Capabilities and Current Limitations

Detection Capability Scope

The project lists the following common vulnerability types as within its detection scope:

  • Hardcoded credentials
  • SQL injection
  • Buffer overflow
  • Cross-site scripting (XSS)
  • Use of outdated/unsafe functions (e.g., strcpy, gets)

Current Limitations

  • BERT's context window (at most 512 tokens) means long functions are truncated and cross-function vulnerability patterns are difficult to capture.
  • The model may produce false positives on safe code that resembles vulnerable patterns.
  • bert-base-uncased is a general-purpose checkpoint; switching to CodeBERT or adding domain-specific pre-training could further improve the F1 score.
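
The context window limitation can be made concrete with a little arithmetic. The sketch below estimates how much of a function survives truncation at each of the sequence lengths used in training. The token counts in the demo are made up for illustration, and the helper names are hypothetical; the only fixed fact is that BERT adds two special tokens, [CLS] and [SEP], to a single input sequence, so the budget for code tokens is the maximum length minus two.

```python
SPECIAL_TOKENS = 2  # [CLS] and [SEP] for a single-sequence BERT input


def kept_and_dropped(num_code_tokens: int, max_length: int):
    """Return (tokens kept, tokens dropped) after truncation."""
    budget = max_length - SPECIAL_TOKENS
    kept = min(num_code_tokens, budget)
    return kept, num_code_tokens - kept


def truncated_share(token_counts, max_length: int) -> float:
    """Fraction of functions that lose at least one token to truncation."""
    if not token_counts:
        return 0.0
    budget = max_length - SPECIAL_TOKENS
    return sum(n > budget for n in token_counts) / len(token_counts)


if __name__ == "__main__":
    # Hypothetical token counts for five functions (not measured data).
    counts = [40, 90, 300, 700, 1500]
    for max_length in (64, 128, 512):
        share = truncated_share(counts, max_length)
        print(f"max_length={max_length}: {share:.0%} of functions truncated")
    print(kept_and_dropped(700, 512))
```

Running the same calculation on real token counts from the dataset would show how many functions the model only sees in part, which is where a vulnerability near the end of a long function can be missed.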

Section 06

Future Outlook and Project Value Summary

Future Directions

  • Replace BERT with CodeBERT to obtain more suitable embedding representations for code.
  • Integrate into the CI/CD pipeline to enable real-time Pull Request scanning.
  • Add CWE category classification to achieve multi-label prediction.
  • Compare effectiveness with existing static analysis tools (CodeQL, Bandit).
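
For the CWE classification item, the main change from binary classification is the label format: each function gets one slot per CWE category instead of a single 0/1 value. The sketch below shows that encoding and the matching decoding step. The CWE subset, the function names, and the 0.5 threshold are illustrative assumptions, not part of the project. Labels are floats because multi-label heads are commonly trained with a per-class binary cross-entropy loss, which expects float targets.

```python
# Illustrative subset of CWE identifiers; a real label set would be
# derived from the dataset's annotations.
CWE_CLASSES = ["CWE-119", "CWE-125", "CWE-416", "CWE-787"]


def multi_hot(cwes, classes=CWE_CLASSES):
    """Encode a list of CWE IDs as a multi-hot float vector.

    IDs that are not in `classes` are ignored.
    """
    present = set(cwes)
    return [1.0 if c in present else 0.0 for c in classes]


def decode(probabilities, classes=CWE_CLASSES, threshold=0.5):
    """Return the CWE IDs whose predicted probability reaches the threshold."""
    return [c for c, p in zip(classes, probabilities) if p >= threshold]


if __name__ == "__main__":
    print(multi_hot(["CWE-787", "CWE-119"]))
    print(decode([0.91, 0.10, 0.55, 0.30]))
```

An all-zero vector then naturally represents a function with none of the tracked CWE categories.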

Project Value

  • Translate academic research results (DiverseVul dataset) into executable open-source tools.
  • Provide a baseline solution for 'AI-assisted code security analysis'.
  • Offer an accessible starting reference for developers who are new to the code intelligence field.
  • Contribute to software supply chain security as a starting point for standard tools in the development workflow.