Open-Source Vulnerability Auto-Detection Based on BERT: From DiverseVul Dataset to Practical Applications

An open-source project by Sameera Ali that fine-tunes the BERT model on the DiverseVul dataset in three progressive stages to automate vulnerability detection in C/C++ code, offering a practical AI approach to open-source software security.

Tags: vulnerability detection, BERT, DiverseVul, code security, static analysis, open-source security, PyTorch, HuggingFace, C/C++, machine learning

Published 2026-05-24 17:12 | Recent activity 2026-05-24 17:22 | Estimated read 7 min

Section 01

Introduction to the BERT-Based Open-Source Vulnerability Auto-Detection Project

Original Author/Maintainer: Sameera Ali
Source Platform: GitHub
Original Title: LLM_vulnerability_detection-open_source_vulnerability_detection
Original Link: https://github.com/alisameera/LLM_vulnerability_detection-open_source_vulnerability_detection
Source Publication Date: 2026-05-24

Section 02

Project Background and Problem Awareness

In the open-source software ecosystem, vulnerability detection has always been a time-consuming task that requires deep expertise. Traditional static analysis tools can detect common vulnerability patterns, but they fall short when facing complex program logic and new attack methods. As large language model (LLM) technology matures, using AI for automated vulnerability detection has become a promising research direction. This project trains on the DiverseVul dataset, which contains real-world, function-level vulnerability annotations for C/C++ code and covers a range of common security vulnerability types.

Section 03

Core Technical Architecture and Training Strategy

Model Selection

The project adopts bert-base-uncased as the base model for three reasons: it is general-purpose and easily accessible, it tests the core hypothesis that natural language processing techniques transfer to code analysis, and it leaves room for later comparison experiments with CodeBERT.

Three-Stage Progressive Training

Pilot (pipeline verification): 200 samples, 1 epoch, maximum sequence length of 64 tokens. Quickly verifies that the pipeline works end to end.
Scaled (scale expansion): 5,000 samples, 2 epochs, maximum sequence length of 128 tokens. Used to adjust hyperparameters and capture more complex vulnerability patterns.
Full (full training): the complete DiverseVul dataset, 3 epochs, maximum sequence length of 512 tokens. Aims for the best detection performance.

The advantage of this strategy is that problems surface in the small, cheap stages, which reduces the risk of failed experiments and wasted computing resources.
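The three stages can be captured as plain configuration data. The following is a minimal sketch, not the project's actual code: the `Stage` class and `select_rows` helper are hypothetical names, and taking the first N rows is an assumption (the project may sample differently). Only the sample counts, epochs, and sequence lengths come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    num_samples: Optional[int]  # None means "use the complete dataset"
    epochs: int
    max_length: int  # maximum sequence length in tokens


# Values taken from the three stages described in the article.
STAGES = [
    Stage("pilot", num_samples=200, epochs=1, max_length=64),
    Stage("scaled", num_samples=5000, epochs=2, max_length=128),
    Stage("full", num_samples=None, epochs=3, max_length=512),
]


def select_rows(rows: list, stage: Stage) -> list:
    """Return the rows a stage trains on (first N here, which is an assumption)."""
    if stage.num_samples is None:
        return rows
    return rows[: stage.num_samples]


if __name__ == "__main__":
    fake_rows = list(range(10_000))  # stand-in for dataset rows
    for stage in STAGES:
        subset = select_rows(fake_rows, stage)
        print(f"{stage.name}: {len(subset)} rows, "
              f"{stage.epochs} epoch(s), max_length={stage.max_length}")
```

Keeping the stages in one list makes it easy to run the same training function three times and to stop early if the pilot stage exposes a pipeline bug.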

Section 04

Data Processing Workflow and Tech Stack

Data Processing Workflow

Raw data loading: Read C/C++ function code and vulnerability labels from DiverseVul CSV.
Preprocessing: Remove null values, format strings, and encode labels (0=safe, 1=vulnerable).
Tokenization: Use the bert-base-uncased tokenizer to split code into subword tokens (with padding and truncation).
Model fine-tuning: Use AutoModelForSequenceClassification for binary classification training.
Evaluation: 70/30 training-test split, using accuracy and weighted F1 score as metrics.
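The loading, preprocessing, splitting, and evaluation steps above can be sketched without any third-party packages. This is an illustrative stand-in, not the project's code: the project uses Pandas, while this sketch uses the standard library `csv` module so it runs anywhere. The column names `func` and `target`, the sample rows, and the seed are assumptions.

```python
import csv
import io
import random

# Hypothetical column names; the project's CSV may use different headers.
CODE_COL, LABEL_COL = "func", "target"

SAMPLE_CSV = """func,target
"int f(char *s) { char b[8]; strcpy(b, s); return 0; }",1
"int g(int a) { return a + 1; }",0
,1
"void h(void) { }",0
"""


def load_and_clean(text: str) -> list[tuple[str, int]]:
    """Read rows, drop those with missing fields, and encode labels as 0/1."""
    examples = []
    for row in csv.DictReader(io.StringIO(text)):
        code = (row.get(CODE_COL) or "").strip()
        label = (row.get(LABEL_COL) or "").strip()
        if not code or label not in {"0", "1"}:
            continue  # skip null or malformed rows
        examples.append((code, int(label)))  # 0 = safe, 1 = vulnerable
    return examples


def train_test_split(examples, test_fraction=0.3, seed=42):
    """Shuffle deterministically and return (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred) -> float:
    """Per-class F1, averaged with weights proportional to class support."""
    total = 0.0
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * sum(t == cls for t in y_true)
    return total / len(y_true)


if __name__ == "__main__":
    data = load_and_clean(SAMPLE_CSV)
    train, test = train_test_split(data)
    print(len(data), len(train), len(test))
```

Weighted F1 matters here because vulnerability datasets are usually imbalanced: weighting each class's F1 by its support keeps the score from being dominated by how the metric treats a single class.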

Tech Stack

Deep learning framework: PyTorch
NLP tools: HuggingFace Transformers, Datasets, Evaluate
Data processing: Pandas
Development environment: Jupyter Notebook, Python 3.x
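With this stack, the tokenization and fine-tuning steps might look roughly like the sketch below. It is an assumption-laden outline rather than the project's notebook: the row keys `func` and `labels`, the batch size, and the output directory are hypothetical, and the heavy imports sit inside `fine_tune` so the small helper stays usable without the packages installed. Calling `fine_tune` downloads bert-base-uncased and requires PyTorch, Transformers, and Datasets.

```python
def tokenizer_kwargs(max_length: int) -> dict:
    """Padding and truncation settings for one training stage."""
    return {"padding": "max_length", "truncation": True, "max_length": max_length}


def fine_tune(train_rows, eval_rows, max_length=128, epochs=2, output_dir="bert-vuln"):
    """Fine-tune bert-base-uncased as a binary classifier.

    Each row is a dict with hypothetical keys "func" (str) and "labels" (0 or 1).
    """
    # Imported lazily so tokenizer_kwargs works without these dependencies.
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # 0 = safe, 1 = vulnerable
    )

    def encode(batch):
        # Pad or truncate every function to the stage's maximum length.
        return tokenizer(batch["func"], **tokenizer_kwargs(max_length))

    train_ds = Dataset.from_list(train_rows).map(encode, batched=True)
    eval_ds = Dataset.from_list(eval_rows).map(encode, batched=True)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=8,  # assumed; not stated in the source
    )
    trainer = Trainer(
        model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds
    )
    trainer.train()
    return trainer
```

For the pilot stage described earlier, a call such as `fine_tune(train_rows, eval_rows, max_length=64, epochs=1)` would match the stated settings.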

Section 05

Vulnerability Detection Capabilities and Current Limitations

Detection Capability Scope

According to the project, the model can identify several common vulnerability types:

Hardcoded credentials
SQL injection
Buffer overflow
Cross-site scripting (XSS)
Use of outdated/unsafe functions (e.g., strcpy, gets)

Current Limitations

BERT's context window limitation makes it difficult to capture cross-function vulnerability patterns.
May generate false positives for safe code similar to vulnerability patterns.
The general-purpose BERT base model is not pre-trained on code; CodeBERT or domain-specific pre-training may further improve the F1 score.
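To make the context-window limitation concrete, the arithmetic below shows how much of a function survives truncation at each stage's maximum length. It assumes single-sequence BERT input, where the [CLS] and [SEP] special tokens take two positions; the 700-token function is a hypothetical example, and `truncation_report` is an illustrative helper, not part of the project.

```python
def truncation_report(num_tokens: int, max_length: int, special_tokens: int = 2) -> dict:
    """Count how many code tokens fit once [CLS] and [SEP] are reserved."""
    budget = max_length - special_tokens
    kept = min(num_tokens, budget)
    return {
        "kept": kept,
        "dropped": num_tokens - kept,
        "kept_fraction": kept / num_tokens if num_tokens else 1.0,
    }


if __name__ == "__main__":
    for max_length in (64, 128, 512):
        print(max_length, truncation_report(700, max_length))
    # 64  -> kept 62,  dropped 638
    # 128 -> kept 126, dropped 574
    # 512 -> kept 510, dropped 190
```

Even at 512 tokens, anything past the cut-off is invisible to the model, so a vulnerability that depends on code near the end of a long function, or in another function entirely, cannot be detected from the truncated input alone.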

Section 06

Future Outlook and Project Value Summary

Future Directions

Replace BERT with CodeBERT to obtain more suitable embedding representations for code.
Integrate into CI/CD pipelines to enable real-time Pull Request scanning.
Add CWE category classification to achieve multi-label prediction.
Compare effectiveness with existing static analysis tools (CodeQL, Bandit).
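For the CWE classification direction, multi-label prediction typically starts by turning each function's set of CWE identifiers into a multi-hot vector. The sketch below shows one way this could be done; the helper names and the sample CWE lists are hypothetical, and nothing here is taken from the project's code.

```python
def build_label_index(cwe_lists: list[list[str]]) -> dict[str, int]:
    """Map every CWE identifier seen in the data to a fixed vector position."""
    labels = sorted({cwe for cwes in cwe_lists for cwe in cwes})
    return {cwe: i for i, cwe in enumerate(labels)}


def multi_hot(cwes: list[str], index: dict[str, int]) -> list[float]:
    """Encode one function's CWE set as a 0/1 vector (floats, as BCE-style losses expect)."""
    vec = [0.0] * len(index)
    for cwe in cwes:
        vec[index[cwe]] = 1.0
    return vec


if __name__ == "__main__":
    # Hypothetical annotations: one list of CWE IDs per function.
    samples = [["CWE-787"], ["CWE-125", "CWE-787"], []]
    index = build_label_index(samples)
    for cwes in samples:
        print(cwes, multi_hot(cwes, index))
```

An empty list encodes a function with no known weakness as an all-zero vector. In HuggingFace Transformers, a sequence classification model can be configured with `problem_type="multi_label_classification"` and `num_labels` set to the number of CWE classes, in which case it expects float label vectors like these.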

Project Value

Translates academic research results (the DiverseVul dataset) into an executable open-source tool.
Provides a baseline solution for AI-assisted code security analysis.
Offers a useful entry-level reference for developers entering the code intelligence field.
Contributes to software supply chain security as a starting point for standard tooling in the development workflow.
