# Open-Source Vulnerability Auto-Detection Based on BERT: From DiverseVul Dataset to Practical Applications

> An open-source project developed by Sameera Ali that uses the BERT model for three-stage progressive training on the DiverseVul dataset to achieve automated vulnerability detection in C/C++ code, providing a practical AI solution for open-source software security.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-24T09:12:11.000Z
- Last activity: 2026-05-24T09:22:32.899Z
- Popularity: 145.8
- Keywords: vulnerability detection, BERT, DiverseVul, code security, static analysis, open-source security, PyTorch, HuggingFace, C/C++, machine learning
- Page link: https://www.zingnex.cn/en/forum/thread/bert-diversevul
- Canonical: https://www.zingnex.cn/forum/thread/bert-diversevul
- Markdown source: floors_fallback

---

## Introduction to the BERT-Based Open-Source Vulnerability Auto-Detection Project

- Original Author/Maintainer: Sameera Ali
- Source Platform: GitHub
- Original Title: LLM_vulnerability_detection-open_source_vulnerability_detection
- Original Link: https://github.com/alisameera/LLM_vulnerability_detection-open_source_vulnerability_detection
- Source Publication Date: 2026-05-24

## Project Background and Problem Awareness

In the open-source software ecosystem, vulnerability detection has long been a time-consuming task that demands deep security expertise. Traditional static analysis tools detect common vulnerability patterns, but they fall short on complex program logic and novel attack techniques. As large language model (LLM) technology has matured, AI-assisted automated vulnerability detection has become a promising research direction.

This project uses the DiverseVul dataset as its training foundation. The dataset contains real-world, function-level vulnerability annotations for C/C++ code and covers a range of common security vulnerability types.

## Core Technical Architecture and Training Strategy

### Model Selection
The project adopts `bert-base-uncased` as the base model for three reasons: it is general-purpose and easy to obtain, it tests the core hypothesis that natural language processing techniques transfer to code analysis, and it leaves room for later comparison experiments with CodeBERT.

### Three-Stage Progressive Training
1. **Pilot** (pilot verification): 200 samples, 1 epoch, maximum sequence length of 64 tokens. Quickly verifies that the pipeline works end to end.
2. **Scaled** (scale expansion): 5,000 samples, 2 epochs, maximum sequence length of 128 tokens. Adjusts hyperparameters to capture more complex vulnerability patterns.
3. **Full** (full training): the complete DiverseVul dataset, 3 epochs, maximum sequence length of 512 tokens. Aims for the best detection performance.

Staging the work this way reduces the risk of a failed experiment and of wasted compute, because pipeline problems surface on small, cheap runs before the full run begins.
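
The three stages can be captured as plain configuration data. The following is a minimal sketch, not the project's actual code: the `StageConfig` class, the `STAGES` list, and the `subset_size` helper are hypothetical names introduced here for illustration, and only the numeric values come from the stage descriptions above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Settings for one training stage (hypothetical structure)."""
    name: str
    num_samples: Optional[int]  # None means "use the full dataset"
    epochs: int
    max_length: int  # maximum sequence length in tokens


# Values taken from the three stages described above.
STAGES = [
    StageConfig("pilot", 200, 1, 64),
    StageConfig("scaled", 5000, 2, 128),
    StageConfig("full", None, 3, 512),
]


def subset_size(stage: StageConfig, dataset_size: int) -> int:
    """Number of rows to train on for this stage, capped at the dataset size."""
    if stage.num_samples is None:
        return dataset_size
    return min(stage.num_samples, dataset_size)


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, subset_size(stage, 10_000), stage.epochs, stage.max_length)
```

Keeping the stages in one list means a single training function can be reused for all three runs, with `max_length` passed to the tokenizer and `epochs` to the training loop.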

## Data Processing Workflow and Tech Stack

### Data Processing Workflow
1. Raw data loading: read C/C++ function code and vulnerability labels from the DiverseVul CSV.
2. Preprocessing: drop null values, normalize strings, and encode labels (0 = safe, 1 = vulnerable).
3. Tokenization: use the `bert-base-uncased` WordPiece tokenizer with padding and truncation.
4. Model fine-tuning: use `AutoModelForSequenceClassification` for binary classification training.
5. Evaluation: 70/30 train/test split, with accuracy and weighted F1 score as metrics.
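
Steps 1, 2, and the split in step 5 can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the project uses Pandas, whereas this sketch swaps in Python's standard `csv` module so that it runs without extra dependencies. The column names `func` and `target`, the inline `SAMPLE_CSV` data, and both helper functions are hypothetical.

```python
import csv
import io
import random

# Hypothetical CSV layout: a "func" column with source text and a "target"
# column with the label. The real DiverseVul export may use other names.
SAMPLE_CSV = '''func,target
"void copy(char *d, const char *s) { strcpy(d, s); }",1
"int add(int a, int b) { return a + b; }",0
,1
"int main(void) { return 0; }",0
'''


def load_rows(csv_text):
    """Parse the CSV and drop rows with a missing function body or label."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        code = (record.get("func") or "").strip()
        label = (record.get("target") or "").strip()
        if not code or label not in {"0", "1"}:
            continue  # skip null or malformed rows
        rows.append({"text": code, "label": int(label)})  # 0=safe, 1=vulnerable
    return rows


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle deterministically, then split into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = load_rows(SAMPLE_CSV)
train_rows, test_rows = train_test_split(rows)
print(len(rows), len(train_rows), len(test_rows))
```

A fixed seed keeps the split reproducible across the three training stages, so results from the pilot and full runs are evaluated on comparable held-out data.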

### Tech Stack
- Deep learning framework: PyTorch
- NLP tools: HuggingFace Transformers, Datasets, Evaluate
- Data processing: Pandas
- Development environment: Jupyter Notebook, Python 3.x

## Vulnerability Detection Capabilities and Current Limitations

### Detection Capability Scope
According to the project, the model can identify several common vulnerability types:
- Hardcoded credentials
- SQL injection
- Buffer overflow
- Cross-site scripting (XSS)
- Use of outdated/unsafe functions (e.g., `strcpy`, `gets`)

### Current Limitations
- BERT's limited context window (at most 512 tokens per input) makes it difficult to capture vulnerability patterns that span multiple functions.
- The model may produce false positives on safe code that resembles vulnerable patterns.
- The base model is not pre-trained on source code, so CodeBERT or domain-specific pre-training could further improve the F1 score.
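
To make the false-positive concern concrete, the sketch below computes a weighted F1 score by hand on toy labels. It follows the usual definition (per-class F1 averaged with weights equal to each class's support) and is not the project's evaluation code. The function names and the label lists are made up for illustration.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for a single class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == cls)
        score += (support / total) * f1_for_class(y_true, y_pred, cls)
    return score


# Toy labels: 8 safe functions (0) and 2 vulnerable ones (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
perfect = list(y_true)
# Same predictions, but two safe functions are flagged as vulnerable.
with_false_positives = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

print(weighted_f1(y_true, perfect))
print(weighted_f1(y_true, with_false_positives))
```

In this toy case the two false positives halve the precision of the vulnerable class, yet the weighted score drops only moderately because the safe class dominates the weighting. On imbalanced data it is therefore worth inspecting per-class scores alongside the weighted F1.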

## Future Outlook and Project Value Summary

### Future Directions
- Replace BERT with CodeBERT to obtain more suitable embedding representations for code.
- Integrate into a CI/CD pipeline to enable real-time Pull Request scanning.
- Add CWE category classification to achieve multi-label prediction.
- Compare effectiveness with existing static analysis tools such as CodeQL and Bandit (Bandit targets Python, so a C/C++ comparison would rely mainly on CodeQL).
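
The CWE classification direction would turn the binary label into a multi-label target, where one function can carry several CWE categories at once. Below is a minimal sketch of the label encoding this would need. The four-entry `CWE_CLASSES` vocabulary, the 0.5 threshold, and both helper functions are assumptions for illustration and do not come from the project.

```python
# Hypothetical label vocabulary; a real project would derive it from the dataset.
CWE_CLASSES = ["CWE-119", "CWE-125", "CWE-416", "CWE-787"]
CWE_INDEX = {cwe: i for i, cwe in enumerate(CWE_CLASSES)}


def encode_cwes(cwes):
    """Turn a list of CWE IDs into a multi-hot vector; unknown IDs are ignored."""
    vector = [0.0] * len(CWE_CLASSES)
    for cwe in cwes:
        idx = CWE_INDEX.get(cwe)
        if idx is not None:
            vector[idx] = 1.0
    return vector


def decode_scores(scores, threshold=0.5):
    """Return every CWE whose predicted probability meets the threshold."""
    return [cwe for cwe, s in zip(CWE_CLASSES, scores) if s >= threshold]


print(encode_cwes(["CWE-787", "CWE-125"]))
print(decode_scores([0.1, 0.7, 0.2, 0.9]))
```

Float multi-hot vectors of this kind are the label format that HuggingFace Transformers expects when a sequence classification model is configured with `problem_type="multi_label_classification"`, so the existing fine-tuning setup could plausibly be adapted with limited changes.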

### Project Value
- Turns an academic research result (the DiverseVul dataset) into a runnable open-source tool.
- Provides a baseline for AI-assisted code security analysis.
- Offers an accessible entry point for developers new to the code intelligence field.
- Contributes to software supply chain security as a starting point for tooling that could become a standard part of the development workflow.
