Zing Forum


Reproducibility Study of Vul-RAG: Performance Bottlenecks of Open-Weight Models in Vulnerability Detection

A reproducibility study of the RAG-based vulnerability detection framework Vul-RAG finds that, even with the latest large language models, pairwise accuracy in vulnerability detection stays near 0.30, a ceiling that scaling up model size alone does not break.

Tags: Vulnerability Detection, RAG, Reproducibility, Open-Source Models, Software Security
Published 2026-06-03 19:20 | Recent activity 2026-06-04 13:18 | Estimated read 8 min

Section 01

Introduction to Vul-RAG Reproducibility Study: Performance Bottlenecks of Open-Weight Models in Vulnerability Detection

Original Authors & Source

Core Insights

A reproducibility study of the RAG-based vulnerability detection framework Vul-RAG finds that, even with the latest open-weight models, pairwise accuracy plateaus at around 0.30, and simply increasing model size does not lift it. The study examines how reproducible and transferable the Vul-RAG results are, and serves as a reference point for applying such models in software security.


Section 02

Research Background and Motivation

Large language models combined with Retrieval-Augmented Generation (RAG) show great potential for software vulnerability detection. Vul-RAG is a representative RAG framework that improves detection by injecting high-level vulnerability knowledge into the model's context. However, many current studies rely on proprietary models and APIs, which raises doubts about whether their results can be reproduced and transferred to other models.

Core question: Does Vul-RAG's strong performance stem from the method itself, or only from the specific closed-source models it was run on? Do the results still hold when those models are replaced with open-weight ones?


Section 03

Reproducibility Method Design

The study adopts a systematic reproducibility strategy, divided into two phases:

Phase 1: Strict Reproducibility

In a local environment, the study reruns the open-source baseline models reported in the paper (such as CodeLlama and DeepSeek-Coder) to reproduce the original results and verify that the basic method is reproducible.

Phase 2: Extended Evaluation

The evaluation is then extended to a broader set of models, including:

  • Code-specific models (StarCoder, CodeQwen)
  • General-purpose large models (Llama3, Qwen2.5)
  • Reasoning models (DeepSeek-R1, Qwen-QwQ)
  • Variants of different parameter scales (4B to 70B)

Together, these models allow a comprehensive evaluation of how sensitive the method is to model selection.


Section 04

Key Findings: Existence of Performance Bottlenecks

0.30 Pairwise Accuracy Ceiling

Across all tested models, pairwise accuracy (the share of code pairs for which the model correctly identifies both the vulnerable code and its fixed version) stabilizes at around 0.30. Larger parameter counts, newer training data, and more advanced architectures do not break this ceiling; scaling from 7B to 70B brings only minimal improvement.
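To make the metric concrete, here is a minimal sketch of how pairwise accuracy could be computed. It assumes each test case is a (vulnerable, patched) pair of functions and that the model returns a yes/no verdict for each version; the `PairPrediction` type and the ten verdicts are hypothetical and are not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairPrediction:
    """Hypothetical record: model verdicts for one (vulnerable, patched) pair."""
    vulnerable_flagged: bool  # did the model flag the vulnerable version?
    patched_flagged: bool     # did the model flag the fixed version?


def pairwise_accuracy(pairs):
    """Fraction of pairs where the vulnerable version is flagged
    and the patched version is not."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1 for p in pairs if p.vulnerable_flagged and not p.patched_flagged
    )
    return correct / len(pairs)


# Made-up verdicts for ten pairs, for illustration only.
pairs = (
    [PairPrediction(True, False)] * 3     # both versions judged correctly
    + [PairPrediction(True, True)] * 4    # model flags both versions
    + [PairPrediction(False, False)] * 2  # model flags neither version
    + [PairPrediction(False, True)] * 1   # both verdicts wrong
)
print(pairwise_accuracy(pairs))  # 0.3
```

The metric is strict by design: a model that answers "vulnerable" for every input gets half of the individual samples right on balanced pairs, yet scores 0 on pairwise accuracy, because it never separates a vulnerable function from its fix.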

Deep Reasons for the Bottleneck

Current RAG-enhanced vulnerability detection methods may have fundamental limitations:

  1. Retrieval Quality Limitation: RAG effectiveness highly depends on the quality of retrieved vulnerability knowledge
  2. Context Understanding Limitation: Models struggle to accurately locate vulnerability patterns in complex code
  3. Training Data Bias: The distribution of vulnerability samples in pre-training data is insufficient to support more fine-grained detection

Section 05

Comparative Analysis of Model Characteristics

Code-Specific vs. General-Purpose Models

Code-specific models (e.g., StarCoder) hold an edge in general code understanding tasks, but that edge largely disappears in vulnerability detection; general-purpose models reach similar levels with appropriate prompt engineering.

Reasoning Model Performance

Specialized reasoning models (e.g., DeepSeek-R1) do not show the expected advantages in vulnerability detection, possibly because vulnerability detection relies more on pattern recognition than step-by-step reasoning.

Quantization and Efficiency Trade-off

4-bit quantization significantly reduces deployment costs while retaining most of the detection performance.
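A rough back-of-the-envelope calculation shows where the savings come from. The sketch below estimates memory for model weights only, and assumes a 16-bit baseline (the post does not state the unquantized precision). It ignores quantization metadata such as scales, as well as activations and the KV cache, so real footprints will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts taken from the model sizes discussed in the post.
for name, params in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(params, 16)  # assumed 16-bit baseline
    int4 = weight_memory_gb(params, 4)   # 4-bit quantized weights
    print(f"{name}: 16-bit ~ {fp16:.1f} GB, 4-bit ~ {int4:.1f} GB")
```

Under these assumptions, 4-bit weights take a quarter of the 16-bit footprint: roughly 3.5 GB instead of 14 GB for a 7B model, and 35 GB instead of 140 GB for a 70B model.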


Section 06

Practical Implications and Future Directions

Recommendations for Security Practitioners

  1. No need to pursue the largest model: The marginal gain of 70B models over 7B models is limited, so weigh inference cost first when choosing a model
  2. Focus on RAG system quality: Improving the quality of the retrieval component is more effective than swapping in a stronger LLM
  3. Combine with traditional static analysis: LLM detection should be a supplement rather than a replacement for traditional tools like CodeQL and Semgrep
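One way to treat LLM output as a supplement is to merge it with static-analysis findings and rank the result for human review. The sketch below is an assumption about how such a triage step might look, not something described in the study: the `Finding` type, the `triage` function, the priority order, and the file names are all hypothetical, and the code does not call CodeQL or Semgrep itself. It only expects their findings to have been normalized to (file, function) records beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical normalized finding: one flagged function in one file."""
    file: str
    function: str


def triage(static_findings, llm_findings):
    """Rank findings: corroborated first, then static-only, then LLM-only."""
    def by_location(f):
        return (f.file, f.function)

    both = static_findings & llm_findings
    static_only = static_findings - llm_findings
    llm_only = llm_findings - static_findings
    return (
        [("corroborated", f) for f in sorted(both, key=by_location)]
        + [("static-only", f) for f in sorted(static_only, key=by_location)]
        + [("llm-only", f) for f in sorted(llm_only, key=by_location)]
    )


# Made-up findings for illustration only.
static = {Finding("net/core.c", "parse_header"), Finding("fs/io.c", "read_block")}
llm = {Finding("net/core.c", "parse_header"), Finding("mm/pool.c", "free_pool")}

for label, finding in triage(static, llm):
    print(label, finding.file, finding.function)
```

Putting LLM-only findings last is a design choice that follows from the reported accuracy ceiling: with pairwise accuracy near 0.30, an uncorroborated LLM verdict is weaker evidence than a rule-based match, though it may still surface issues the rules miss.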

Research Direction Suggestions

  • Fine-grained vulnerability localization: Move from function-level to line-level detection
  • Multimodal fusion: Combine multi-source data such as code change history and commit messages
  • Domain adaptation: Customize detection strategies for specific programming languages or frameworks