# Reproducibility Study of Vul-RAG: Performance Bottlenecks of Open-Weight Models in Vulnerability Detection

> A reproducibility study of the RAG-based vulnerability detection framework Vul-RAG finds that even with the latest large language models, pairwise accuracy plateaus at roughly 0.30, a ceiling that scaling up model size alone does not break.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-03T11:20:37.000Z
- Last activity: 2026-06-04T05:18:15.457Z
- Popularity: 118.0
- Keywords: vulnerability detection, RAG, reproducibility, open-source models, software security
- Page link: https://www.zingnex.cn/en/forum/thread/vul-rag
- Canonical: https://www.zingnex.cn/forum/thread/vul-rag

---

## Introduction to Vul-RAG Reproducibility Study: Performance Bottlenecks of Open-Weight Models in Vulnerability Detection

### Original Authors & Source
- **Original Author/Team**: IT Security Research Team at Esslingen University of Applied Sciences, Germany
- **Source Platform**: arXiv
- **Original Title**: Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models
- **Original Link**: http://arxiv.org/abs/2606.04739v1
- **Publication Date**: June 3, 2026
- **Open-Source Code**: https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG

### Core Insights
The study reproduces the RAG-based vulnerability detection framework Vul-RAG with open-weight models and finds that, even with the latest of them, pairwise accuracy plateaus at around 0.30, and increasing model size alone does not lift it. It examines both the reproducibility and the transferability of Vul-RAG, which makes it a useful reference for applying LLMs in software security.

## Research Background and Motivation

Large language models combined with Retrieval-Augmented Generation (RAG) show considerable potential for software vulnerability detection. Vul-RAG is a representative RAG framework that improves detection by injecting high-level vulnerability knowledge into the model's context. However, many current studies rely on proprietary models and APIs, which raises doubts about the **reproducibility** and **transferability** of their results.

Core question: Does the strong performance of Vul-RAG stem from the method itself, or only from the specific closed-source models it was run with? Do the results still hold when those models are replaced with open-weight ones?

## Reproducibility Method Design

The study follows a two-phase reproduction strategy:

### Phase 1: Strict Reproducibility
The open-source baseline models reported in the original paper (such as CodeLlama and DeepSeek-Coder) are run in a local environment to reproduce the original results and verify that the basic method is reproducible.

### Phase 2: Extended Evaluation
The evaluation is then extended to a broader set of models, including:
- Code-specific models (StarCoder, CodeQwen)
- General-purpose large models (Llama3, Qwen2.5)
- Reasoning models (DeepSeek-R1, Qwen-QwQ)
- Variants of different parameter scales (4B to 70B)

Together, these runs measure how sensitive the method is to the choice of model.

## Key Findings: Existence of Performance Bottlenecks

### 0.30 Pairwise Accuracy Ceiling
Across all tested models, pairwise accuracy stabilizes at around 0.30. Pairwise accuracy is the share of vulnerable/fixed code pairs for which the model classifies both versions correctly: it flags the vulnerable code and does not flag the fixed code. Models with more parameters, newer training data, or more advanced architectures do not break this ceiling; going from 7B to 70B parameters brings only a minimal improvement.
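The metric can be made concrete with a minimal sketch. The `PairPrediction` type, the `pairwise_accuracy` function, and the sample verdicts below are hypothetical illustrations, not taken from the study's code or data; they assume the definition above, where a pair counts only if both versions are classified correctly.

```python
from dataclasses import dataclass


@dataclass
class PairPrediction:
    """Model verdicts for one (vulnerable, fixed) code pair (hypothetical type)."""
    vulnerable_flagged: bool  # True if the model flagged the vulnerable version
    fixed_flagged: bool       # True if the model flagged the fixed version


def pairwise_accuracy(pairs):
    """Fraction of pairs where the vulnerable version is flagged
    and the fixed version is not flagged."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1 for p in pairs if p.vulnerable_flagged and not p.fixed_flagged
    )
    return correct / len(pairs)


# Made-up verdicts for ten pairs, for illustration only.
example_pairs = (
    [PairPrediction(True, False)] * 3     # both versions classified correctly
    + [PairPrediction(True, True)] * 4    # fixed version wrongly flagged
    + [PairPrediction(False, False)] * 3  # vulnerable version missed
)

print(f"pairwise accuracy: {pairwise_accuracy(example_pairs):.2f}")
```

Under this definition, a model that flags everything as vulnerable scores 0, as does one that flags nothing, so the metric rewards only models that actually tell the two versions apart.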

### Possible Causes of the Bottleneck
The plateau suggests that current RAG-enhanced vulnerability detection methods may have fundamental limitations:
1. **Retrieval Quality Limitation**: RAG effectiveness highly depends on the quality of retrieved vulnerability knowledge
2. **Context Understanding Limitation**: Models struggle to accurately locate vulnerability patterns in complex code
3. **Training Data Bias**: The distribution of vulnerability samples in pre-training data is insufficient to support more fine-grained detection

## Comparative Analysis of Model Characteristics

### Code-Specific vs. General-Purpose Models
Code-specific models (e.g., StarCoder) have an edge in general code understanding tasks, but that edge shrinks significantly in vulnerability detection; general-purpose models reach similar levels with appropriate prompt engineering.

### Reasoning Model Performance
Specialized reasoning models (e.g., DeepSeek-R1) do not show the expected advantages in vulnerability detection, possibly because vulnerability detection relies more on pattern recognition than step-by-step reasoning.

### Quantization and Efficiency Trade-off
4-bit quantization significantly reduces deployment costs while maintaining most of the performance.
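A back-of-the-envelope calculation shows where the cost reduction comes from. The sketch below is an assumption-laden estimate, not a figure from the study: it counts weight storage only (parameters × bits per weight ÷ 8) and ignores the KV cache, activations, and the small overhead of quantization metadata. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width; runtime
    overhead (KV cache, activations, quantization scales) is not included.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Compare 16-bit and 4-bit storage for the two model sizes discussed above.
for label, params in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{label}: 16-bit ~ {fp16:.1f} GB, 4-bit ~ {int4:.1f} GB")
```

On this estimate, 4-bit weights take a quarter of the space of 16-bit weights, which is often the difference between needing several GPUs and fitting on one.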

## Practical Implications and Future Directions

### Recommendations for Security Practitioners
1. **No need to pursue the largest model**: The marginal gain of 70B models over 7B models is limited, so weigh inference cost first
2. **Focus on RAG system quality**: Improving the quality of the retrieval component is more effective than swapping in a stronger LLM
3. **Combine with traditional static analysis**: LLM detection should be a supplement rather than a replacement for traditional tools like CodeQL and Semgrep
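One way to read the third recommendation is as a triage step in which LLM verdicts annotate, rather than override, static analysis results. The sketch below is a hypothetical illustration: the `triage` function, the label strings, and the `(file, function)` location keys are all assumptions, and it does not call CodeQL or Semgrep; it only assumes their findings have already been exported as a set of locations.

```python
def triage(static_findings, llm_findings):
    """Label each flagged location by which detector reported it.

    Both arguments are sets of (file, function) tuples (a hypothetical
    format). Static analysis results are always kept; LLM-only results
    are kept but marked for manual review.
    """
    labels = {}
    for location in static_findings | llm_findings:
        in_static = location in static_findings
        in_llm = location in llm_findings
        if in_static and in_llm:
            labels[location] = "flagged-by-both"
        elif in_static:
            labels[location] = "static-only"
        else:
            labels[location] = "llm-only"  # supplement: review manually
    return labels


# Made-up findings for illustration.
static_hits = {("net/socket.c", "sock_read"), ("fs/inode.c", "inode_put")}
llm_hits = {("net/socket.c", "sock_read"), ("mm/slab.c", "slab_free")}

for location, label in sorted(triage(static_hits, llm_hits).items()):
    print(location, label)
```

Keeping every static finding and merely tagging the LLM-only ones reflects the "supplement, not replacement" stance: at a pairwise accuracy near 0.30, LLM verdicts are better suited to prioritizing review than to gating it.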

### Research Direction Suggestions
- **Fine-grained vulnerability localization**: Move from function-level to code line-level
- **Multimodal fusion**: Combine multi-source data such as code change history and commit information
- **Domain adaptation**: Customize detection strategies for specific programming languages or frameworks