# Building a RAG System from Scratch: Implementing Retrieval-Augmented Generation on the vLLM Codebase

> A RAG pipeline built on BM25 retrieval and a local large language model, designed for codebase question answering, with precise source code references and metric-based evaluation.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-11T11:11:15.000Z
- Last activity: 2026-06-11T11:19:21.840Z
- Popularity: 159.9
- Keywords: RAG, BM25, vLLM, code retrieval, local LLM, Qwen, retrieval-augmented generation, code question answering
- Page link: https://www.zingnex.cn/en/forum/thread/rag-vllm
- Canonical: https://www.zingnex.cn/forum/thread/rag-vllm
- Markdown source: floors_fallback

---

## Original Author and Source

- **Original Author/Maintainer**: marco-kraemer (42 course project, collaborator msantos2)
- **Source Platform**: GitHub
- **Original Title**: RAG_against_the_machine
- **Original Link**: https://github.com/marco-kraemer/RAG_against_the_machine
- **Publication Date**: June 11, 2026

---

## Project Background and Motivation

As large language models (LLMs) advance rapidly, getting them to answer questions about a specific codebase accurately has become a key challenge. General-purpose LLMs have broad knowledge, but they often hallucinate when asked about a particular project's code, producing answers that sound plausible but are wrong.

Retrieval-Augmented Generation (RAG) addresses this by combining external knowledge retrieval with text generation, so the model can ground its answers in real sources and cite them. This project implements RAG for codebase question answering, building an end-to-end pipeline over the codebase of vLLM, a popular LLM inference framework.

---

## System Architecture Overview

The RAG system consists of three core modules that form a complete data flow from source files to answers: the indexer, the retriever, and the generator.

## 1. Indexer

The indexer converts the raw codebase into searchable structured data. It traverses the `vllm-0.10.1` directory, reads all `.py` and `.md` files, and splits them into chunks with LangChain's `RecursiveCharacterTextSplitter`.

For different file types, the project adopts differentiated chunking strategies:

- **Python code files**: Use language-specific separators (e.g., `\nclass `, `\ndef `, `\n\tdef `, etc.) to ensure chunk boundaries align with code structures (classes, functions), and set a 50% overlap rate to prevent truncation of definitions
- **Markdown documents**: Adopt the default hierarchical separation strategy (paragraph → line → word → character), with a 10% overlap rate sufficient for natural language text

This differentiated processing ensures the integrity of code semantic units, while `add_start_index=True` records the exact character offset of each chunk in the source file, laying the foundation for subsequent source references.
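The boundary-aligned chunking described above can be illustrated with a minimal standard-library sketch. It is a simplified stand-in for `RecursiveCharacterTextSplitter`, not the library itself: the `Chunk` type, the `split_python` helper, and the `chunk_size` value are hypothetical. Overlap and recursion into finer separators are omitted, so a single piece longer than `chunk_size` stays as one oversized chunk.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_index: int  # character offset of the chunk in the source file


# Separators modeled on the ones named in the text above.
PY_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef "]


def split_python(source: str, chunk_size: int = 120) -> list[Chunk]:
    """Cut `source` at class/def boundaries, then pack pieces into chunks."""
    # Zero-width lookaheads keep each separator at the start of its piece,
    # so the pieces concatenate back to the original source exactly.
    pattern = "|".join(f"(?={re.escape(sep)})" for sep in PY_SEPARATORS)
    pieces = [p for p in re.split(pattern, source) if p]

    chunks: list[Chunk] = []
    buffer, buffer_start, offset = "", 0, 0
    for piece in pieces:
        # Flush the buffer when the next piece would exceed the size budget.
        if buffer and len(buffer) + len(piece) > chunk_size:
            chunks.append(Chunk(buffer, buffer_start))
            buffer, buffer_start = "", offset
        buffer += piece
        offset += len(piece)
    if buffer:
        chunks.append(Chunk(buffer, buffer_start))
    return chunks


sample = (
    "import os\n\nclass A:\n\tdef f(self):\n\t\treturn 1\n"
    "\ndef g():\n    return 2\n"
)
chunks = split_python(sample, chunk_size=30)
for chunk in chunks:
    print(chunk.start_index, repr(chunk.text))
```

Because every chunk carries its start offset, slicing the source at `start_index` reproduces the chunk text, which is the property that later makes exact source references possible.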

## 2. Retriever

The retriever is built on the BM25 algorithm and uses the `bm25s` library for efficient lexical retrieval. BM25 (Best Matching 25) is a classic ranking function that refines Term Frequency-Inverse Document Frequency (TF-IDF) weighting with term-frequency saturation and document-length normalization. It suits code retrieval well because it matches exact keywords such as function and variable names.
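For reference, the standard Okapi BM25 score of a chunk $D$ for a query $Q$ is commonly written as follows. The source does not state which BM25 variant or parameter values the project uses, so the IDF form shown here and the typical defaults ($k_1$ between 1.2 and 2.0, $b = 0.75$) are assumptions.

$$
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

$$
\mathrm{IDF}(q) = \ln\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
$$

Here $f(q, D)$ is the number of times term $q$ occurs in chunk $D$, $|D|$ is the chunk length in tokens, $\mathrm{avgdl}$ is the average chunk length, $N$ is the total number of chunks, and $n(q)$ is the number of chunks containing $q$. The $k_1$ term caps the benefit of repeated occurrences, and $b$ controls how strongly long chunks are penalized.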

When a user submits a query, the system tokenizes the query, then retrieves the top-k most relevant chunks from the BM25 index and maps them back to the character offset range (first_character_index, last_character_index) of the original file.
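The query flow above can be sketched in plain Python. This is a from-scratch BM25 scorer for illustration only, not the `bm25s` API that the project uses. The tokenizer, the `bm25_top_k` helper, the `k1` and `b` values, the chunk dictionary layout, and the sample paths and offsets are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep identifier-like tokens whole (e.g. add_request).
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def bm25_top_k(query: str, chunks: list[dict], k: int = 2,
               k1: float = 1.5, b: float = 0.75) -> list[tuple]:
    """Return (file_path, first_index, last_index, score) for the top-k chunks."""
    if not chunks:
        return []
    docs = [tokenize(c["text"]) for c in chunks]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for chunk, doc in zip(chunks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, chunk))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        (c["file_path"], c["first_character_index"],
         c["last_character_index"], s)
        for s, c in scored[:k]
    ]


# Hypothetical chunks; paths and offsets are made up for the example.
chunks = [
    {"file_path": "vllm/engine.py", "first_character_index": 0,
     "last_character_index": 42,
     "text": "def add_request(self, request_id, prompt):"},
    {"file_path": "vllm/scheduler.py", "first_character_index": 100,
     "last_character_index": 139,
     "text": "def schedule(self): return self.waiting"},
    {"file_path": "docs/usage.md", "first_character_index": 0,
     "last_character_index": 48,
     "text": "Install the package with pip and run the server."},
]
hits = bm25_top_k("where is add_request defined", chunks, k=1)
print(hits)
```

Only the first chunk contains the token `add_request`, so it is the only one with a non-zero score, which shows why exact identifier matches dominate lexical retrieval on code.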

## 3. Generator

The generator uses a locally deployed Qwen/Qwen2.5-0.5B-Instruct model (approximately 0.5 billion parameters). It takes the chunks returned by the retriever, reads an extended window around each chunk directly from the source files, builds a prompt containing the real code snippets, and generates a natural language answer grounded in that evidence.
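The window extraction and prompt assembly can be sketched as follows. This is a minimal illustration under stated assumptions: `read_window`, `build_prompt`, the `margin` size, and the prompt wording are hypothetical, and an in-memory `sources` dictionary stands in for reading files from disk. Passing the prompt to the Qwen model is left out.

```python
def read_window(source: str, first: int, last: int, margin: int = 200) -> str:
    """Return the chunk plus up to `margin` characters of context on each side."""
    start = max(0, first - margin)
    end = min(len(source), last + margin)
    return source[start:end]


def build_prompt(question: str, hits: list[tuple], sources: dict[str, str],
                 margin: int = 200) -> str:
    """Assemble a prompt from retrieved (path, first, last) character ranges."""
    blocks = []
    for path, first, last in hits:
        snippet = read_window(sources[path], first, last, margin)
        # Label each excerpt so the answer can cite its origin.
        blocks.append(f"# Source: {path} [{first}:{last}]\n{snippet}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the code excerpts below. "
        "Cite the source path for each claim.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical file content and hit; a real run would read files from disk.
sources = {
    "vllm/engine.py": "x" * 50 + "def add_request(self):\n    pass\n" + "y" * 50,
}
hits = [("vllm/engine.py", 50, 72)]
prompt = build_prompt("Where is add_request defined?", hits, sources, margin=10)
print(prompt)
```

Reading a window wider than the retrieved chunk gives a small model surrounding lines, such as a function body after its signature, that the chunk boundary may have cut off.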

---

## Efficient Implementation with bm25s

The project chose the `bm25s` library over more traditional BM25 implementations because it is optimized for fast retrieval. It is a Python library built on NumPy and SciPy sparse matrices that computes BM25 term scores eagerly at indexing time, so it can search large codebases such as vLLM quickly without excessive memory overhead.