Zing Forum


Building a RAG System from Scratch: Implementing Retrieval-Augmented Generation on the vLLM Codebase

An implementation of a RAG pipeline based on BM25 retrieval and local large language models, designed specifically for codebase question answering, supporting precise source code references and validation of evaluation metrics.

RAG, BM25, vLLM, Code Retrieval, Local LLM, Qwen, Retrieval-Augmented Generation, Code Q&A
Published 2026-06-11 19:11 | Recent activity 2026-06-11 19:19 | Estimated read 6 min
3

Section 03

Project Background and Motivation

As large language models (LLMs) develop rapidly, getting them to answer questions about a specific codebase accurately has become a key challenge. General-purpose LLMs have broad knowledge, but they often hallucinate when asked about a particular project's code, producing answers that sound plausible but are wrong.

Retrieval-Augmented Generation (RAG) addresses this by combining external knowledge retrieval with text generation, so the model can cite real sources in its answers. This project implements the technique for codebase question answering, building an end-to-end RAG pipeline over the source code of vLLM, a popular LLM inference framework.


4

Section 04

System Architecture Overview

The RAG system consists of three core modules that form an end-to-end data flow, from indexing through retrieval to generation:

5

Section 05

1. Indexer

The indexer converts the raw codebase into searchable, structured data. It traverses the vllm-0.10.1 directory, reads all .py and .md files, and splits them into chunks with LangChain's RecursiveCharacterTextSplitter.

For different file types, the project adopts differentiated chunking strategies:

  • Python code files: Language-specific separators (e.g., \nclass , \ndef , \n\tdef ) keep chunk boundaries aligned with code structures (classes, functions), and a 50% overlap reduces the risk of cutting a definition in half
  • Markdown documents: The default hierarchical separators (paragraph → line → word → character) are used with a 10% overlap, which is sufficient for natural-language text

This per-type handling helps keep code semantic units intact, and add_start_index=True records the exact character offset of each chunk in its source file, which is what makes the later source references possible.
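The chunking idea for Python files can be sketched without any dependencies. The following is a simplified stand-in for RecursiveCharacterTextSplitter, not its actual algorithm: it cuts only at definition boundaries, merges pieces greedily, and records each chunk's start offset in the spirit of add_start_index=True. The names split_python and Chunk, and the chunk sizes, are hypothetical.

```python
import re
from dataclasses import dataclass

# A newline followed by the start of a definition (separators named in the text).
PY_BOUNDARY = re.compile(r"\n(?=class |def |\tdef )")


@dataclass
class Chunk:
    text: str
    start_index: int  # character offset of the chunk in the source file


def split_python(source: str, chunk_size: int = 200, chunk_overlap: int = 100) -> list[Chunk]:
    # Cut the source into pieces that each begin at a definition boundary.
    cuts = [0] + [m.start() + 1 for m in PY_BOUNDARY.finditer(source)] + [len(source)]
    pieces = [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]

    chunks: list[Chunk] = []
    i = 0
    while i < len(pieces):
        start = pieces[i][0]
        j = i + 1
        # Grow the chunk piece by piece while it still fits in chunk_size.
        # A single oversized piece is kept whole here; the real splitter
        # would fall back to finer separators instead.
        while j < len(pieces) and pieces[j][1] - start <= chunk_size:
            j += 1
        end = pieces[j - 1][1]
        chunks.append(Chunk(source[start:end], start))
        if j == len(pieces):
            break
        # Step back over trailing pieces that fit in the overlap budget,
        # always advancing by at least one piece.
        k = j
        while k - 1 > i and end - pieces[k - 1][0] <= chunk_overlap:
            k -= 1
        i = k
    return chunks
```

Because every chunk carries its start_index, a retrieved chunk can later be located in the original file as source[start_index : start_index + len(text)].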

6

Section 06

2. Retriever

The retriever is built on the BM25 algorithm, using the bm25s library for efficient lexical retrieval. BM25 (Best Matching 25) is a classic ranking function that refines Term Frequency-Inverse Document Frequency (TF-IDF) weighting with term-frequency saturation and document-length normalization. It suits code retrieval because it rewards exact matches on keywords such as function names and variable names.

When a user submits a query, the system tokenizes the query, then retrieves the top-k most relevant chunks from the BM25 index and maps them back to the character offset range (first_character_index, last_character_index) of the original file.
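To make the retrieval step concrete, here is a small hand-rolled BM25 scorer in plain Python. It is an illustrative sketch, not the bm25s API and not the project's code: TinyBM25, tokenize, retrieve, and the chunk dictionary layout are hypothetical, the IDF variant is the Lucene-style one, and last_character_index is assumed to be an exclusive end offset.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep identifier-like tokens so names such as
    # `add_request` survive as a single term.
    return re.findall(r"[a-z0-9_]+", text.lower())


class TinyBM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        total = sum(self.lens)
        self.avgdl = total / len(docs) if total else 1.0
        df = Counter(t for tf in self.tfs for t in tf)
        n = len(docs)
        # Lucene-style IDF, which stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query: str, i: int) -> float:
        tf = self.tfs[i]
        norm = 1 - self.b + self.b * self.lens[i] / self.avgdl
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
            for t in tokenize(query)
            if t in tf
        )

    def top_k(self, query: str, k: int) -> list[int]:
        order = sorted(range(len(self.tfs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def retrieve(query: str, chunks: list[dict], k: int = 2) -> list[dict]:
    # Each chunk is assumed to look like {"path", "text", "start_index"}.
    index = TinyBM25([c["text"] for c in chunks])
    hits = []
    for i in index.top_k(query, k):
        first = chunks[i]["start_index"]
        hits.append({
            "path": chunks[i]["path"],
            "first_character_index": first,
            "last_character_index": first + len(chunks[i]["text"]),  # exclusive (assumed)
        })
    return hits
```

A query that contains an exact identifier, such as "where is add_request defined", ranks the chunk that defines add_request first, because no other chunk shares any query term.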

7

Section 07

3. Generator

The generator uses a locally deployed Qwen/Qwen2.5-0.5B-Instruct model (approximately 0.5 billion parameters). It takes the hits returned by the retriever, reads an extended window around each chunk directly from the source file, builds a prompt containing the real code snippets, and generates a natural-language answer grounded in that evidence.
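The window-reading and prompt-building step might look like the sketch below. It assumes hits carry path, first_character_index, and last_character_index (treated as an exclusive end), and it takes file contents from an in-memory dictionary instead of disk so the example stays self-contained. The function names, the margin value, and the prompt wording are all hypothetical, and the call into the Qwen model itself is left out.

```python
def read_window(source: str, first: int, last: int, margin: int = 200) -> str:
    # Widen the retrieved range by `margin` characters on each side,
    # clamped to the bounds of the file.
    lo = max(0, first - margin)
    hi = min(len(source), last + margin)
    return source[lo:hi]


def build_prompt(question: str, hits: list[dict], sources: dict[str, str], margin: int = 200) -> str:
    # `sources` maps a file path to its full text (read from disk in a real pipeline).
    blocks = []
    for h in hits:
        first, last = h["first_character_index"], h["last_character_index"]
        snippet = read_window(sources[h["path"]], first, last, margin)
        # Tag each snippet with its origin so the answer can cite it.
        blocks.append(f"[{h['path']}:{first}-{last}]\n{snippet}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the code context below. "
        "Cite the source tags you rely on.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Tagging every snippet with its path and character range is one simple way to let a small model point back at the exact source location it relied on.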


8

Section 08

Efficient Implementation of BM25s

The project chose the bm25s library over more traditional BM25 implementations because it is optimized specifically for BM25: it is written in pure Python on top of NumPy and SciPy sparse matrices and precomputes term scores at indexing time, so retrieval on a large codebase such as vLLM stays fast without excessive memory overhead.