CodeBase RAGBot adopts a classic three-layer RAG architecture that splits codebase understanding into two phases, offline indexing and online querying:
Layer 1: Data Ingestion and Processing
When a user enters a GitHub repository URL, the system first clones the repository locally using the GitPython library. The code files are then parsed and split at a sensible granularity: large files are divided into multiple semantically coherent code blocks, each carrying enough surrounding context while staying within the token limit of downstream processing.
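The exact splitting heuristic is not specified above, but the idea can be sketched with a minimal chunker: split at top-level definition boundaries, then greedily pack blocks under a token budget. The `def `/`class ` boundary rule, the whitespace token count, and the budget of 200 are illustrative assumptions, not the project's actual logic.

```python
import re


def chunk_source(source: str, max_tokens: int = 200) -> list[str]:
    """Split a source file into chunks that respect a rough token budget.

    Assumption: chunk boundaries fall at top-level ``def``/``class``
    lines, and "tokens" are approximated by whitespace-separated words.
    A production chunker would use a real tokenizer and a language-aware
    parser (e.g. tree-sitter) instead.
    """
    blocks: list[str] = []
    current: list[str] = []
    for line in source.splitlines():
        # Start a new block at each top-level def/class boundary.
        if re.match(r"^(def |class )", line) and current:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("\n".join(current))

    # Greedily pack consecutive blocks into chunks under the budget.
    chunks: list[str] = []
    buf: list[str] = []
    size = 0
    for block in blocks:
        n = len(block.split())
        if buf and size + n > max_tokens:
            chunks.append("\n".join(buf))
            buf, size = [], 0
        buf.append(block)
        size += n
    if buf:
        chunks.append("\n".join(buf))
    return chunks
```

With a tight budget, two small functions land in separate chunks; with a generous one, they share a single chunk, so each chunk stays semantically complete rather than cutting a function in half.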
Layer 2: Vectorization and Indexing
The split code blocks are converted into high-dimensional vector representations by a Sentence Transformers model, and the resulting vectors are stored in the Pinecone vector database to form an efficiently searchable semantic index. This design places semantically similar code close together in the vector space, even when variable names or implementations differ, which is what makes semantic retrieval possible.
Layer 3: Retrieval and Generation
When a user asks a question, the system first vectorizes the question and retrieves the most relevant code snippets from Pinecone. These snippets, together with the original question, are assembled into a carefully designed prompt and sent to the Llama 3.1 70B model hosted on the Groq platform. The model then generates an answer grounded in the retrieved code context, producing "evidence-based" code explanations.
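The prompt-assembly step might look like the following sketch. The template wording here is an illustrative assumption, not the project's actual prompt; what it shows is the standard RAG pattern of placing retrieved context first, the question after, and an instruction to answer only from that context.

```python
def build_prompt(question: str, snippets: list[str]) -> str:
    """Assemble retrieved snippets and the user's question into a prompt.

    Assumption: labels and wording are hypothetical; only the
    context-then-question structure reflects the described design.
    """
    context = "\n\n".join(
        f"[Snippet {i + 1}]\n{s}" for i, s in enumerate(snippets)
    )
    return (
        "You are a codebase assistant. Answer using ONLY the code context below.\n\n"
        f"### Code context\n{context}\n\n"
        f"### Question\n{question}\n\n"
        "If the context is insufficient, say so instead of guessing."
    )
```

The resulting string would then be sent as the user message in a chat-completion request to the Groq-hosted Llama 3.1 70B model; constraining the model to the supplied snippets is what keeps its explanations "evidence-based".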