The indexer converts the source codebase into searchable, structured data. It traverses the vllm-0.10.1 directory, reads every .py and .md file, and splits each file into chunks with RecursiveCharacterTextSplitter.
The chunking strategy depends on the file type:
- Python code files: use language-specific separators (e.g., \nclass , \ndef , \n\tdef ) so that chunk boundaries align with code structures (classes, functions), and set a 50% overlap so that definitions are not cut off at chunk boundaries
- Markdown documents: use the default hierarchical separators (paragraph → line → word → character) with a 10% overlap, which is sufficient for natural-language text
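The per-file-type settings above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names splitter_kwargs and iter_source_files and the CHUNK_SIZE value of 1000 are assumptions (the text does not state a chunk size). The returned dictionary uses keyword names that RecursiveCharacterTextSplitter accepts, so it could be passed as RecursiveCharacterTextSplitter(**kwargs).

```python
from pathlib import Path

# Hypothetical base chunk size; the description above does not state one.
CHUNK_SIZE = 1000

# Python-aware separators named above, followed by the generic fallbacks.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]
# Default hierarchy: paragraph -> line -> word -> character.
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def splitter_kwargs(path: Path, chunk_size: int = CHUNK_SIZE) -> dict:
    """Return splitter settings for one file, chosen by its extension."""
    if path.suffix == ".py":
        separators, overlap_ratio = PYTHON_SEPARATORS, 0.5
    elif path.suffix == ".md":
        separators, overlap_ratio = DEFAULT_SEPARATORS, 0.1
    else:
        raise ValueError(f"unsupported file type: {path.suffix}")
    return {
        "separators": separators,
        "chunk_size": chunk_size,
        "chunk_overlap": round(chunk_size * overlap_ratio),
        "add_start_index": True,
    }


def iter_source_files(root: Path):
    """Yield every .py and .md file under root, in sorted order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in {".py", ".md"}:
            yield path
```

Expressing the overlap as a ratio of the chunk size keeps the 50% and 10% figures valid if the chunk size is tuned later.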
Handling the two file types differently keeps the semantic units of code intact. In addition, add_start_index=True records the exact character offset of each chunk in its source file, which lays the foundation for later source references.
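As an illustration of how a recorded offset can become a source reference, the sketch below converts a character offset into a line and column. The helper name offset_to_line_col is hypothetical. With LangChain, the offset would typically be read from the chunk's metadata under the start_index key; here it is computed directly so that the example runs on its own.

```python
def offset_to_line_col(text: str, start_index: int) -> tuple[int, int]:
    """Convert a 0-based character offset into a 1-based (line, column)."""
    if not 0 <= start_index <= len(text):
        raise ValueError("start_index is outside the text")
    # Each newline before the offset moves the position down one line.
    line = text.count("\n", 0, start_index) + 1
    # rfind returns -1 when the offset is on the first line.
    last_newline = text.rfind("\n", 0, start_index)
    column = start_index - last_newline
    return line, column


source = "import os\n\nclass Engine:\n    pass\n"
start = source.index("class Engine")  # stands in for a chunk's start_index
print(offset_to_line_col(source, start))  # (3, 1)
```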