# Python RAG Vault: A Hybrid Retrieval-Augmented Generation System for Obsidian Note Libraries

> A hybrid RAG system that combines a vector database with a knowledge graph, designed for Obsidian note libraries. It supports multi-format document indexing, overlapping chunking, semantic search, and conversation history management.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-13T22:44:40.000Z
- Last activity: 2026-06-13T22:49:15.194Z
- Popularity: 159.9
- Keywords: RAG, Obsidian, knowledge graph, vector database, ChromaDB, LLM, knowledge base Q&A, Python
- Page link: https://www.zingnex.cn/en/forum/thread/python-rag-vault-obsidian
- Canonical: https://www.zingnex.cn/forum/thread/python-rag-vault-obsidian
- Markdown source: floors_fallback

---

## Original Author and Source

- **Original Author/Maintainer:** faielli
- **Source Platform:** GitHub
- **Original Title:** Python-RAG-vault
- **Original Link:** https://github.com/faielli/Python-RAG-vault
- **Release/Update Date:** 2026-06-13

## Project Overview

Python-RAG-vault is a hybrid Retrieval-Augmented Generation (RAG) system designed for Obsidian note libraries. Unlike setups that rely on vector retrieval alone, it combines a vector database with a knowledge graph, so answers can draw on both semantic similarity and explicit relationships between entities.

The system targets learning and knowledge management. Whether the source material is class notes, professional books, or training materials, users ask questions in natural language, and the system retrieves relevant passages from the local knowledge base and generates an answer grounded in them.

## Modular Architecture

The project adopts a clear modular design and manages components through a dependency injection pattern:

- **app.py**: Flask application entry point, responsible for routing configuration and frontend services
- **rag_core.py**: Core logic module, including text extraction, chunking, embedding, vector storage, and LLM calls
- **upload_handler.py**: Blueprint for temporary file RAG processing, supporting instant upload and query
- **model_switcher.py**: Runtime model switching without restarting the service
- **frontend.html**: Single-page web interface

This design lets each component be tested and maintained independently, and makes it easier to add features later.
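
To illustrate the dependency injection and runtime model switching described above, here is a minimal sketch using only the standard library. The names `ModelSwitcher`, `RagService`, `retrieve`, and `generate` are hypothetical and are not taken from the project's source; the real modules are Flask-based and may be organized differently.

```python
import threading
from dataclasses import dataclass
from typing import Callable


class ModelSwitcher:
    """Holds the active model name; switch() takes effect on the next request."""

    def __init__(self, initial: str) -> None:
        self._lock = threading.Lock()
        self._current = initial

    @property
    def current(self) -> str:
        with self._lock:
            return self._current

    def switch(self, name: str) -> None:
        with self._lock:
            self._current = name


@dataclass
class RagService:
    """Receives its collaborators from the caller instead of importing them."""

    retrieve: Callable[[str], list[str]]   # question -> context passages
    generate: Callable[[str, str], str]    # (model_name, prompt) -> answer
    models: ModelSwitcher

    def answer(self, question: str) -> str:
        context = "\n".join(self.retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        # The model name is read per request, so no restart is needed.
        return self.generate(self.models.current, prompt)
```

Because the retriever and generator are passed in, a test can substitute simple lambdas for the vector store and the LLM, which is the practical benefit of this pattern.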

## Hybrid Retrieval Strategy

The system's main distinguishing feature is its hybrid retrieval mechanism. In addition to vector similarity search with ChromaDB, it builds a knowledge graph that captures relationships between entities across documents:

**Vector Retrieval Part:**
- Uses the all-MiniLM-L6-v2 model to generate 384-dimensional text embeddings
- Supports the code-specific embedding model flax-sentence-embeddings/st-codesearch-distilroberta-base
- Retrieves the 2 most similar text chunks by default
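
The top-k lookup can be sketched in plain Python as follows. This is an illustration only: the project delegates the search to ChromaDB, the toy 3-dimensional vectors stand in for the 384-dimensional embeddings, and cosine similarity is an assumption (the metric actually used depends on how the ChromaDB collection is configured). The function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float],
          chunks: list[tuple[str, list[float]]],
          k: int = 2) -> list[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]


# Toy data: (chunk text, embedding) pairs.
chunks = [
    ("a", [1.0, 0.0, 0.0]),
    ("b", [0.0, 1.0, 0.0]),
    ("c", [0.9, 0.1, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], chunks))  # ['a', 'c']
```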

**Knowledge Graph Part:**
- Extracts "subject-relation-object" triples from documents via LLM
- Builds a directed graph to represent associations between entities
- Expands to one-hop neighbors of relevant entities during queries
- Returns associated source files and relational text
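
A minimal sketch of the graph side, assuming the triples have already been extracted by the LLM and each one carries the file it came from. The function names are hypothetical, and treating both incoming and outgoing edges as "one hop" is an assumption; the project may follow only one direction.

```python
from collections import defaultdict

# (subject, relation, object, source file)
Triple = tuple[str, str, str, str]


def build_graph(triples: list[Triple]):
    """Index directed edges by subject (outgoing) and by object (incoming)."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for subj, rel, obj, source in triples:
        edge = (subj, rel, obj, source)
        outgoing[subj].append(edge)
        incoming[obj].append(edge)
    return outgoing, incoming


def expand_one_hop(graph, entities: list[str]):
    """Collect relation text and source files for edges touching the entities."""
    outgoing, incoming = graph
    facts: list[str] = []
    sources: set[str] = set()
    for entity in entities:
        for subj, rel, obj, source in outgoing.get(entity, []) + incoming.get(entity, []):
            fact = f"{subj} {rel} {obj}"
            if fact not in facts:
                facts.append(fact)
            sources.add(source)
    return facts, sorted(sources)
```

For a query that mentions the entity `embeddings`, this returns every stored relation in which `embeddings` is the subject or the object, along with the notes those relations were extracted from.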

The results of both retrieval paths are merged and passed to the LLM as context, so the answer can draw on semantically similar passages as well as structured relationships from the graph.
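
One simple way to fuse the two result sets is to label each item by origin, drop duplicates, and place everything in a single prompt. The sketch below assumes that approach; the function names and the prompt wording are hypothetical, and the project's actual fusion logic may weight or order results differently.

```python
def fuse_context(chunks: list[str], facts: list[str]) -> str:
    """Merge vector passages and graph facts, skipping blanks and duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for label, items in (("Passage", chunks), ("Graph fact", facts)):
        for item in items:
            key = item.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(f"[{label}] {item.strip()}")
    return "\n".join(merged)


def build_prompt(question: str, chunks: list[str], facts: list[str]) -> str:
    """Assemble the text sent to the LLM from the fused context."""
    context = fuse_context(chunks, facts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```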

## Multi-Format Support

The system supports automatic parsing of multiple document formats:

| Format | Processing Method |
|--------|-------------------|
| Markdown / TXT | Read directly |
| PDF | PyMuPDF, with Tesseract OCR fallback |
| DOCX | python-docx |
| EPUB | ebooklib + BeautifulSoup |
| ODT / ODS | odfpy |
| HTML / HTM | BeautifulSoup (main content extraction) |

For scanned PDFs, the system automatically falls back to Tesseract OCR for text recognition, with OCR configured for both Italian and English.
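
The format dispatch and the OCR fallback decision can be sketched as below. The mapping mirrors the table above; the function names and the `min_chars` threshold are assumptions for illustration, and the OCR step is passed in as a callable rather than calling Tesseract directly.

```python
from pathlib import Path
from typing import Callable

# File suffix -> processing method, copied from the table above.
METHODS = {
    ".md": "direct read",
    ".txt": "direct read",
    ".pdf": "PyMuPDF + Tesseract OCR fallback",
    ".docx": "python-docx",
    ".epub": "ebooklib + BeautifulSoup",
    ".odt": "odfpy",
    ".ods": "odfpy",
    ".html": "BeautifulSoup",
    ".htm": "BeautifulSoup",
}


def method_for(path: str) -> str:
    """Pick the processing method from the file suffix (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in METHODS:
        raise ValueError(f"unsupported format: {suffix or path}")
    return METHODS[suffix]


def page_text_or_ocr(embedded_text: str,
                     run_ocr: Callable[[], str],
                     min_chars: int = 20) -> str:
    """Keep the PDF's embedded text if it has enough content, else run OCR.

    min_chars is a hypothetical threshold: a scanned page usually yields
    little or no embedded text, which is the signal to fall back to OCR.
    """
    if len(embedded_text.strip()) >= min_chars:
        return embedded_text
    return run_ocr()
```

Passing `run_ocr` as a callable means the expensive OCR call only happens for pages that actually need it.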

## Intelligent Chunking Strategy

Document chunking uses a sliding window mechanism:
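- Default chunk size: 500 characters
- Overlap: 50 characters between consecutive chunks
- The overlap keeps text near a chunk boundary present in both neighboring chunks, so a sentence split at the boundary is less likely to lose its context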
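
The sliding window described above can be sketched as follows, using the stated defaults of 500 and 50 characters. The function name `chunk_text` is hypothetical, and the handling of the final short chunk is an assumption.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts 450 characters after the last
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text)
print([len(c) for c in chunks])  # [500, 500, 300]
```

With these defaults, the last 50 characters of each chunk are repeated as the first 50 characters of the next one.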

## Incremental Indexing

The system keeps a mapping from each file to its last indexed modification time and uses it for incremental updates. Only new files and files whose modification time has changed are re-indexed, so repeated indexing runs skip unchanged notes.
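
The modification-time check can be sketched as below. The function names are hypothetical, and how the project stores the mapping between runs (for example in a JSON file) is not specified here.

```python
import os


def scan_mtimes(root: str) -> dict[str, float]:
    """Walk the vault and record each file's current modification time."""
    mtimes: dict[str, float] = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            mtimes[path] = os.path.getmtime(path)
    return mtimes


def files_to_index(current: dict[str, float],
                   known: dict[str, float]) -> list[str]:
    """Return new files and files whose modification time differs."""
    return sorted(p for p, mtime in current.items() if known.get(p) != mtime)


known = {"a.md": 1.0, "b.md": 2.0}
current = {"a.md": 1.0, "b.md": 3.0, "c.md": 5.0}
print(files_to_index(current, known))  # ['b.md', 'c.md']
```

After a successful run, the caller would save `current` as the new `known` mapping so that the next run compares against it.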
