Python RAG Vault: A Hybrid Retrieval-Augmented Generation System for Obsidian Note Libraries

A hybrid RAG system combining vector databases and knowledge graphs, designed specifically for Obsidian note libraries, supporting multi-format document indexing, intelligent chunking, semantic search, and conversation history management

RAG, Obsidian, Knowledge Graph, Vector Database, ChromaDB, LLM, Knowledge Base Q&A, Python

Published 2026-06-14 06:44 | Recent activity 2026-06-14 06:49 | Estimated read 6 min

Section 01

Introduction / Main Post: Python RAG Vault: A Hybrid Retrieval-Augmented Generation System for Obsidian Note Libraries

Section 02

Original Author and Source

Original Author/Maintainer: faielli
Source Platform: GitHub
Original Title: Python-RAG-vault
Original Link: https://github.com/faielli/Python-RAG-vault
Release/Update Date: 2026-06-13

Section 03

Project Overview

Python-RAG-vault is a hybrid Retrieval-Augmented Generation (RAG) system designed specifically for Obsidian note libraries. Where many RAG setups rely on vector retrieval alone, this project combines two approaches, a vector database and a knowledge graph, so that document Q&A can draw on both semantic similarity and explicit relationships between entities.

The system targets learning and knowledge management. Whether the material is class notes, professional books, or training documents, users ask questions in natural language, and the system retrieves relevant information from the local knowledge base and generates an answer grounded in it.

Section 04

Modular Architecture

The project adopts a clear modular design and manages components through a dependency injection pattern:

app.py: Flask application entry point, responsible for routing configuration and frontend services
rag_core.py: Core logic module, including text extraction, chunking, embedding, vector storage, and LLM calls
upload_handler.py: Blueprint for temporary file RAG processing, supporting instant upload and query
model_switcher.py: Runtime model switching without restarting the service
frontend.html: Single-page web interface

This design allows each component to be tested and maintained independently, and makes it easier to add features later.
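
The dependency injection idea can be illustrated with a minimal sketch. The names below (`RagService`, `fake_embed`, `fake_generate`) are hypothetical and do not come from the project's code; the point is only that a component receives its collaborators instead of importing them directly.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration; not taken from the project.
Embedder = Callable[[str], List[float]]
Generator = Callable[[str], str]


@dataclass
class RagService:
    """Holds its collaborators as fields instead of importing them directly."""
    embed: Embedder
    generate: Generator

    def answer(self, question: str) -> str:
        vector = self.embed(question)      # injected embedding function
        prompt = f"dims={len(vector)} question={question}"
        return self.generate(prompt)       # injected LLM call


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding model, handy in unit tests.
    return [float(len(text)), 0.0, 0.0]


def fake_generate(prompt: str) -> str:
    # Stand-in for a real LLM client.
    return f"echo: {prompt}"


service = RagService(embed=fake_embed, generate=fake_generate)
reply = service.answer("What is RAG?")

# Replacing the injected callable changes behavior without rebuilding the service.
service.generate = lambda prompt: "answer from another model"
switched_reply = service.answer("What is RAG?")
```

Swapping an injected callable at runtime, as in the last lines, is one plausible way a model switcher could avoid a service restart; whether model_switcher.py works this way is not stated in the source.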

Section 05

Hybrid Retrieval Strategy

The system's main feature is its hybrid retrieval mechanism. In addition to using ChromaDB for vector similarity search, it builds a knowledge graph that captures relationships between the entities mentioned across documents:

Vector Retrieval Part:

Uses the all-MiniLM-L6-v2 model to generate 384-dimensional text embeddings
Supports the code-specific embedding model flax-sentence-embeddings/st-codesearch-distilroberta-base
By default retrieves the top 2 most similar text chunks
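
The top-2 retrieval step can be sketched in plain Python. This is an illustration under assumptions: it uses tiny hand-written 3-dimensional vectors in place of real 384-dimensional embeddings, and it uses cosine similarity, whereas the source does not say which distance metric the ChromaDB collection is configured with. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: List[Sequence[float]],
    chunks: List[str],
    k: int = 2,  # the article states a default of 2 retrieved chunks
) -> List[Tuple[str, float]]:
    """Return the k chunks whose vectors are most similar to the query."""
    scored = [
        (chunk, cosine_similarity(query_vec, vec))
        for chunk, vec in zip(chunks, chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data standing in for embedded note chunks.
chunks = ["notes on graphs", "notes on cooking", "notes on vectors"]
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
results = top_k([1.0, 0.0, 0.0], chunk_vecs, chunks)
```

In the real system the vector store performs this ranking; the sketch only shows what "top 2 most similar" means.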

Knowledge Graph Part:

Extracts "subject-relation-object" triples from documents via LLM
Builds a directed graph to represent associations between entities
Expands to one-hop neighbors of relevant entities during queries
Returns associated source files and relational text
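
The graph side can be sketched as follows, assuming the triples have already been extracted by the LLM. Two choices here are assumptions rather than facts from the source: each triple carries the name of the file it came from, and one-hop expansion follows edges in both directions. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (subject, relation, object, source_file); the source-file field is an assumption.
Triple = Tuple[str, str, str, str]
Edges = Dict[str, List[Tuple[str, str, str]]]


def build_graph(triples: List[Triple]) -> Tuple[Edges, Edges]:
    """Index a directed graph by outgoing and incoming edges."""
    out_edges: Edges = defaultdict(list)
    in_edges: Edges = defaultdict(list)
    for subject, relation, obj, source in triples:
        out_edges[subject].append((relation, obj, source))
        in_edges[obj].append((relation, subject, source))
    return out_edges, in_edges


def expand_one_hop(
    out_edges: Edges, in_edges: Edges, seeds: List[str]
) -> Tuple[List[str], Set[str]]:
    """Collect relational text and source files one hop away from the seeds."""
    facts: List[str] = []
    sources: Set[str] = set()
    for entity in seeds:
        for relation, obj, source in out_edges.get(entity, []):
            facts.append(f"{entity} {relation} {obj}")
            sources.add(source)
        for relation, subject, source in in_edges.get(entity, []):
            facts.append(f"{subject} {relation} {entity}")
            sources.add(source)
    return facts, sources


triples = [
    ("ChromaDB", "stores", "embeddings", "vectors.md"),
    ("Flask", "serves", "frontend", "web.md"),
    ("embeddings", "produced_by", "all-MiniLM-L6-v2", "models.md"),
]
out_edges, in_edges = build_graph(triples)
facts, sources = expand_one_hop(out_edges, in_edges, ["embeddings"])
```

Here the seed entity "embeddings" pulls in both the fact that points to it and the fact that starts from it, while the unrelated Flask triple is left out.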

The results of both retrieval paths are merged and passed to the LLM as context, so that answers can draw on semantically similar passages as well as structured relationships.
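
How the merge might look is sketched below. The prompt wording and the `build_prompt` name are assumptions for illustration; the source does not show the project's actual prompt template.

```python
from typing import List


def build_prompt(question: str, chunks: List[str], facts: List[str]) -> str:
    """Merge vector-search chunks and graph facts into one LLM prompt."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique_chunks = list(dict.fromkeys(chunks))
    unique_facts = list(dict.fromkeys(facts))

    lines = ["Answer using only the context below.", "", "Text passages:"]
    lines += [f"- {chunk}" for chunk in unique_chunks]
    lines += ["", "Known relations:"]
    lines += [f"- {fact}" for fact in unique_facts]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


prompt = build_prompt(
    "What stores embeddings?",
    ["ChromaDB keeps vectors on disk.", "ChromaDB keeps vectors on disk."],
    ["ChromaDB stores embeddings"],
)
```

Keeping passages and relations in separate labeled sections is one simple design; it lets the model tell free text apart from extracted facts.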

Section 06

Multi-Format Support

The system supports automatic parsing of multiple document formats:

Format	Processing Method
Markdown / TXT	Direct reading
PDF	PyMuPDF + Tesseract OCR fallback
DOCX	Parsed via python-docx library
EPUB	ebooklib + BeautifulSoup
ODT / ODS	Processed via odfpy library
HTML / HTM	Main content extracted with BeautifulSoup

For scanned PDFs, which have no embedded text layer, the system falls back to Tesseract OCR for text recognition. OCR is configured for two languages, Italian and English.
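
The two ideas in this section, dispatching on file extension and falling back to OCR, can be sketched with the standard library alone. Only the plain-text formats are wired up; the third-party parsers are left out so the example runs anywhere. The function names and the `min_chars` threshold are assumptions, not details from the project.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def read_plain_text(path: Path) -> str:
    """Markdown and TXT need no parsing: read them directly."""
    return path.read_text(encoding="utf-8", errors="replace")


def pdf_text_with_ocr_fallback(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = 20,  # hypothetical threshold for "no usable text layer"
) -> str:
    """Prefer the PDF's embedded text; call OCR only when it is nearly empty."""
    if len(embedded_text.strip()) >= min_chars:
        return embedded_text
    return run_ocr()


# Extension -> extractor. Real handlers for PDF, DOCX, EPUB, ODT and HTML
# would be registered here in the same way.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".md": read_plain_text,
    ".txt": read_plain_text,
}


def extract_text(path: Path) -> str:
    handler = EXTRACTORS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported format: {path.suffix}")
    return handler(path)


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "Lecture.MD"
    note.write_text("# Lecture 1\nRAG basics", encoding="utf-8")
    note_text = extract_text(note)

scanned = pdf_text_with_ocr_fallback("  ", run_ocr=lambda: "text from OCR")
digital = pdf_text_with_ocr_fallback("A page with a real text layer.", run_ocr=lambda: "unused")
```

With Tesseract, a combined Italian and English setup is normally expressed as the language string `ita+eng`; how the project passes that setting is not shown in the source.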

Section 07

Intelligent Chunking Strategy

Document chunking uses a sliding window mechanism:

Default chunk size: 500 characters
Overlap area: 50 characters
The overlap means a sentence cut at a chunk boundary still appears with some of its context in the neighboring chunk
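
A minimal sketch of sliding-window chunking with the stated defaults follows. The function name is hypothetical, and the project may differ in details such as how it treats the final short chunk.

```python
from typing import List


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts 450 characters after the previous one
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
```

For a 1200-character input the windows start at offsets 0, 450 and 900, so the last 50 characters of one chunk are repeated as the first 50 of the next.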

Section 08

Incremental Indexing

The system maintains a mapping from each file to its last modification time and uses it for incremental updates. Only new files and files whose modification time has changed are re-indexed, so repeated indexing runs skip everything that is unchanged.
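
The modification-time check can be sketched as below. The function name and the in-memory dictionary are assumptions; the project presumably persists its mapping between runs, but the source does not say in what form.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def files_to_reindex(vault: Path, seen: Dict[str, float]) -> List[Path]:
    """Return files that are new or whose mtime differs from the recorded one."""
    changed: List[Path] = []
    for path in sorted(vault.rglob("*")):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime
        if seen.get(str(path)) != mtime:
            changed.append(path)
            seen[str(path)] = mtime  # remember the version we are about to index
    return changed


with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / "a.md").write_text("first note", encoding="utf-8")
    (vault / "b.md").write_text("second note", encoding="utf-8")

    seen: Dict[str, float] = {}
    first_run = [p.name for p in files_to_reindex(vault, seen)]
    second_run = [p.name for p in files_to_reindex(vault, seen)]

    # Bump a.md's modification time explicitly so the change is unambiguous.
    stat = (vault / "a.md").stat()
    os.utime(vault / "a.md", (stat.st_atime, stat.st_mtime + 10))
    third_run = [p.name for p in files_to_reindex(vault, seen)]
```

The first run indexes both files, the second finds nothing to do, and the third picks up only the touched file. A fuller version would also handle deleted files, which this sketch ignores.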
