# VaultRAG: A Hybrid RAG System for Obsidian Notes Combining Vector Retrieval and Knowledge Graph

> A hybrid RAG system designed specifically for Obsidian note libraries, integrating vector retrieval and knowledge graph technologies. It supports multi-format document processing, intelligent chunking, multi-model switching, and knowledge graph-based query expansion, providing powerful AI Q&A capabilities for personal knowledge management.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-13T22:44:40.000Z
- Last activity: 2026-06-13T22:52:45.761Z
- Popularity: 150.9
- Keywords: RAG, Obsidian, knowledge graph, vector retrieval, Flask, knowledge management, LLM, Python
- Page link: https://www.zingnex.cn/en/forum/thread/vaultrag-obsidianrag
- Canonical: https://www.zingnex.cn/forum/thread/vaultrag-obsidianrag

---

## Introduction: VaultRAG, a Hybrid RAG System for Obsidian Notes

VaultRAG is a hybrid Retrieval-Augmented Generation (RAG) system designed specifically for Obsidian note libraries. It integrates vector retrieval and knowledge graph technologies to provide powerful AI Q&A capabilities for personal knowledge management.

**Basic Information**:
- Original author/maintainer: faielli
- Source platform: GitHub
- Release date: June 13, 2026
- Project link: [Python-RAG-vault](https://github.com/faielli/Python-RAG-vault)

**Core Features**: Supports multi-format document processing, intelligent chunking, multi-model switching, incremental indexing, and knowledge graph-based query expansion.

## Project Background and Positioning

VaultRAG targets Obsidian users (researchers, students, knowledge workers) who manage large volumes of notes, literature, and learning materials, and turns a static note library into an interactive knowledge base. As a hybrid RAG system, it combines vector retrieval with a knowledge graph to overcome the limitations of pure vector retrieval in complex relational reasoning.

## Core Architecture and Hybrid Retrieval Mechanism

#### Modular Architecture
The system uses a dependency injection pattern to decouple components. The core modules are divided as follows:

| Module | Responsibility |
|------|------|
| `app.py` | Flask entry point, responsible for configuration, routing, and frontend services |
| `rag_core.py` | Core logic: text extraction, chunking, embedding, ChromaDB management, knowledge graph construction, LLM calls |
| `upload_handler.py` | Flask blueprint for temporary file RAG processing (no persistence) |
| `model_switcher.py` | Runtime model switching (no need to restart the application) |
| `frontend.html` | Single-page application frontend interface |
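
The dependency injection idea behind the module table can be illustrated with a minimal sketch. It is not the project's actual code: the class and function names (`RagCore`, `UploadHandler`, `build_app`) and the callable signatures are hypothetical, and Flask is deliberately left out so the wiring itself stays visible.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases; the real signatures in rag_core.py may differ.
EmbedFn = Callable[[List[str]], List[List[float]]]
LlmFn = Callable[[str], str]


@dataclass
class RagCore:
    """Core logic that receives its collaborators instead of creating them."""
    embed: EmbedFn
    llm: LlmFn

    def answer(self, question: str, context: List[str]) -> str:
        # Build a simple prompt from the supplied context and delegate to the LLM.
        prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
        return self.llm(prompt)


@dataclass
class UploadHandler:
    """Stands in for the upload blueprint: it is handed a core, not a global."""
    core: RagCore

    def ask_about_text(self, text: str, question: str) -> str:
        # Temporary-file flow: the text is used as context and never persisted.
        return self.core.answer(question, [text])


def build_app(embed: EmbedFn, llm: LlmFn) -> UploadHandler:
    """Composition root, the role app.py plays in the table above."""
    core = RagCore(embed=embed, llm=llm)
    return UploadHandler(core=core)
```

Because the embedding and LLM callables are passed in, a test or a runtime model switch can swap them without touching the handler code.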

#### Hybrid Retrieval Strategy
- **Vector Retrieval Layer**: Uses the `all-MiniLM-L6-v2` embedding model by default (the code can be switched to `flax-sentence-embeddings/st-codesearch-distilroberta-base`). Documents are split into 500-character chunks with a 50-character overlap.
- **Knowledge Graph Layer**:
  1. Samples 3 chunks from each document and extracts up to 15 triples (`subject | relation | object`) via the LLM;
  2. Supports incremental construction (only new files are processed);
  3. Expands queries: tokenize the query → calculate node overlap scores → select Top-N seeds → expand to 1-hop neighbors → collect related source files and relational text.
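
The two layers above can be sketched in plain Python. This is a rough illustration under stated assumptions, not VaultRAG's implementation: the function names, the edge layout `(subject, relation, object, source_file)`, the token-overlap scoring, and the default `top_n=3` are all hypothetical. Only the 500/50 chunking parameters, the 15-triple cap, and the `subject | relation | object` line format come from the description.

```python
import re
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Edge = Tuple[str, str, str, str]  # subject, relation, object, source file (assumed layout)


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def parse_triples(llm_output: str, limit: int = 15) -> List[Triple]:
    """Parse 'subject | relation | object' lines, keeping at most `limit`."""
    triples: List[Triple] = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
        if len(triples) == limit:
            break
    return triples


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def expand_query(query: str, edges: List[Edge], top_n: int = 3) -> Tuple[List[str], List[str]]:
    """Return (source files, relation sentences) reachable within one hop of the seeds."""
    query_tokens = _tokens(query)
    nodes = {s for s, _, _, _ in edges} | {o for _, _, o, _ in edges}
    # Score each node by how many tokens it shares with the query.
    scores = {n: len(query_tokens & _tokens(n)) for n in nodes}
    ranked = sorted(nodes, key=lambda n: (-scores[n], n))
    seeds = {n for n in ranked[:top_n] if scores[n] > 0}
    files: Set[str] = set()
    relations: List[str] = []
    # Any edge touching a seed is a 1-hop expansion of that seed.
    for s, r, o, source in edges:
        if s in seeds or o in seeds:
            files.add(source)
            relations.append(f"{s} {r} {o}")
    return sorted(files), relations
```

The collected files could then be used to widen the vector search, and the relation sentences appended to the prompt as extra context; how VaultRAG actually merges the two result sets is not stated in the description.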

## Multi-format Support and Intelligent Features

#### Multi-format Document Processing
| Format | Processing Method |
|------|----------|
| Markdown, TXT | Direct reading |
| PDF | Text extracted with PyMuPDF; falls back to Tesseract OCR (200 DPI) for scanned documents |
| DOCX | Parsed with python-docx |
| EPUB | HTML content extracted with ebooklib + BeautifulSoup |
| ODT, ODS | Parsed with odfpy |
| HTML, HTM | Plain text extracted with BeautifulSoup |

*Note: OCR supports mixed Italian-English documents (`ita+eng` language configuration).*
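
A dispatch-by-extension reader is one way to organize the format table above. The sketch below covers only a few of the listed formats and makes several assumptions: the function names are hypothetical, third-party libraries are imported lazily so that plain-text formats work without them, and the `min_chars` heuristic for deciding that a PDF page is scanned is an illustrative guess rather than the project's rule.

```python
from pathlib import Path
from typing import Callable, Dict


def read_plain(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


def read_html(path: Path) -> str:
    from bs4 import BeautifulSoup  # lazy import: only needed for HTML files
    return BeautifulSoup(read_plain(path), "html.parser").get_text(separator="\n")


def read_docx(path: Path) -> str:
    import docx  # provided by the python-docx package
    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def read_pdf(path: Path, min_chars: int = 20) -> str:
    import fitz  # PyMuPDF
    pages = []
    with fitz.open(str(path)) as doc:
        for page in doc:
            text = page.get_text()
            if len(text.strip()) < min_chars:
                # Assumed heuristic: almost no text layer means a scanned page.
                import pytesseract
                from PIL import Image
                pix = page.get_pixmap(dpi=200)
                image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = pytesseract.image_to_string(image, lang="ita+eng")
            pages.append(text)
    return "\n".join(pages)


EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".md": read_plain,
    ".txt": read_plain,
    ".html": read_html,
    ".htm": read_html,
    ".docx": read_docx,
    ".pdf": read_pdf,
}


def extract_text(path: Path) -> str:
    """Pick an extractor by file extension; unknown extensions are rejected."""
    try:
        extractor = EXTRACTORS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported format: {path.suffix}") from None
    return extractor(path)
```

EPUB, ODT, and ODS readers would slot into `EXTRACTORS` the same way; they are omitted here to keep the sketch short.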

#### Intelligent Features
- **Incremental Indexing**: Skips unmodified files via `{path: mtime}` mapping;
- **Duplicate Detection**: Identifies duplicate content with a cosine similarity threshold of `dup_threshold=0.97`;
- **Conversation History**: Retains the last 20 rounds and automatically saves as Markdown with YAML frontmatter to `_chat/`;
- **Discipline Filtering**: Filters by discipline/folder, falls back to global search if no results are found.
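
Two of the features above, incremental indexing and duplicate detection, reduce to small checks. The following is a minimal sketch assuming an in-memory `{path: mtime}` dictionary and plain Python lists as embedding vectors; the function names are hypothetical, and whether the project compares with `>=` or `>` against `dup_threshold` is not stated.

```python
import math
import os
from pathlib import Path
from typing import Dict, List, Sequence


def changed_files(paths: Sequence[Path], state: Dict[str, float]) -> List[Path]:
    """Return files whose mtime differs from the recorded one and update the record."""
    changed = []
    for path in paths:
        mtime = os.path.getmtime(path)
        if state.get(str(path)) != mtime:
            changed.append(path)
            # A real indexer would likely record this only after indexing succeeds.
            state[str(path)] = mtime
    return changed


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_duplicate(vec: Sequence[float], existing: Sequence[Sequence[float]],
                 dup_threshold: float = 0.97) -> bool:
    """True when any stored vector is at least `dup_threshold` similar to `vec`."""
    return any(cosine(vec, other) >= dup_threshold for other in existing)
```

Persisting `state` between runs (for example as a JSON file next to the index) is what lets a later run skip unmodified files.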

## Key Technical Configuration Points

#### LLM Configuration
- Default model: `qwen-plus`
- API endpoint: OpenRouter (compatible with OpenAI API format)
- Max tokens: 8192
- Supports runtime model switching (no need to restart the service)
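
Runtime model switching can be as simple as keeping the model name in a shared, lock-protected object that every request reads. The sketch below is an assumption about how such a switcher might look, not the contents of `model_switcher.py`; the class and method names are hypothetical. The payload follows the OpenAI chat completions format, and actually sending it to OpenRouter is left out.

```python
import threading
from typing import Dict, List


class ModelSwitcher:
    """Holds the active model name so it can change without a restart."""

    def __init__(self, model: str = "qwen-plus", max_tokens: int = 8192) -> None:
        self._lock = threading.Lock()
        self._model = model
        self.max_tokens = max_tokens

    @property
    def model(self) -> str:
        with self._lock:
            return self._model

    def switch(self, model: str) -> None:
        # Requests built after this call use the new model.
        with self._lock:
            self._model = model

    def build_request(self, messages: List[Dict[str, str]]) -> Dict[str, object]:
        """Build an OpenAI-style chat completions payload for the active model."""
        return {
            "model": self.model,
            "messages": messages,
            "max_tokens": self.max_tokens,
        }
```

A route handler would call `switch()` when the user picks another model, and each Q&A request would call `build_request()` so the change takes effect immediately.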

#### Embedding Model Recommendation
For vaults where Italian text is dominant, `multilingual-e5-large` is recommended over the default `all-MiniLM-L6-v2` to improve multilingual semantic understanding.
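
One practical detail when swapping in an E5 model: the E5 family expects inputs prefixed with `query: ` or `passage: `. The sketch below shows that step with the `sentence-transformers` library. The helper names are hypothetical, the Hugging Face id `intfloat/multilingual-e5-large` is assumed to be the model meant above, and whether VaultRAG adds these prefixes itself is not stated.

```python
from typing import List


def add_e5_prefix(texts: List[str], model_name: str, is_query: bool) -> List[str]:
    """Add the prefixes E5 models expect; other models get the texts unchanged."""
    if "e5" not in model_name.lower():
        return list(texts)
    prefix = "query: " if is_query else "passage: "
    return [prefix + t for t in texts]


def embed(texts: List[str], model_name: str = "intfloat/multilingual-e5-large",
          is_query: bool = False):
    """Encode texts with sentence-transformers (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy: heavy dependency
    model = SentenceTransformer(model_name)
    prepared = add_e5_prefix(texts, model_name, is_query)
    return model.encode(prepared, normalize_embeddings=True)
```

Changing the embedding model also changes the vector space and dimensionality, so an existing ChromaDB collection would need to be re-indexed after the switch.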

## Use Cases and Value Proposition

VaultRAG is suitable for the following scenarios:
1. **Academic Research**: Quickly locate relevant concepts and citations in literature notes;
2. **Course Learning**: Integrate courseware, textbooks, and notes to build a personal learning assistant;
3. **Project Knowledge Management**: Unified retrieval of technical documents and code notes;
4. **Writing Assistance**: Create content based on existing materials, ensuring accurate citations.

## Summary and Insights

VaultRAG illustrates a typical pattern for RAG applications in personal knowledge management:
- **Hybrid architecture** is key to improving retrieval quality, compensating for the lack of relational reasoning in pure vector retrieval;
- **Incremental processing** and **duplicate detection** are essential capabilities for practical systems;
- **Multi-format support** lowers the threshold for building knowledge bases;
- **Modular design** facilitates maintenance and expansion.

For users who want to add AI capabilities to their Obsidian note libraries, VaultRAG is a full-featured reference implementation with a clear architecture.
