Zing Forum


Building a Production-Grade RAG System from Scratch: Practical Implementation of HNSW-Based Vector Retrieval and Multi-Stage Recall Pipeline

This article provides an in-depth analysis of a complete local-first RAG chatbot backend implementation, covering core mechanisms such as HNSW approximate nearest neighbor indexing, hybrid retrieval (dense vectors + BM25), cross-encoder re-ranking, and MMR diversity deduplication. It also offers detailed performance benchmark data and architectural design insights.

Tags: RAG, HNSW, vector retrieval, approximate nearest neighbor, hybrid retrieval, BM25, cross-encoder re-ranking, local LLM, Ollama
Published 2026-03-30 07:07 · Recent activity 2026-03-30 07:22 · Estimated read: 6 min

Section 01

Introduction: Core Practices for Building a Production-Grade Local-First RAG System from Scratch

This article introduces a complete local-first RAG chatbot backend project, covering core mechanisms like HNSW vector indexing, hybrid retrieval (dense vectors + BM25), cross-encoder re-ranking, and MMR diversity deduplication. It provides detailed performance benchmark data and architectural design insights, aiming to address the knowledge timeliness and hallucination issues in LLM applications. All embedding calculations and text generation are done locally to ensure user data privacy.


Section 02

Project Background: Pain Points of LLM Applications and Limitations of Existing Solutions

In LLM application development, RAG has become a standard solution to address knowledge timeliness and hallucination issues. However, most tutorials only cover simple vector similarity search and lack discussions on key production environment issues: How to achieve low-latency retrieval while ensuring recall rate? How to handle incremental updates of massive documents? How to balance semantic understanding and keyword matching? This project provides a complete solution to these problems.


Section 03

Core Methods and System Architecture

The system uses FastAPI to build a RESTful API backend, with the core design concept of "local-first". The architecture is divided into three layers:

  1. Document Ingestion Layer: Supports loading, cleaning, chunking (token/sentence mode, default 450-token window + 80-token overlap), and vectorization of multi-format documents;
  2. Vector Storage Layer: Isolates user-specific corpora, supports document version management and soft deletion;
  3. Retrieval Service Layer: Implements a five-stage pipeline: Multi-Query Rewriting → Hybrid Retrieval (HNSW dense + BM25 sparse, fusion formula: score = 0.65 × semantic + 0.35 × keyword) → MMR Diversity Deduplication → Cross-Encoder Re-ranking → LLM Generation (local Ollama call to Qwen2.5:7B-Instruct).
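The score-fusion step of the hybrid retrieval stage can be sketched as follows. This is a minimal illustration of the stated formula (score = 0.65 × semantic + 0.35 × keyword), not the project's actual code; the function and document names are hypothetical, and the min-max normalization step is an assumption needed to make dense-similarity and BM25 scores comparable.

```python
def fuse_scores(semantic, keyword, w_sem=0.65, w_kw=0.35):
    """Combine dense (semantic) and BM25 (keyword) scores per the article's
    fusion formula: score = 0.65 * semantic + 0.35 * keyword."""
    def normalize(scores):
        # Min-max normalize so the two score scales are comparable
        # (an assumption; the project may normalize differently).
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sem = normalize(semantic)
    kw = normalize(keyword)
    fused = {}
    # Union of candidates: a document found by only one retriever still scores.
    for doc in set(sem) | set(kw):
        fused[doc] = w_sem * sem.get(doc, 0.0) + w_kw * kw.get(doc, 0.0)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Hypothetical scores: doc_b ranks first because it is strong in both channels.
ranked = fuse_scores(
    semantic={"doc_a": 0.92, "doc_b": 0.75, "doc_c": 0.40},
    keyword={"doc_b": 12.1, "doc_c": 9.8, "doc_d": 3.2},
)
```

Note how the 0.65/0.35 split lets a document that is merely good in both channels outrank one that tops a single channel, which is the point of hybrid fusion.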

Section 04

Performance Benchmark: HNSW vs. Brute-force Search

The project provides benchmark scripts, tested on an HP EliteBook 840 G4 with different corpus sizes:

  • At N=25,000: HNSW reduces latency 13.8x compared to brute-force search (median: 0.656 ms vs 9.071 ms), with Recall@5 holding at 100%;
  • At N=50,000: the speed advantage is 13.07x, but recall drops to 60% (the ef_search parameter needs to be increased);
  • At N<500: brute-force search is faster, as HNSW's fixed overhead exceeds the cost of a linear scan.
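The two ingredients of this benchmark are the exact brute-force baseline and the Recall@k metric HNSW is scored against. A minimal NumPy sketch of both (the project's actual script is not shown here; function names and the synthetic data are illustrative):

```python
import numpy as np

def brute_force_topk(corpus, query, k=5):
    """Exact top-k neighbors by cosine similarity: the ground truth
    that an approximate index like HNSW is measured against."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = corpus_n @ q                      # one dot product per document
    return set(np.argsort(-sims)[:k].tolist())

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate index retrieved."""
    return len(approx_ids & exact_ids) / len(exact_ids)

# Synthetic corpus just to exercise the functions.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64)).astype(np.float32)
query = rng.normal(size=64).astype(np.float32)
exact = brute_force_topk(corpus, query, k=5)
# Comparing the exact result to itself yields recall 1.0 by construction;
# in the real benchmark, approx_ids would come from the HNSW index.
recall = recall_at_k(exact, exact)
```

The linear scan above costs one dot product per document, which explains the N<500 finding: below that size the scan is cheaper than HNSW's graph-traversal overhead.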

Section 05

Project Summary and Future Outlook

This project demonstrates the complete tech stack of a production-grade RAG system, with each component carefully designed and performance-verified. The local-first architecture is suitable for privacy-sensitive scenarios, and the modular code facilitates customization and expansion. Future exploration directions: multi-modal retrieval (images, tables), real-time incremental index updates, and efficient quantization schemes to reduce storage and computing costs.


Section 06

Performance Tuning and Practical Recommendations

Tuning recommendations based on test results:

  1. Corpus size < 500: use brute-force search instead of an HNSW index;
  2. HNSW parameters: raise ef_search as the corpus grows (for N ≥ 50,000, recommend ef_search ≥ 200);
  3. Hybrid retrieval weights: increase the BM25 weight for term-dense documents (e.g., legal, medical) and the semantic weight for concept-dense documents (e.g., philosophy, literature);
  4. Chunk settings: 256-token chunks suit precise fact retrieval, while 512+ tokens preserve context coherence; an overlap of 15-20% is recommended.
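Recommendations 1 and 2 can be folded into a single selection heuristic. The sketch below encodes the article's two stated thresholds (brute force below 500 documents; ef_search ≥ 200 at N ≥ 50,000); the function name and the ef_search value of 100 for mid-sized corpora are assumptions, not values from the article.

```python
def choose_search_config(n_docs):
    """Pick a retrieval strategy from corpus size, per the tuning advice:
    brute force for tiny corpora, HNSW with a scaled ef_search otherwise."""
    if n_docs < 500:
        # Below ~500 docs, HNSW's fixed overhead exceeds a linear scan.
        return {"strategy": "brute_force"}
    # At N >= 50,000 the benchmark saw recall fall to 60% until ef_search
    # was raised; >= 200 is the recommended floor there. The mid-range
    # default of 100 is an illustrative assumption.
    ef_search = 200 if n_docs >= 50_000 else 100
    return {"strategy": "hnsw", "ef_search": ef_search}
```

A heuristic like this can sit in front of index construction so callers never hard-code the crossover point.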