Zing Forum


Building a Production-Grade RAG System from Scratch: Practical Implementation of HNSW-Based Vector Retrieval and Multi-Stage Recall Pipeline

This article provides an in-depth analysis of a complete local-first RAG chatbot backend implementation, covering core mechanisms such as HNSW approximate nearest neighbor indexing, hybrid retrieval (dense vectors + BM25), cross-encoder re-ranking, and MMR diversity deduplication. It also offers detailed performance benchmark data and architectural design insights.

Tags: RAG, HNSW, vector retrieval, approximate nearest neighbor, hybrid retrieval, BM25, cross-encoder re-ranking, local LLM, Ollama
Published 2026-03-30 07:07 · Recent activity 2026-03-30 07:22 · Estimated read: 6 min

Section 01

Introduction: Core Practices for Building a Production-Grade Local-First RAG System from Scratch

This article introduces a complete local-first RAG chatbot backend project, covering core mechanisms like HNSW vector indexing, hybrid retrieval (dense vectors + BM25), cross-encoder re-ranking, and MMR diversity deduplication. It provides detailed performance benchmark data and architectural design insights, aiming to address the knowledge timeliness and hallucination issues in LLM applications. All embedding calculations and text generation are done locally to ensure user data privacy.


Section 02

Project Background: Pain Points of LLM Applications and Limitations of Existing Solutions

In LLM application development, RAG has become a standard solution to address knowledge timeliness and hallucination issues. However, most tutorials only cover simple vector similarity search and lack discussions on key production environment issues: How to achieve low-latency retrieval while ensuring recall rate? How to handle incremental updates of massive documents? How to balance semantic understanding and keyword matching? This project provides a complete solution to these problems.


Section 03

Core Methods and System Architecture

The system uses FastAPI to build a RESTful API backend, with the core design concept of "local-first". The architecture is divided into three layers:

  1. Document Ingestion Layer: Supports loading, cleaning, chunking (token/sentence mode, default 450-token window + 80-token overlap), and vectorization of multi-format documents;
  2. Vector Storage Layer: Isolates user-specific corpora, supports document version management and soft deletion;
  3. Retrieval Service Layer: Implements a five-stage pipeline: Multi-Query Rewriting → Hybrid Retrieval (HNSW dense + BM25 sparse, fusion formula: score = 0.65 × semantic + 0.35 × keyword) → MMR Diversity Deduplication → Cross-Encoder Re-ranking → LLM Generation (local Ollama call to Qwen2.5:7B-Instruct).
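The score-fusion step of the hybrid retrieval stage can be sketched as follows. This is a minimal illustration of the stated formula (score = 0.65 × semantic + 0.35 × keyword), not the project's actual code; the function and document names are hypothetical, and the min-max normalization step is an assumption needed to make dense-similarity and BM25 scores comparable.

```python
def fuse_scores(semantic, keyword, w_sem=0.65, w_kw=0.35):
    """Combine dense (semantic) and BM25 (keyword) scores per the article's
    fusion formula: score = 0.65 * semantic + 0.35 * keyword."""
    def normalize(scores):
        # Min-max normalize so the two score scales are comparable
        # (an assumption; the project may normalize differently).
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sem = normalize(semantic)
    kw = normalize(keyword)
    fused = {}
    # Union of candidates: a document found by only one retriever still scores.
    for doc in set(sem) | set(kw):
        fused[doc] = w_sem * sem.get(doc, 0.0) + w_kw * kw.get(doc, 0.0)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Hypothetical scores: doc_b ranks first because it is strong in both channels.
ranked = fuse_scores(
    semantic={"doc_a": 0.92, "doc_b": 0.75, "doc_c": 0.40},
    keyword={"doc_b": 12.1, "doc_c": 9.8, "doc_d": 3.2},
)
```

Note how the 0.65/0.35 split lets a document that is merely good in both channels outrank one that tops a single channel, which is the point of hybrid fusion.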

Section 04

Performance Benchmark: HNSW vs. Brute-force Search

The project provides benchmark scripts, tested on an HP EliteBook 840 G4 with different corpus sizes:

  • At N=25,000: HNSW reduces latency 13.8x compared to brute-force search (median: 0.656 ms vs 9.071 ms), with Recall@5 holding at 100%;
  • At N=50,000: the speed advantage is 13.07x, but recall drops to 60% (the ef_search parameter needs to be increased);
  • At N<500: brute-force search is faster, as HNSW's fixed overhead exceeds the cost of a linear scan.
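The two ingredients of this benchmark are the exact brute-force baseline and the Recall@k metric HNSW is scored against. A minimal NumPy sketch of both (the project's actual script is not shown here; function names and the synthetic data are illustrative):

```python
import numpy as np

def brute_force_topk(corpus, query, k=5):
    """Exact top-k neighbors by cosine similarity: the ground truth
    that an approximate index like HNSW is measured against."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = corpus_n @ q                      # one dot product per document
    return set(np.argsort(-sims)[:k].tolist())

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate index retrieved."""
    return len(approx_ids & exact_ids) / len(exact_ids)

# Synthetic corpus just to exercise the functions.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64)).astype(np.float32)
query = rng.normal(size=64).astype(np.float32)
exact = brute_force_topk(corpus, query, k=5)
# Comparing the exact result to itself yields recall 1.0 by construction;
# in the real benchmark, approx_ids would come from the HNSW index.
recall = recall_at_k(exact, exact)
```

The linear scan above costs one dot product per document, which explains the N<500 finding: below that size the scan is cheaper than HNSW's graph-traversal overhead.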

Section 05

Project Summary and Future Outlook

This project demonstrates the complete tech stack of a production-grade RAG system, with each component carefully designed and performance-verified. The local-first architecture is suitable for privacy-sensitive scenarios, and the modular code facilitates customization and expansion. Future exploration directions: multi-modal retrieval (images, tables), real-time incremental index updates, and efficient quantization schemes to reduce storage and computing costs.


Section 06

Performance Tuning and Practical Recommendations

Tuning recommendations based on test results:

  1. Corpus size < 500: use brute-force search instead of an HNSW index;
  2. HNSW parameters: raise ef_search as the corpus grows (for N ≥ 50,000, recommend ef_search ≥ 200);
  3. Hybrid retrieval weights: increase the BM25 weight for term-dense documents (e.g., legal, medical) and the semantic weight for concept-dense documents (e.g., philosophy, literature);
  4. Chunk settings: 256-token chunks suit precise fact retrieval, while 512+ tokens preserve context coherence; an overlap of 15-20% is recommended.
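Recommendations 1 and 2 can be folded into a single selection heuristic. The sketch below encodes the article's two stated thresholds (brute force below 500 documents; ef_search ≥ 200 at N ≥ 50,000); the function name and the ef_search value of 100 for mid-sized corpora are assumptions, not values from the article.

```python
def choose_search_config(n_docs):
    """Pick a retrieval strategy from corpus size, per the tuning advice:
    brute force for tiny corpora, HNSW with a scaled ef_search otherwise."""
    if n_docs < 500:
        # Below ~500 docs, HNSW's fixed overhead exceeds a linear scan.
        return {"strategy": "brute_force"}
    # At N >= 50,000 the benchmark saw recall fall to 60% until ef_search
    # was raised; >= 200 is the recommended floor there. The mid-range
    # default of 100 is an illustrative assumption.
    ef_search = 200 if n_docs >= 50_000 else 100
    return {"strategy": "hnsw", "ef_search": ef_search}
```

A heuristic like this can sit in front of index construction so callers never hard-code the crossover point.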