Hybrid RAG: An End-to-End Retrieval-Augmented Generation Solution Combining Keyword and Semantic Search

A complete RAG pipeline implementation that combines dense vector retrieval and sparse keyword search, integrating Cross-Encoder re-ranking, local LLM inference, RAGAS evaluation, and LangSmith observability

Tags: RAG, hybrid retrieval, dense vector search, sparse keyword search, Cross-Encoder, LLM inference, RAGAS evaluation, LangSmith

Published 2026-06-16 01:16 | Recent activity 2026-06-16 01:22 | Estimated read 6 min

Section 01

Hybrid RAG: Introduction to the End-to-End Retrieval-Augmented Generation Solution Combining Keyword and Semantic Search

Project Basic Information

Original Author/Maintainer: DEVANSHU-KALI
Source Platform: GitHub
Original Link: https://github.com/DEVANSHU-KALI/Hybrid_RAG-Combining-keyword-and-semantic-search
Core Solution: This project provides a production-ready end-to-end RAG pipeline that combines dense vector retrieval and sparse keyword search, integrating Cross-Encoder re-ranking, local LLM inference, RAGAS evaluation, and LangSmith observability to address the limitations of traditional RAG systems in exact matching scenarios.

Section 02

Evolution of Retrieval-Augmented Generation and Background of Hybrid Retrieval

Retrieval-Augmented Generation (RAG) is a mainstream approach to reducing LLM hallucinations and working around stale training knowledge. However, traditional RAG relies purely on semantic vector search, which performs poorly when a query requires exact matching of proper nouns, product model numbers, code identifiers, and similar tokens. Hybrid retrieval closes this gap by combining the semantic understanding of dense search with the precision of keyword matching, improving retrieval quality across a wider range of queries.

Section 03

Project Architecture and Detailed Explanation of Hybrid Retrieval Mechanism

Core Architecture Components

Hybrid Retrieval Layer: Runs dense vector retrieval and sparse keyword search side by side for each query
Intelligent Re-ranking: Cross-Encoder model refines the initial results
Local LLM Inference: Supports private deployment
Quality Evaluation: RAGAS framework
Observability: LangSmith tracking and monitoring

Hybrid Retrieval Mechanism

Dense Vector Retrieval: Uses embedding models (for example, those provided by sentence-transformers) to encode queries and documents as vectors, scores semantic relevance by vector similarity, and excels at concept-level queries
Sparse Keyword Search: Built on an inverted index with BM25 scoring, enabling exact matching of specific identifiers and technical terms
Result Fusion Strategy: Adopts reciprocal rank fusion (RRF), weighted linear combination, or cascaded filtering to balance recall and precision
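
Of the fusion strategies listed above, reciprocal rank fusion is the simplest to illustrate, because it needs only the rank positions from each retriever and not their raw scores. The following is a minimal sketch, not the project's actual code: the function name, the document IDs, and the constant k=60 (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    the contributions are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from the two retrievers (best hit first).
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever is still kept as a lower-ranked candidate.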

Section 04

Cross-Encoder Re-ranking and Advantages of Local LLM Inference

Cross-Encoder Re-ranking

The initial retrieval stage yields many candidate documents. A Cross-Encoder concatenates the query with each candidate document, feeds the pair through the model, and outputs a fine-grained relevance score. Keeping only the top-scoring candidates narrows the set to the most relevant documents and improves generation quality. Because the query and document are processed jointly, a Cross-Encoder captures their interactions better than a Bi-Encoder, which encodes the two independently.
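
The re-ranking step can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's implementation: `rerank` and `toy_overlap_scorer` are hypothetical names, and the toy scorer only stands in for a real model so the example runs without downloads. In practice the `score_pairs` argument could be the `predict` method of a sentence-transformers `CrossEncoder`, which accepts a list of (query, document) pairs.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]

def rerank(query: str,
           candidates: Sequence[str],
           score_pairs: Callable[[List[Pair]], Sequence[float]],
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, document) pair jointly and keep the top_k documents."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

def toy_overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Placeholder scorer (NOT a Cross-Encoder): share of query words found in the document."""
    results = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        results.append(len(q_words & d_words) / len(q_words) if q_words else 0.0)
    return results

candidates = [
    "Shipping policy and order tracking",
    "Password rules and tips",
    "How to reset the password on model X200",
]
top = rerank("reset password for model X200", candidates, toy_overlap_scorer, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Passing the scorer in as a callable keeps the pipeline logic independent of the model, so the re-ranker can be swapped or fine-tuned without touching the retrieval code.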

Local LLM Inference

The pipeline supports local deployment, so sensitive data never leaves the local environment, which helps meet compliance requirements. It also removes the dependency on external APIs, reducing API costs and network latency.

Section 05

RAGAS Evaluation and LangSmith Observability

RAGAS Evaluation Framework

Provides multi-dimensional automated evaluation:

Context Relevance: Matching degree between retrieved documents and query
Faithfulness: Whether generated content is based on retrieved documents (no hallucinations)
Answer Relevance: Whether generated content directly answers the query
Context Recall: Whether retrieved documents contain all required information
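
To make one of these metrics concrete, context recall can be read as a ratio: the number of statements in a reference answer that are supported by the retrieved context, divided by the total number of reference statements. The sketch below illustrates only that ratio. It is not the RAGAS implementation, which relies on an LLM to judge whether each statement is supported; the case-insensitive substring check and the function name `context_recall_proxy` are simplifying assumptions.

```python
def context_recall_proxy(reference_statements, contexts):
    """Share of reference statements found verbatim (case-insensitive) in any context.

    A crude stand-in for an LLM-based support judgment, for illustration only.
    """
    if not reference_statements:
        return 0.0
    lowered = [c.lower() for c in contexts]
    supported = sum(
        1 for statement in reference_statements
        if any(statement.lower() in context for context in lowered)
    )
    return supported / len(reference_statements)

# Hypothetical reference statements and retrieved context.
reference = [
    "BM25 is a sparse ranking function",
    "RRF merges ranked lists",
]
retrieved = ["BM25 is a sparse ranking function used by many search engines."]

recall = context_recall_proxy(reference, retrieved)
print(recall)
```

A low value signals that retrieval missed information the answer needs, which points to the retrieval layer rather than the generator as the place to investigate.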

LangSmith Observability

Request Tracking: Complete recording of processing flow
Latency Analysis: Identifying performance bottlenecks
Retrieval Visualization: Viewing documents and scores
Debugging Support: Locating retrieval/generation issues
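
The idea behind latency analysis can be shown with plain Python. The sketch below is a stand-in that does not use LangSmith at all: it records wall-clock time per pipeline stage so the slowest stage can be identified. The stage names and the `time.sleep` calls are placeholders for real retrieval and generation work. With LangSmith itself, comparable per-step timings are typically collected by wrapping functions with the `traceable` decorator from the `langsmith` package.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock seconds spent inside the block under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)

# Placeholder workloads standing in for real pipeline stages.
with timed("retrieve"):
    time.sleep(0.02)
with timed("generate"):
    time.sleep(0.05)

slowest = max(timings, key=timings.get)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
print("slowest stage:", slowest)
```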

Section 06

Practical Significance and Deployment Recommendations

Practical Significance

The project covers the full RAG stack, which makes it a practical starting point for building enterprise-level RAG systems: hybrid retrieval handles a wide range of queries, Cross-Encoder re-ranking improves result quality, local LLM inference protects privacy, and RAGAS and LangSmith support continuous optimization.

Deployment Recommendations

Adjust the weights of dense and sparse retrieval
Fine-tune embedding models and re-ranking models for domain-specific data
Establish a continuous evaluation feedback loop
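
The first recommendation, adjusting the weights of dense and sparse retrieval, can be sketched as a weighted linear combination. This is a minimal illustration under assumptions, not the project's code: the function names, the min-max normalization step, the `alpha` parameter, and the sample scores are all hypothetical. Normalization is included because cosine similarities and BM25 scores live on different scales and cannot be mixed directly.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(dense, sparse, alpha=0.5):
    """Combine normalized scores: alpha weights dense, (1 - alpha) weights sparse."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    docs = set(dense_n) | set(sparse_n)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in docs
    }
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical raw scores: cosine similarity (dense) and BM25 (sparse).
dense_scores = {"doc_a": 0.82, "doc_b": 0.75, "doc_c": 0.40}
sparse_scores = {"doc_c": 12.0, "doc_a": 3.0, "doc_d": 1.0}

semantic_first = weighted_fusion(dense_scores, sparse_scores, alpha=0.7)
keyword_first = weighted_fusion(dense_scores, sparse_scores, alpha=0.2)
print([doc for doc, _ in semantic_first])
print([doc for doc, _ in keyword_first])
```

Lowering `alpha` favors exact keyword hits, which tends to suit queries full of identifiers or model numbers, while raising it favors conceptual matches. A suitable value is best chosen by measuring retrieval quality on a representative query set.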
