Semantic Cache: A Distributed Semantic Caching Layer for LLM Inference

Semantic Cache is an OpenAI-compatible proxy service that caches LLM responses by meaning rather than by exact text. It matches incoming questions against earlier ones via vector similarity and reuses the stored answer for similar questions, reducing API call costs and response latency.

Tags: LLM, Caching, Vector Search, OpenAI, Qdrant, Semantic Similarity, Performance Optimization, FastAPI, Multi-Tenancy

Published: 2026-06-15 00:13 | Recent activity: 2026-06-15 00:20 | Estimated read: 7 min

Section 01

Introduction: Semantic Cache: A Distributed Semantic Caching Layer for LLM Inference

Section 02

Original Author and Source

Original Author/Maintainer: Dhivakar A V (SRM IST-Trichy, CSE AI/ML Program, Class of 2027)
Source Platform: GitHub
Original Title: semantic-cache
Original Link: https://github.com/dhivakarav/semantic-cache
Publication Date: June 14, 2026

Section 03

Background: Why Do We Need Semantic Caching?

As applications built on Large Language Models (LLMs) multiply, API call costs have become a core expense for many products. Traditional caching strategies rely on exact matching: a cache hit occurs only when the user input is identical to a historical query. In real-world use, however, people often express the same need with different phrasing.

"How's the weather in Beijing?" and "Will it rain in Beijing today?" are essentially the same question, but traditional caching treats them as completely different queries. Such semantically redundant requests lead to a large number of unnecessary API calls, wasting costs and increasing response latency.

Section 04

Project Overview

Semantic Cache is a distributed semantic caching layer designed specifically for LLM inference. It runs as an OpenAI-compatible proxy service that intercepts API calls, embeds each query as a vector, and uses Approximate Nearest Neighbor (ANN) search to find semantically similar earlier queries. When the similarity exceeds a threshold, it returns the cached result directly instead of calling the upstream model.
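The hit decision described above can be illustrated with a minimal sketch. It assumes cosine similarity as the score and uses a linear scan over an in-memory list as a stand-in for the ANN index; the function names `cosine_similarity` and `best_match` are hypothetical and not taken from the project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def best_match(query_vec, cached, threshold):
    """Return the most similar cached answer, or None on a cache miss.

    `cached` is a list of (vector, answer) pairs. The linear scan here is an
    illustrative substitute for an ANN search in a vector database.
    """
    best_score, best_answer = -1.0, None
    for vec, answer in cached:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    # A hit requires the best score to exceed the threshold.
    return best_answer if best_score > threshold else None
```

A real deployment would work with 1536-dimensional embeddings rather than the toy vectors this sketch is easiest to try with, and would delegate the search to the vector store.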

Core features include:

Semantic-level Matching: Generates 1536-dimensional vector representations based on the OpenAI text-embedding-3-small model
Qdrant Vector Storage: Efficient ANN search with support for TTL expiration and multi-tenant isolation
Streaming Response Support: Caches and replays SSE (Server-Sent Events) streaming responses
Intelligent Threshold Calibration: Configures different similarity thresholds for different query types (factual, code, creative)
Cold-Start Warm-Up: Clusters historical query logs with k-means and pre-generates answers for representative queries
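The source does not describe the warm-up pipeline in detail, so the following is only a sketch of the general idea: cluster the embedding vectors of logged queries with plain k-means (Lloyd's algorithm), then pick the logged query nearest each centroid as the representative to answer ahead of time. The function names, the seeding with the first k points, and the fixed iteration count are all assumptions made for illustration.

```python
def _sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

def representatives(queries, vectors, k):
    """For each non-empty cluster, return the logged query closest to its centroid."""
    centroids, assignment = kmeans(vectors, k)
    reps = []
    for c, centroid in enumerate(centroids):
        members = [i for i in range(len(vectors)) if assignment[i] == c]
        if members:
            nearest = min(members, key=lambda i: _sq_dist(vectors[i], centroid))
            reps.append(queries[nearest])
    return reps
```

The representative queries would then be sent to the upstream model once, and their answers stored in the cache before real traffic arrives.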

Section 05

Technical Architecture Analysis

The system is built from several cooperating components:

Section 06

1. FastAPI Proxy Layer

The proxy service listens on port 8000 and exposes an OpenAI-compatible API. For each incoming request, it performs the following steps:

Uses SHA-256 to compute a fingerprint for the system prompt
Calls the OpenAI embedding service to convert user input into a vector
Performs an ANN search in Qdrant to find similar historical queries
If a cached item with similarity exceeding the threshold is found, directly returns the cached result
Otherwise, forwards to the upstream LLM and stores the result in the cache
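The five steps above can be sketched as a single function. This is not the project's code: the embedding call, the vector search, the upstream LLM call, and the cache write are passed in as hypothetical callables (`embed`, `search`, `call_upstream`, `store`) so the control flow can be read and tested without any network access. Only the SHA-256 fingerprint uses a concrete library call.

```python
import hashlib

def fingerprint(system_prompt: str) -> str:
    """SHA-256 hex digest of the system prompt."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()

def handle_request(system_prompt, user_input, threshold,
                   embed, search, call_upstream, store):
    """Return (answer, was_cache_hit). All four callables are stand-ins:

    embed(text) -> vector
    search(scope, vector) -> (score, answer) or None
    call_upstream(system_prompt, user_input) -> answer
    store(scope, vector, answer) -> None
    """
    scope = fingerprint(system_prompt)             # step 1: fingerprint the prompt
    vector = embed(user_input)                     # step 2: embed the user input
    hit = search(scope, vector)                    # step 3: nearest-neighbour lookup
    if hit is not None and hit[0] > threshold:     # step 4: similar enough, reuse
        return hit[1], True
    answer = call_upstream(system_prompt, user_input)  # step 5: miss, ask the LLM
    store(scope, vector, answer)                   # cache the new answer
    return answer, False
```

Passing the dependencies in as arguments is a design choice of this sketch; it keeps the hit/miss logic separate from whichever embedding client and vector store are actually used.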

Section 07

2. Multi-Tenant Isolation Mechanism

Cache entries are partitioned into namespaces of the form {tenant_id}:{system_prompt_fingerprint}. This means:

Data from different organizations lives in separate namespaces, so a lookup for one tenant never matches another tenant's entries
Different system prompts within the same tenant are also cached separately to avoid context confusion
The tenant ID is passed via the HTTP header X-Tenant-ID, so the tenant is selected per request
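A minimal sketch of how such a namespace key could be derived from a request follows. The function name `cache_namespace` is hypothetical, and two behaviours are assumptions rather than documented features: rejecting requests without the header, and rejecting tenant IDs that contain the ":" separator.

```python
import hashlib

def cache_namespace(headers: dict, system_prompt: str) -> str:
    """Build the '{tenant_id}:{system_prompt_fingerprint}' namespace string."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    tenant_id = lowered.get("x-tenant-id", "").strip()
    if not tenant_id:
        # Assumption: a request without a tenant is rejected, not given a default.
        raise ValueError("missing X-Tenant-ID header")
    if ":" in tenant_id:
        # Assumption: forbid the separator so namespaces stay unambiguous.
        raise ValueError("tenant id must not contain ':'")
    prompt_fp = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{prompt_fp}"
```

Two tenants sending the same system prompt get different namespaces, and one tenant sending two different system prompts does as well, which matches the two isolation properties listed above.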

Section 08

3. Query Type-Aware Threshold Strategy

Instead of applying one fixed cosine similarity threshold to all traffic, the project assigns a default threshold to each query type:

Query Type	Default Threshold	Design Considerations
Factual Query	0.96	Requires high accuracy; a loose match risks returning the answer to a different question
Code Query	0.94	Code semantics are sensitive; minor differences can lead to completely different results
Creative Query	0.90	Tolerates greater semantic drift; similar prompts can share a cached response
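The table above translates directly into a lookup. In this sketch the threshold values come from the table, while the `QueryType` enum and the `is_cache_hit` helper are hypothetical names; how the project decides which type a query belongs to is not described in the source, so the caller is assumed to supply it.

```python
from enum import Enum

class QueryType(Enum):
    FACTUAL = "factual"
    CODE = "code"
    CREATIVE = "creative"

# Default thresholds as listed in the table.
DEFAULT_THRESHOLDS = {
    QueryType.FACTUAL: 0.96,
    QueryType.CODE: 0.94,
    QueryType.CREATIVE: 0.90,
}

def is_cache_hit(similarity: float, query_type: QueryType) -> bool:
    """A match counts as a hit only if it exceeds the threshold for its type."""
    return similarity > DEFAULT_THRESHOLDS[query_type]
```

For example, a similarity of 0.95 would count as a hit for a code query but as a miss for a factual query.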

The project also implements a threshold calibrator based on logistic regression, trained on labeled sample pairs of the form (query_A, query_B, should_cache: bool). The reported improvement in classification performance over fixed thresholds is approximately 15%.
