Zing Forum

Semantic Cache: A Distributed Semantic Caching Layer for LLM Inference

An OpenAI-compatible proxy service that implements semantic-level caching via vector similarity matching, allowing reuse of existing answers for similar questions, significantly reducing API call costs and response latency.

LLM, Caching, Vector Search, OpenAI, Qdrant, Semantic Similarity, Performance Optimization, FastAPI, Multi-Tenancy
Published 2026-06-15 00:13 | Recent activity 2026-06-15 00:20 | Estimated read 7 min

Section 01

Introduction: Semantic Cache: A Distributed Semantic Caching Layer for LLM Inference

Semantic Cache is an OpenAI-compatible proxy service that caches at the semantic level: it matches each incoming question against earlier ones by vector similarity and reuses the existing answer when the two are similar enough, which reduces both API call costs and response latency.


Section 02

Original Author and Source

  • Original Author/Maintainer: Dhivakar A V (SRM IST-Trichy, CSE AI/ML Program, Class of 2027)
  • Source Platform: GitHub
  • Original Title: semantic-cache
  • Original Link: https://github.com/dhivakarav/semantic-cache
  • Publication Date: June 14, 2026


Section 03

Background: Why Do We Need Semantic Caching?

As applications built on Large Language Models (LLMs) multiply, API call costs have become a core expense for many products. Traditional caching relies on exact matching: a cache hit occurs only when the user input is identical to a historical query. In practice, however, users often phrase the same need in different ways.

"How's the weather in Beijing?" and "Will it rain in Beijing today?" ask for essentially the same information, yet an exact-match cache treats them as unrelated queries. Such semantically redundant requests trigger many unnecessary API calls, which raises costs and adds response latency.



Section 04

Project Overview

Semantic Cache is a distributed semantic caching layer designed specifically for LLM inference. It runs as an OpenAI-compatible proxy service that intercepts API calls, measures semantic similarity using vector embeddings and Approximate Nearest Neighbor (ANN) search, and returns the cached result directly when the similarity exceeds a threshold.
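The threshold check described above can be illustrated with a minimal sketch. It is not the project's code: the function names are hypothetical, a linear scan stands in for the Qdrant ANN index, and toy 2-dimensional vectors stand in for real embeddings.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lookup(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float,
) -> Optional[str]:
    """Return the answer of the most similar cached entry, or None on a miss.

    `entries` holds (vector, answer) pairs. The linear scan is an assumption
    made for illustration; a real deployment would query an ANN index.
    """
    best_score, best_answer = -1.0, None
    for vec, answer in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    # A hit requires the similarity to exceed the threshold.
    return best_answer if best_score > threshold else None


entries = [([1.0, 0.0], "Sunny, 25 C"), ([0.0, 1.0], "Use a for loop")]
near = lookup([0.9, 0.1], entries, threshold=0.95)   # close to the first entry
far = lookup([0.7, 0.7], entries, threshold=0.95)    # between both, so a miss
```

Here `near` resolves to the first cached answer while `far` is a miss and would be forwarded upstream.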

Core features include:

  • Semantic-level Matching: Generates 1536-dimensional vector representations with the OpenAI text-embedding-3-small model
  • Qdrant Vector Storage: Efficient ANN search with support for TTL expiration and multi-tenant isolation
  • Streaming Response Support: Full support for caching and replaying SSE (Server-Sent Events) streams
  • Intelligent Threshold Calibration: Configures different similarity thresholds for different query types (factual, code, creative)
  • Cold-Start Warm-Up: Clusters historical query logs with k-means and pre-generates answers for representative queries
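To make the streaming feature more concrete, here is a minimal sketch of how a cached answer could be replayed as an OpenAI-style SSE stream. The function name, the chunk size, and the reduced set of chunk fields are assumptions for illustration (real chunks also carry fields such as `id` and `created`); the project's actual replay logic may differ.

```python
import json
from typing import Iterator


def replay_as_sse(cached_text: str, model: str, chunk_size: int = 16) -> Iterator[str]:
    """Yield a cached answer as a sequence of SSE `data:` events."""
    for start in range(0, len(cached_text), chunk_size):
        piece = cached_text[start:start + chunk_size]
        payload = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [
                {"index": 0, "delta": {"content": piece}, "finish_reason": None}
            ],
        }
        # Each SSE event is a `data:` line followed by a blank line.
        yield f"data: {json.dumps(payload)}\n\n"

    # The last chunk has an empty delta and carries the finish reason.
    final = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
    }
    yield f"data: {json.dumps(final)}\n\n"
    # OpenAI-style streams end with a literal [DONE] sentinel.
    yield "data: [DONE]\n\n"


events = list(replay_as_sse("hello world", model="demo-model", chunk_size=5))
```

In a FastAPI handler, a generator like this could be wrapped in a `StreamingResponse` with media type `text/event-stream`.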


Section 05

Technical Architecture Analysis

The system is built from several cooperating components:


Section 06

1. FastAPI Proxy Layer

The proxy service listens on port 8000 and exposes a fully OpenAI-compatible API. For each incoming request, it performs the following steps:

  1. Uses SHA-256 to compute a fingerprint for the system prompt
  2. Calls the OpenAI embedding service to convert user input into a vector
  3. Performs an ANN search in Qdrant to find similar historical queries
  4. If a cached item with similarity exceeding the threshold is found, directly returns the cached result
  5. Otherwise, forwards to the upstream LLM and stores the result in the cache
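The five steps above can be sketched as a single request handler. This is a hedged illustration rather than the project's implementation: `CacheProxy`, its injected callables (`embed`, `search`, `store`, `call_upstream`), and the default threshold are all hypothetical names and values, chosen so that the embedding service, Qdrant, and the upstream LLM can be swapped for stubs.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Vector = Sequence[float]


def fingerprint(system_prompt: str) -> str:
    """SHA-256 hex digest of the system prompt (step 1)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


@dataclass
class CacheProxy:
    # All four callables are placeholders for external services.
    embed: Callable[[str], Vector]
    search: Callable[[str, Vector], Optional[Tuple[float, str]]]
    store: Callable[[str, Vector, str], None]
    call_upstream: Callable[[str, str], str]
    threshold: float = 0.95  # assumed default, not taken from the project

    def handle(self, system_prompt: str, user_input: str) -> Tuple[str, bool]:
        """Return (answer, cache_hit) for one request."""
        scope = fingerprint(system_prompt)             # step 1: fingerprint
        vector = self.embed(user_input)                # step 2: embed input
        hit = self.search(scope, vector)               # step 3: nearest entry
        if hit is not None and hit[0] > self.threshold:
            return hit[1], True                        # step 4: serve cache
        answer = self.call_upstream(system_prompt, user_input)
        self.store(scope, vector, answer)              # step 5: fill cache
        return answer, False
```

Passing the dependencies in as callables keeps the control flow testable without network access; `search` is expected to return the best `(similarity, answer)` pair within the given scope, or `None` if the scope is empty.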

Section 07

2. Multi-Tenant Isolation Mechanism

Each cache entry lives in a namespace of the form {tenant_id}:{system_prompt_fingerprint}, which isolates every tenant's cache. This means:

  • Data from different organizations is completely isolated, with no risk of cross-tenant leakage
  • Different system prompts within the same tenant are also cached separately to avoid context confusion
  • The tenant ID is passed in the X-Tenant-ID HTTP header, so a client switches tenants simply by changing that header
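A minimal sketch of how such a namespace key might be derived from a request follows. The function name is hypothetical, and two behaviors are assumptions rather than facts from the project: using the full SHA-256 hex digest as the fingerprint, and rejecting requests that lack the X-Tenant-ID header.

```python
import hashlib
from typing import Mapping


def namespace_for(headers: Mapping[str, str], system_prompt: str) -> str:
    """Build the `{tenant_id}:{system_prompt_fingerprint}` cache namespace."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    tenant_id = normalised.get("x-tenant-id", "").strip()
    if not tenant_id:
        # Assumption: refuse to cache rather than fall back to a shared scope.
        raise ValueError("missing X-Tenant-ID header")
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{digest}"


ns = namespace_for({"X-Tenant-ID": "acme"}, "You are a helpful assistant.")
```

Because the fingerprint is part of the key, two requests from the same tenant with different system prompts end up in different namespaces and never share cached answers.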

Section 08

3. Query Type-Aware Threshold Strategy

Instead of a single fixed cosine-similarity threshold for all traffic, the project classifies queries and applies a separate threshold to each type:

Query Type     | Default Threshold | Design Considerations
Factual Query  | 0.96              | Requires high accuracy to avoid cached reuse of incorrect answers
Code Query     | 0.94              | Code semantics are sensitive; minor differences can lead to completely different results
Creative Query | 0.90              | Allows greater semantic drift; similar questions can share creative inspiration
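The defaults in the table translate into a simple lookup. The sketch below uses hypothetical names, and the fallback to the strictest threshold for unrecognized query types is an assumption, not documented project behavior.

```python
# Default thresholds taken from the table above.
DEFAULT_THRESHOLDS = {
    "factual": 0.96,
    "code": 0.94,
    "creative": 0.90,
}


def is_cache_hit(similarity: float, query_type: str) -> bool:
    """Decide whether a cached answer may be reused for this query type."""
    # Assumption: unknown types get the strictest threshold to stay safe.
    strictest = max(DEFAULT_THRESHOLDS.values())
    threshold = DEFAULT_THRESHOLDS.get(query_type, strictest)
    return similarity > threshold
```

With these values, a similarity of 0.95 is a hit for a code query but a miss for a factual one.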

The project also implements a threshold calibrator based on logistic regression. It is trained on labeled sample pairs of the form (query_A, query_B, should_cache: bool), and the resulting classifier is reported to perform approximately 15% better than fixed thresholds.
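As a rough illustration of the idea, the sketch below fits a one-feature logistic regression with plain gradient descent. It assumes each (query_A, query_B) pair has already been reduced to a single cosine similarity; the features the project's calibrator actually uses are not stated in the source, and all names, hyperparameters, and training data here are made up for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Calibrator:
    weight: float
    bias: float
    mean: float
    std: float

    def predict(self, similarity: float) -> float:
        """Estimated probability that reusing the cached answer is correct."""
        z = (similarity - self.mean) / self.std
        return _sigmoid(self.weight * z + self.bias)


def fit_calibrator(
    samples: List[Tuple[float, bool]], lr: float = 0.5, epochs: int = 2000
) -> Calibrator:
    """Fit P(should_cache | similarity) by batch gradient descent."""
    sims = [s for s, _ in samples]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims)) or 1.0
    # Standardise the feature: raw similarities span a very narrow range.
    xs = [(s - mean) / std for s in sims]
    ys = [1.0 if label else 0.0 for _, label in samples]
    w = b = 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = _sigmoid(w * x + b) - y  # gradient of the log-loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return Calibrator(w, b, mean, std)


# Made-up (similarity, should_cache) training data.
training = [(0.88, False), (0.90, False), (0.91, False), (0.93, False),
            (0.95, True), (0.96, True), (0.97, True), (0.99, True)]
model = fit_calibrator(training)
```

A caller would then reuse a cached answer when `model.predict(similarity)` clears a chosen probability cutoff such as 0.5, instead of comparing the raw similarity with a fixed number.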