# Semantic Cache: A Distributed Semantic Caching Layer for LLM Inference

> An OpenAI-compatible proxy service that implements semantic-level caching via vector similarity matching, allowing reuse of existing answers for similar questions, significantly reducing API call costs and response latency.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-14T16:13:57.000Z
- Last activity: 2026-06-14T16:20:32.406Z
- Popularity: 161.9
- Keywords: LLM, caching, vector search, OpenAI, Qdrant, semantic similarity, performance optimization, FastAPI, multi-tenancy
- Page link: https://www.zingnex.cn/en/forum/thread/semantic-cache-llm
- Canonical: https://www.zingnex.cn/forum/thread/semantic-cache-llm
- Markdown source: floors_fallback

---

## Original Author and Source

- **Original Author/Maintainer**: Dhivakar A V (SRM IST-Trichy, CSE AI/ML Program, Class of 2027)
- **Source Platform**: GitHub
- **Original Title**: semantic-cache
- **Original Link**: <https://github.com/dhivakarav/semantic-cache>
- **Publication Date**: June 14, 2026

---

## Background: Why Do We Need Semantic Caching?

As Large Language Model (LLM) applications proliferate, API call costs have become a core expense for many products. Traditional caches rely on exact matching: a hit occurs only when the user input is identical to a previously seen query. In practice, users often phrase the same need in different ways.

"How's the weather in Beijing?" and "Will it rain in Beijing today?" ask for essentially the same information, yet an exact-match cache treats them as unrelated queries. Such semantically redundant requests trigger many unnecessary API calls, which raises costs and adds response latency.

---

## Project Overview

Semantic Cache is a distributed semantic caching layer designed for LLM inference. It runs as an OpenAI-compatible proxy that intercepts API calls, embeds each query as a vector, and uses Approximate Nearest Neighbor (ANN) search to find semantically similar past queries. When the similarity exceeds a threshold, it returns the cached result directly instead of calling the upstream model.

Core features include:

- **Semantic-level Matching**: Embeds queries as 1536-dimensional vectors using the OpenAI `text-embedding-3-small` model
- **Qdrant Vector Storage**: Efficient ANN search with support for TTL expiration and multi-tenant isolation
- **Streaming Response Support**: Caches and replays SSE (Server-Sent Events) streaming responses
- **Intelligent Threshold Calibration**: Applies different similarity thresholds to different query types (factual, code, creative)
- **Cold-Start Warm-Up**: Clusters historical query logs with k-means and pre-generates answers for representative queries
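
To make the cold-start idea concrete, here is a minimal sketch of how representative queries might be picked from a log. It assumes the logged queries have already been embedded, uses a plain Lloyd's k-means written from scratch, and treats the query nearest each centroid as the one to pre-answer. The names `kmeans` and `representatives` are hypothetical and are not taken from the project.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, assignments


def representatives(queries, vectors, k):
    """Return, for each non-empty cluster, the logged query closest to its centroid."""
    centroids, assignments = kmeans(vectors, k)
    reps = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            best = min(members, key=lambda i: math.dist(vectors[i], centroid))
            reps.append(queries[best])
    return reps
```

Each representative would then be sent to the upstream model once, and its answer stored before real traffic arrives. How the project chooses `k` or selects representatives is not described here, so both are assumptions of this sketch.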

---

## Technical Architecture Analysis

The system is built from several cooperating components:

## 1. FastAPI Proxy Layer

The proxy service listens on port 8000 and provides a fully OpenAI-compatible API interface. When a request is received, it performs the following steps:

1. Uses SHA-256 to compute a fingerprint for the system prompt
2. Calls the OpenAI embedding service to convert user input into a vector
3. Performs an ANN search in Qdrant to find similar historical queries
4. If a cached item with similarity exceeding the threshold is found, directly returns the cached result
5. Otherwise, forwards to the upstream LLM and stores the result in the cache
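
The five steps above can be sketched in memory as follows. This is an illustration under stated assumptions, not the project's code: a brute-force cosine scan stands in for Qdrant's ANN search, and the `embed` and `call_llm` callables are hypothetical stand-ins for the OpenAI embedding call and the upstream LLM request.

```python
import hashlib
import math


def fingerprint(system_prompt: str) -> str:
    """Step 1: SHA-256 fingerprint of the system prompt."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, call_llm, threshold=0.95):
        self.embed = embed          # hypothetical: text -> list[float]
        self.call_llm = call_llm    # hypothetical: (system_prompt, user_input) -> str
        self.threshold = threshold
        self.entries = {}           # fingerprint -> list of (vector, answer)

    def complete(self, system_prompt, user_input):
        """Return (answer, was_cache_hit)."""
        key = fingerprint(system_prompt)                 # step 1
        vector = self.embed(user_input)                  # step 2
        bucket = self.entries.setdefault(key, [])
        # Step 3: nearest stored query (exhaustive scan instead of ANN).
        best = max(bucket, key=lambda e: cosine(vector, e[0]), default=None)
        # Step 4: reuse the cached answer when similarity exceeds the threshold.
        if best is not None and cosine(vector, best[0]) > self.threshold:
            return best[1], True
        # Step 5: miss, so call upstream and store the result.
        answer = self.call_llm(system_prompt, user_input)
        bucket.append((vector, answer))
        return answer, False
```

The exhaustive scan is only practical for small caches; an ANN index trades a little recall for sub-linear lookup time, which is why the project delegates step 3 to Qdrant.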

## 2. Multi-Tenant Isolation Mechanism

Each cache entry lives in a namespace of the form `{tenant_id}:{system_prompt_fingerprint}`. This means:

- Data from different organizations is stored in separate namespaces, so a lookup for one tenant cannot match another tenant's entries
- Different system prompts within the same tenant are also cached separately, so an answer produced under one prompt is not reused under another
- The tenant ID is passed in the `X-Tenant-ID` HTTP header, so a client can switch tenants on a per-request basis
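
As a rough illustration, the namespace could be derived from a request like this. The function name `cache_namespace` and the choice to reject requests without a tenant header are assumptions of this sketch. How the namespace maps onto Qdrant (separate collections, payload filters, or something else) is not specified in the source.

```python
import hashlib


def cache_namespace(headers: dict, system_prompt: str) -> str:
    """Build '{tenant_id}:{system_prompt_fingerprint}' for a request."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    tenant_id = normalized.get("x-tenant-id")
    if not tenant_id:
        # Assumption: refuse to cache rather than fall back to a shared namespace.
        raise ValueError("missing X-Tenant-ID header")
    prompt_fp = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{prompt_fp}"
```

Because the fingerprint is a hash of the full system prompt, even a one-character change to the prompt yields a different namespace.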

## 3. Query Type-Aware Threshold Strategy

Instead of a single fixed cosine similarity threshold, the project applies a different threshold to each query type:

| Query Type | Default Threshold | Design Considerations |
|------------|-------------------|-----------------------|
| Factual Query | 0.96 | Requires high accuracy to avoid cached reuse of incorrect answers |
| Code Query | 0.94 | Code semantics are sensitive; minor differences can lead to completely different results |
| Creative Query | 0.90 | Allows greater semantic drift; similar questions can share creative inspiration |
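
A minimal sketch of how the table might be applied at lookup time. The dictionary keys and the `is_cache_hit` helper are hypothetical names, and falling back to the strictest threshold for unrecognized query types is an assumption of this sketch. How the project classifies a query into a type is not covered here.

```python
# Default thresholds from the table above.
DEFAULT_THRESHOLDS = {"factual": 0.96, "code": 0.94, "creative": 0.90}


def is_cache_hit(similarity: float, query_type: str) -> bool:
    """Return True when the similarity exceeds the threshold for this query type."""
    # Assumption: unknown types use the strictest threshold to stay on the safe side.
    strictest = max(DEFAULT_THRESHOLDS.values())
    threshold = DEFAULT_THRESHOLDS.get(query_type, strictest)
    return similarity > threshold
```

With these defaults, a pair with similarity 0.95 would be served from cache for a code or creative query but would go upstream for a factual one.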

The project also implements a threshold calibrator based on logistic regression. Trained on labeled pairs of the form `(query_A, query_B, should_cache: bool)`, the calibrator reportedly performs approximately 15% better than fixed thresholds.
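
One way such a calibrator could work is sketched below, assuming the cosine similarity of each labeled pair is the only feature. The model is a one-dimensional logistic regression fitted by batch gradient descent, and the calibrated threshold is the similarity at which the predicted probability of `should_cache` crosses 0.5. The function name `fit_calibrator`, the single-feature setup, and the hyperparameters are assumptions; the project's actual features and training procedure are not described here.

```python
import math


def fit_calibrator(similarities, labels, lr=0.5, epochs=2000):
    """Fit p(should_cache | similarity) and return the similarity where p = 0.5.

    Assumes both labels occur in the training data.
    """
    n = len(similarities)
    # Standardize the feature so gradient descent converges at a sensible rate.
    mean = sum(similarities) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in similarities) / n) or 1.0
    xs = [(s - mean) / std for s in similarities]

    w = b = 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            err = 1.0 / (1.0 + math.exp(-(w * x + b))) - y  # prediction minus label
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n

    # Decision boundary: w * x + b = 0, mapped back to raw similarity units.
    return mean + std * (-b / w)
```

The labeled pairs would first be reduced to similarities by embedding `query_A` and `query_B` and taking their cosine similarity. Fitting one calibrator per query type would yield data-driven replacements for the default thresholds in the table.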
