As Large Language Model (LLM) applications proliferate, API call costs have become a core expense for many products. Traditional caching relies on exact matching: a cache hit occurs only when the user's input is identical to a previously seen query. In practice, however, users often express the same need with different phrasing.
"How's the weather in Beijing?" and "Will it rain in Beijing today?" ask for essentially the same information, yet an exact-match cache treats them as completely different queries. Such semantically redundant requests trigger many unnecessary API calls, which raises costs and adds response latency.