DaseR: A RAG-Native KV Cache Service to Accelerate LLM Inference

A KV cache service designed specifically for Retrieval-Augmented Generation (RAG). It reduces Time To First Token (TTFT) and improves long-context inference efficiency by preloading KV caches for retrieved documents.

Tags: RAG, KV cache, inference optimization, LLM, retrieval-augmented generation, caching, performance
Published 2026-06-09 18:03 | Recent activity 2026-06-09 18:20 | Estimated read 6 min

Section 01

DaseR: A RAG-Native KV Cache Service to Accelerate LLM Inference

DaseR is a KV cache service designed specifically for Retrieval-Augmented Generation (RAG) scenarios. It reduces Time To First Token (TTFT) and improves long-context inference efficiency by preloading KV caches for retrieved documents. This post covers its background, architecture, performance benefits, application scenarios, and future outlook.


Section 02

Project Background: Performance Bottlenecks in RAG Inference

Retrieval-Augmented Generation (RAG) has become a mainstream LLM application architecture, but it faces unique performance challenges: each request requires processing large retrieved document contexts, leading to increased TTFT and poor user experience. Traditional KV cache mechanisms optimize for conversation history but lack efficient strategies for static knowledge (e.g., product manuals, technical docs) that repeat across queries. DaseR addresses this pain point as a RAG-native KV cache service.


Section 03

Core Architecture: Decoupling Static Knowledge and Dynamic Queries

DaseR's core design decouples the static part of RAG inference (retrieved documents) from the dynamic part (user queries):

  1. Document-level KV Cache: Persistently stores Key-Value representations of retrieved documents; reuses them when documents appear again to avoid repeated computation.
  2. Dynamic Query Splicing: Efficiently splices user queries with cached document KV states (documents account for >80% of input tokens, so skipping their computation reduces TTFT).
  3. Cache Consistency Management: Provides invalidation and update mechanisms to precisely refresh affected document caches when knowledge bases change.
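
The three pieces above can be sketched as a small cache keyed by document ID and content hash. This is a minimal illustration under stated assumptions, not DaseR's actual implementation: `DocKVCache`, `get_or_compute`, `invalidate`, and `build_request` are hypothetical names, and the "KV state" is a placeholder list rather than real attention tensors.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DocKVCache:
    """Toy document-level cache (hypothetical; not DaseR's API)."""
    entries: dict = field(default_factory=dict)  # doc_id -> (content_hash, kv)
    prefill_calls: int = 0                       # counts simulated document prefills

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def _prefill(self, text: str) -> list:
        # Stand-in for running the model over the document tokens.
        self.prefill_calls += 1
        return [len(token) for token in text.split()]

    def get_or_compute(self, doc_id: str, text: str) -> list:
        digest = self._digest(text)
        cached = self.entries.get(doc_id)
        if cached is not None and cached[0] == digest:
            return cached[1]                     # hit: reuse the stored state
        kv = self._prefill(text)                 # miss, or the document changed
        self.entries[doc_id] = (digest, kv)
        return kv

    def invalidate(self, doc_id: str) -> bool:
        # Explicit invalidation for when the knowledge base drops a document.
        return self.entries.pop(doc_id, None) is not None


def build_request(cache: DocKVCache, docs: list, query: str) -> dict:
    # Splice: cached document states plus the (uncached) query tokens.
    doc_states = [cache.get_or_compute(doc_id, text) for doc_id, text in docs]
    return {"doc_kv": doc_states, "query_tokens": query.split()}


cache = DocKVCache()
docs = [("manual", "Press the red button to reset"),
        ("faq", "The warranty lasts two years")]
build_request(cache, docs, "How do I reset the device?")
build_request(cache, docs, "How long is the warranty?")  # both documents reused
calls_after_reuse = cache.prefill_calls

# The manual is edited: its hash no longer matches, so only it is recomputed.
docs[0] = ("manual", "Hold the red button for five seconds to reset")
build_request(cache, docs, "How do I reset the device?")
calls_after_update = cache.prefill_calls
```

In this sketch only the query tokens are new work on a repeat request, which is where the TTFT saving would come from if documents dominate the input.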

Section 04

Technical Implementation and Performance Benefits

Key technical implementations:

  • Prefix Sharing Optimization: In a Transformer decoder, a token's KV state depends only on the tokens before it, so queries that share the same document prefix can reuse the same document KV cache.
  • Memory-Efficient Storage: May use quantization (INT8/FP8) or hierarchical storage (hot data in GPU memory, warm data in host memory/SSD) to reduce memory usage.
  • Service Deployment: Integrates with mainstream inference engines (vLLM, TensorRT-LLM) as an independent service without modifying model architecture.

Performance gains: In typical RAG scenarios (3-5 long documents), TTFT can drop from seconds to hundreds of milliseconds (about a 10x improvement), which is valuable for high-concurrency knowledge base Q&A applications.
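
The hierarchical storage option mentioned above (hot data in GPU memory, warm data in host memory/SSD) can be illustrated with a two-tier store that demotes the least recently used entry instead of dropping it. This is a sketch under assumptions: `TieredKVStore` is a hypothetical name, both tiers are plain Python containers standing in for real memory, and the post does not say which eviction policy DaseR uses.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy hot/warm store with LRU demotion (hypothetical; not DaseR's API)."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # stands in for GPU memory, oldest entry first
        self.warm = {}            # stands in for host memory or SSD

    def put(self, doc_id, kv):
        self.warm.pop(doc_id, None)   # an entry lives in exactly one tier
        self.hot[doc_id] = kv
        self.hot.move_to_end(doc_id)  # mark as most recently used
        self._demote_overflow()

    def get(self, doc_id):
        if doc_id in self.hot:
            self.hot.move_to_end(doc_id)
            return self.hot[doc_id]
        if doc_id in self.warm:
            kv = self.warm.pop(doc_id)
            self.put(doc_id, kv)      # promote back to the hot tier
            return kv
        return None                   # not cached anywhere: caller must prefill

    def _demote_overflow(self):
        while len(self.hot) > self.hot_capacity:
            old_id, old_kv = self.hot.popitem(last=False)  # least recently used
            self.warm[old_id] = old_kv


store = TieredKVStore(hot_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, f"kv-{name}")     # "a" is demoted when "c" arrives
promoted = store.get("a")             # "a" returns to hot, "b" is demoted
```

A warm hit still avoids recomputation; it only pays the cost of moving the state back into the fast tier.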

Section 05

Application Scenarios and Ecosystem Value

DaseR applies to:

  • Enterprise Knowledge Base Q&A: Reduces response delay when employees query internal documents, since the same documents are retrieved many times.
  • Customer Service Robots: Ideal for systems based on fixed product manuals/FAQs with high query volumes.
  • Legal/Medical Document Analysis: Suits professional fields where long documents are queried frequently.
  • Multi-round Dialogue RAG: Maintains cache across rounds when context documents repeat.

Section 06

Comparison with Existing KV Cache Solutions

Compared to general KV cache schemes (e.g., vLLM's Prefix Caching), DaseR differs in the following ways:

  • RAG Semantic Awareness: Understands document structure and supports fine-grained caching at the paragraph or document level.
  • Cross-session Sharing: Shares document caches across users/sessions, not just current dialogue.
  • Knowledge Base Integration: Tighter integration with vector databases and retrievers for end-to-end RAG acceleration.
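
To make the granularity difference concrete, the toy model below counts cache hits under two keying schemes: one key per prefix (every key covers everything up to that document) and one key per document. It is a simplified illustration, not a description of how vLLM or DaseR actually key their caches; `prefix_keys`, `document_keys`, and `count_hits` are hypothetical helpers.

```python
import hashlib


def _short_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def prefix_keys(docs):
    """One key per position, covering all documents up to and including it."""
    keys, running = [], ""
    for doc in docs:
        running += doc + "\n"
        keys.append(_short_hash(running))
    return keys


def document_keys(docs):
    """One key per document, independent of what precedes it."""
    return [_short_hash(doc) for doc in docs]


def count_hits(key_fn, requests):
    seen, hits = set(), 0
    for docs in requests:
        for key in key_fn(docs):
            if key in seen:
                hits += 1
            else:
                seen.add(key)
    return hits


requests = [
    ["doc A", "doc B", "doc C"],
    ["doc B", "doc A", "doc C"],  # same documents, different retrieval order
    ["doc A", "doc D"],
]
prefix_hits = count_hits(prefix_keys, requests)      # 1
document_hits = count_hits(document_keys, requests)  # 4
```

In this model a reordered retrieval result defeats prefix keys but not document keys. Note that in a causal decoder, KV state computed for a document in one context is not numerically identical to what full recomputation would give in another; the post does not say how DaseR handles that.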

Section 07

Summary and Future Outlook

DaseR represents a new direction in RAG inference optimization: a shift from general acceleration to scenario-specific optimization. It provides a targeted solution for static knowledge caching in RAG. As RAG applications grow, such specialized cache services are likely to become key LLM infrastructure components. Future directions include deep integration with RAG frameworks (LangChain, LlamaIndex), distributed cache support, and intelligent caching based on document importance.