
DaseR: A RAG-Native KV Cache Service to Accelerate LLM Inference

A KV cache service designed specifically for Retrieval-Augmented Generation (RAG). It reduces Time To First Token (TTFT) and improves long-context inference efficiency by preloading cached KV states for retrieved documents.

RAG · KV cache · inference optimization · LLM · retrieval-augmented generation · caching · performance

Published 2026-06-09 18:03 · Recent activity 2026-06-09 18:20 · Estimated read 6 min

Section 01

DaseR: A RAG-Native KV Cache Service to Accelerate LLM Inference

DaseR is a KV cache service designed specifically for Retrieval-Augmented Generation (RAG) scenarios. It reduces Time To First Token (TTFT) and improves long-context inference efficiency by preloading cached KV states for retrieved documents. This post covers its background, architecture, performance benefits, application scenarios, and future outlook.

Section 02

Project Background: Performance Bottlenecks in RAG Inference

Retrieval-Augmented Generation (RAG) has become a mainstream LLM application architecture, but it faces a distinctive performance challenge: each request must prefill a large retrieved document context before the first token can be generated, which increases TTFT and degrades the user experience. Traditional KV cache mechanisms are optimized for conversation history and lack an efficient strategy for static knowledge (e.g., product manuals, technical docs) that repeats across queries. DaseR addresses this gap as a RAG-native KV cache service.

Section 03

Core Architecture: Decoupling Static Knowledge and Dynamic Queries

DaseR's core design decouples static parts (retrieved documents) and dynamic parts (user queries) in RAG inference:

Document-level KV Cache: Persistently stores the key-value (KV) states of retrieved documents and reuses them whenever the same documents are retrieved again, avoiding repeated computation.
Dynamic Query Splicing: Splices each user query onto the cached document KV states. Because documents typically account for more than 80% of input tokens, skipping their computation removes most of the prefill work and reduces TTFT.
Cache Consistency Management: Provides invalidation and update mechanisms so that, when the knowledge base changes, only the affected document caches are refreshed.
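
The document-level cache and its invalidation path can be sketched as a small in-memory store. This is an illustration under stated assumptions, not DaseR's actual interface: `DocumentKVCache`, `get_or_compute`, `invalidate`, and the `compute_kv` callback are hypothetical names, and the "KV state" is an opaque placeholder rather than real attention tensors. Query splicing is not modeled here.

```python
import hashlib
from typing import Any, Callable, Dict, List, Tuple


class DocumentKVCache:
    """Toy document-level KV cache (hypothetical; not DaseR's real interface)."""

    def __init__(self, compute_kv: Callable[[str], Any]) -> None:
        # compute_kv stands in for the expensive prefill pass over one document.
        self._compute_kv = compute_kv
        # doc_id -> (content fingerprint, cached KV state)
        self._entries: Dict[str, Tuple[str, Any]] = {}

    @staticmethod
    def _fingerprint(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, doc_id: str, text: str) -> Any:
        """Return the cached KV state, recomputing if the document text changed."""
        fingerprint = self._fingerprint(text)
        entry = self._entries.get(doc_id)
        if entry is not None and entry[0] == fingerprint:
            return entry[1]  # cache hit: prefill for this document is skipped
        kv_state = self._compute_kv(text)
        self._entries[doc_id] = (fingerprint, kv_state)
        return kv_state

    def invalidate(self, doc_id: str) -> bool:
        """Drop one document's entry, e.g. after a knowledge base update."""
        return self._entries.pop(doc_id, None) is not None


# Usage: a fake prefill function records how often it is actually invoked.
prefill_calls: List[str] = []


def fake_prefill(text: str) -> Tuple[str, int]:
    prefill_calls.append(text)
    return ("kv-state", len(text))


cache = DocumentKVCache(fake_prefill)
first = cache.get_or_compute("manual", "Press the red button to reset.")
second = cache.get_or_compute("manual", "Press the red button to reset.")
```

The second lookup returns the stored state without calling `fake_prefill` again. Storing a content fingerprint next to each entry means an edited document is recomputed even if nobody calls `invalidate` explicitly, which is one possible way to keep only the affected entries fresh.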

Section 04

Technical Implementation and Performance Benefits

Key technical implementations:

Prefix Sharing Optimization: With causal attention in a Transformer decoder, the KV states of a prefix depend only on the tokens in that prefix, so queries that share the same document prefix can reuse its cached KV states.
Memory-Efficient Storage: May use quantization (INT8/FP8) or hierarchical storage (hot data in GPU memory, warm data in host memory or on SSD) to reduce memory usage.
Service Deployment: Integrates with mainstream inference engines (vLLM, TensorRT-LLM) as an independent service, without modifying the model architecture.

Performance gains: In typical RAG scenarios (3-5 long documents), TTFT can drop from seconds to hundreds of milliseconds, roughly a 10x improvement, which is valuable for high-concurrency knowledge base Q&A applications.
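
A back-of-envelope estimate helps put these figures in context. The sketch below assumes, as a simplification that does not come from the source, that prefill time grows linearly with the number of tokens whose KV states must be computed, and that loading cached states costs a fixed fraction of the uncached prefill time. `estimated_ttft_speedup` and its parameters are hypothetical names.

```python
def estimated_ttft_speedup(cached_fraction: float, load_overhead: float = 0.0) -> float:
    """Estimate TTFT speedup under a linear prefill-cost model (an assumption).

    cached_fraction: share of input tokens whose KV states come from the cache.
    load_overhead: time to fetch cached states, as a fraction of the
        uncached prefill time.
    """
    if not 0.0 <= cached_fraction < 1.0:
        raise ValueError("cached_fraction must be in [0, 1)")
    if load_overhead < 0.0:
        raise ValueError("load_overhead must be non-negative")
    # Time left after caching, relative to the uncached prefill time of 1.0.
    remaining = (1.0 - cached_fraction) + load_overhead
    return 1.0 / remaining


for fraction in (0.8, 0.9, 0.95):
    speedup = estimated_ttft_speedup(fraction)
    print(f"{fraction:.0%} of tokens cached -> about {speedup:.1f}x")
```

Under this model, caching 80% of the input tokens gives about a 5x reduction and caching 90% gives about 10x. A 10x gain therefore implies that documents make up roughly nine tenths of the prompt and that loading cached states is cheap compared with recomputing them.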

Section 05

Application Scenarios and Ecosystem Value

DaseR applies to:

Enterprise Knowledge Base Q&A: Reduces response latency when employees query internal documents, since the same documents are retrieved many times.
Customer Service Bots: Suits systems built on fixed product manuals or FAQs with high query volumes.
Legal/Medical Document Analysis: Benefits professional fields where documents are long and queried frequently.
Multi-turn Dialogue RAG: Keeps the cache across turns when the context documents repeat.

Section 06

Comparison with Existing KV Cache Solutions

Compared with general-purpose KV cache schemes (e.g., vLLM's Prefix Caching), DaseR differs in three ways:

RAG Semantic Awareness: Understands document structure and supports fine-grained caching at the paragraph or document level.
Cross-session Sharing: Shares document caches across users and sessions, not just within the current dialogue.
Knowledge Base Integration: Integrates more tightly with vector databases and retrievers for end-to-end RAG acceleration.
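
One way to picture the difference in caching granularity is to compare how cache keys could be derived. The sketch below is a simplified model, not the actual hashing used by vLLM or DaseR. `prefix_chain_keys` mimics prefix-style caching, where each key depends on everything that came before it, while `per_document_keys` keys each document by its own content. All function names are hypothetical. The sketch also ignores whether KV states computed at one position can be reused at another, which depends on the model's position encoding and attention pattern.

```python
import hashlib
from typing import List


def _digest(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:12]


def prefix_chain_keys(docs: List[str]) -> List[str]:
    """Key i covers docs[0..i]; any change earlier in the prompt changes it."""
    keys: List[str] = []
    running = ""
    for doc in docs:
        running = _digest(running + "\x00" + doc)
        keys.append(running)
    return keys


def per_document_keys(docs: List[str]) -> List[str]:
    """Key i depends only on docs[i], regardless of order or neighbours."""
    return [_digest(doc) for doc in docs]


def reusable(cached: List[str], requested: List[str]) -> int:
    """Count requested keys that are already present in the cache."""
    return len(set(cached) & set(requested))


# Two sessions retrieve the same three documents in a different order.
session_a = ["returns policy", "warranty terms", "shipping FAQ"]
session_b = ["warranty terms", "returns policy", "shipping FAQ"]

prefix_hits = reusable(prefix_chain_keys(session_a), prefix_chain_keys(session_b))
document_hits = reusable(per_document_keys(session_a), per_document_keys(session_b))
```

In this toy model the reordered session shares no keys with the first under the chained scheme, but all three under per-document keys. That is the kind of reuse across users and sessions that document-level caching is meant to capture.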

Section 07

Summary and Future Outlook

DaseR represents a new direction in RAG inference optimization: a shift from general acceleration to scenario-specific optimization. It provides a targeted solution for caching static knowledge in RAG. As RAG applications grow, such specialized cache services are likely to become key components of LLM infrastructure. Future directions include deep integration with RAG frameworks (LangChain, LlamaIndex), distributed cache support, and intelligent caching based on document importance.
