# CacheFlow: A Multi-Request LLM Inference Optimization Engine Based on llama.cpp

> CacheFlow is a multi-request inference optimization engine built on top of llama.cpp. It improves throughput and latency under concurrent load through continuous batching, a concurrency-aware scheduler, and block-based KV cache management.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-06T03:43:02.000Z
- Last activity: 2026-06-06T03:50:33.358Z
- Popularity: 161.9
- Keywords: LLM, inference optimization, PagedAttention, KV cache, continuous batching, CUDA, llama.cpp, GPU acceleration, concurrent scheduling
- Page link: https://www.zingnex.cn/en/forum/thread/cacheflow-llama-cpp-llm
- Canonical: https://www.zingnex.cn/forum/thread/cacheflow-llama-cpp-llm

---

## Original Author and Source

- **Original Author/Maintainer:** yupengtang
- **Source Platform:** GitHub
- **Original Title:** CacheFlow
- **Original Link:** https://github.com/yupengtang/CacheFlow
- **Publication Date:** June 6, 2026

---

## Project Background and Positioning

In LLM inference services, processing requests one at a time often leaves GPU compute underused. As the number of concurrent requests grows, three things largely determine inference performance: how efficiently the KV cache is managed, how requests are batched and scheduled, and how much memory is lost to fragmentation.

CacheFlow is an open-source inference optimization engine designed to address these issues. Built on the popular llama.cpp project, it redesigns the autoregressive decoding path to support continuous batching and block-level KV cache management, with a reported 1.5-2.0x throughput improvement in concurrent scenarios.

---

## 1. Continuous Batching and Concurrency-Aware Scheduler

One of CacheFlow's core features is continuous batching. Unlike static batching, where a batch is fixed until every sequence in it finishes, CacheFlow's scheduler can add or remove sequences at each decoding step, so a slot freed by a finished sequence is reused immediately and the GPU stays busy.
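
The step-level admission described above can be illustrated with a small simulation. This is a minimal sketch, not CacheFlow's actual scheduler: `Sequence`, `tokens_left`, and `run_continuous_batching` are hypothetical names, and a fixed token budget stands in for end-of-sequence detection.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sequence:
    seq_id: int
    tokens_left: int  # decode steps still needed (stand-in for EOS detection)


def run_continuous_batching(arrivals, max_batch):
    """Simulate step-level batching.

    arrivals maps a step index to the sequences that arrive at that step.
    Returns {seq_id: step count at which the sequence finished}.
    """
    waiting = deque()
    running = []
    finished_at = {}
    pending = sum(len(seqs) for seqs in arrivals.values())
    step = 0
    while pending or waiting or running:
        for seq in arrivals.get(step, []):
            waiting.append(seq)
            pending -= 1
        # Admit waiting sequences into free slots before every decode step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: each running sequence emits one token.
        for seq in running:
            seq.tokens_left -= 1
        step += 1
        # Retire finished sequences right away so their slots free up.
        for seq in running:
            if seq.tokens_left == 0:
                finished_at[seq.seq_id] = step
        running = [seq for seq in running if seq.tokens_left > 0]
    return finished_at
```

With `max_batch=2`, a short request that arrives while two longer ones are running takes over a slot as soon as the first of them finishes, instead of waiting for the whole batch to drain as it would under static batching.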

The scheduler supports multiple scheduling strategies:

- **FCFS (First-Come, First-Served):** Serves requests in arrival order
- **SJF (Shortest Job First):** Runs the request with the least remaining work first to reduce average completion time
- **Priority Scheduling:** Lets high-priority requests preempt lower-priority ones
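
One way to picture the three strategies is as different sort keys over the same queue. The sketch below assumes a hypothetical `Request` record and the convention that a larger `priority` value is more urgent; neither is taken from the CacheFlow source.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    arrival: float         # arrival timestamp in seconds
    remaining_tokens: int  # estimated decode steps left
    priority: int = 0      # larger means more urgent (assumed convention)


# Each policy is just a sort key; arrival time breaks ties.
POLICIES = {
    "fcfs": lambda r: r.arrival,
    "sjf": lambda r: (r.remaining_tokens, r.arrival),
    "priority": lambda r: (-r.priority, r.arrival),
}


def order_requests(requests, policy):
    """Return the requests in the order the given policy would serve them."""
    return sorted(requests, key=POLICIES[policy])
```

Because SJF here keys on the remaining token estimate, re-sorting at every decoding step makes it behave like shortest-remaining-time-first.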

When GPU memory runs short, the scheduler preempts low-priority sequences and swaps their KV blocks to CPU memory. Once memory frees up, it swaps the blocks back and resumes those sequences without recomputing their KV state.
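
The preempt, swap out, and resume cycle can be sketched with simple block accounting. `SwapManager` and its methods are hypothetical names used for illustration; a real engine would also copy the KV data between device and host memory at the marked points.

```python
class SwapManager:
    """Tracks which sequences hold GPU blocks and which are parked in CPU memory."""

    def __init__(self, gpu_blocks):
        self.free_gpu = gpu_blocks
        self.on_gpu = {}  # seq_id -> (priority, num_blocks)
        self.on_cpu = {}  # seq_id -> (priority, num_blocks)

    def admit(self, seq_id, priority, num_blocks):
        # Only strictly lower-priority residents may be preempted.
        victims = sorted((p, s) for s, (p, _) in self.on_gpu.items() if p < priority)
        reclaimable = sum(self.on_gpu[s][1] for _, s in victims)
        if self.free_gpu + reclaimable < num_blocks:
            return False  # would not fit even after preemption, so evict nothing
        for _, victim in victims:
            if self.free_gpu >= num_blocks:
                break
            self.swap_out(victim)
        self.on_gpu[seq_id] = (priority, num_blocks)
        self.free_gpu -= num_blocks
        return True

    def swap_out(self, seq_id):
        # A real engine would copy the KV blocks to host memory here.
        entry = self.on_gpu.pop(seq_id)
        self.on_cpu[seq_id] = entry
        self.free_gpu += entry[1]

    def swap_in(self, seq_id):
        # Resume a parked sequence; its KV state is restored, not recomputed.
        priority, num_blocks = self.on_cpu[seq_id]
        if self.free_gpu < num_blocks:
            return False
        del self.on_cpu[seq_id]
        self.on_gpu[seq_id] = (priority, num_blocks)
        self.free_gpu -= num_blocks
        return True

    def finish(self, seq_id):
        _, num_blocks = self.on_gpu.pop(seq_id)
        self.free_gpu += num_blocks
```

Checking that the request fits before evicting anything avoids swapping out a victim for an admission that would fail anyway.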

## 2. Block-Based KV Cache Management (PagedAttention)

CacheFlow uses a PagedAttention-style layout that stores the KV cache in fixed-size blocks instead of allocating one contiguous region per sequence. This design offers several advantages:

- **Eliminates external fragmentation:** Because every block has the same size, any free block can serve any sequence, and waste is limited to the unfilled tail of each sequence's last block
- **Copy-on-Write (COW) sharing:** Multiple requests can share the same KV block until modification is needed
- **Prefix-aware caching:** Automatically reuses KV blocks with shared prefixes via a Trie-based lookup mechanism
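
Copy-on-write sharing is commonly implemented with per-block reference counts, which the following sketch assumes. `BlockAllocator`, `fork`, and `writable_last_block` are hypothetical names, and the actual KV data copy is only indicated by a comment.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks how many tables use each."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


def fork(allocator, block_table):
    """Create a child block table that shares every physical block with the parent."""
    return [allocator.share(block) for block in block_table]


def writable_last_block(allocator, block_table):
    """Return a block that is safe to append to, copying first if it is shared."""
    last = block_table[-1]
    if allocator.refcount[last] == 1:
        return last
    fresh = allocator.alloc()
    # A real engine would copy the KV contents of `last` into `fresh` here.
    allocator.release(last)
    block_table[-1] = fresh
    return fresh
```

After a fork, only the block being appended to is duplicated; all earlier blocks stay shared between parent and child.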

The block table, which maps each sequence's logical blocks to physical blocks, makes memory management more flexible and is reported to reduce latency variance by over 30% under long-running workloads.
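
The Trie-based prefix lookup mentioned in the list above can be sketched at block granularity: each trie edge is one full block of token IDs, and each node remembers the physical block that holds the KV data for that chunk. `PrefixCache` and its methods are hypothetical names, and caching only complete blocks is an assumption of this sketch.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}    # tuple of token IDs -> PrefixNode
        self.block_id = None  # physical block holding this chunk's KV data


class PrefixCache:
    def __init__(self, block_size):
        self.block_size = block_size
        self.root = PrefixNode()

    def _chunks(self, tokens):
        # Only complete blocks are cacheable; a partial tail is skipped.
        for i in range(len(tokens) // self.block_size):
            yield tuple(tokens[i * self.block_size:(i + 1) * self.block_size])

    def insert(self, tokens, block_ids):
        """Record which physical block stores each full chunk of `tokens`."""
        node = self.root
        for chunk, block_id in zip(self._chunks(tokens), block_ids):
            node = node.children.setdefault(chunk, PrefixNode())
            if node.block_id is None:
                node.block_id = block_id

    def match(self, tokens):
        """Return the physical blocks reusable for the longest cached prefix."""
        node = self.root
        hits = []
        for chunk in self._chunks(tokens):
            node = node.children.get(chunk)
            if node is None:
                break
            hits.append(node.block_id)
        return hits
```

A new request calls `match` on its prompt, maps the returned blocks into its block table, and only runs prefill for the tokens after the matched prefix.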

## 3. Optimized CUDA Kernel Implementation

CacheFlow includes custom CUDA kernels optimized for different context lengths:

- **Paged Attention V1:** Suitable for short contexts (≤8K tokens), uses one warp per head to reduce partitioning overhead
- **Paged Attention V2:** Suitable for long contexts, uses a partitioning plus reduction strategy to fully utilize parallelism in the sequence dimension
- **Fused operation kernels:** Includes reshape-and-cache, block copy/swap, cache compression, etc.

These kernels reduce redundant memory movement by merging memory access patterns, further improving inference efficiency.
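
The partitioning plus reduction strategy attributed to Paged Attention V2 rests on the fact that softmax attention can be computed per partition and merged exactly, as long as each partition keeps its local maximum and exponential sum. The sketch below shows that arithmetic in plain Python for a single query; it is not a CUDA kernel, and the function names are hypothetical.

```python
import math


def attention_reference(scores, values):
    """Softmax-weighted sum of value vectors, computed in one pass."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_partitioned(scores, values, partition_size):
    """Same result, computed per partition and then merged by a reduction."""
    dim = len(values[0])
    partials = []
    for start in range(0, len(scores), partition_size):
        s = scores[start:start + partition_size]
        v = values[start:start + partition_size]
        local_max = max(s)
        w = [math.exp(x - local_max) for x in s]
        acc = [sum(wi * vi[d] for wi, vi in zip(w, v)) for d in range(dim)]
        partials.append((local_max, sum(w), acc))

    # Reduction: rescale every partition to the global maximum before summing.
    global_max = max(local_max for local_max, _, _ in partials)
    denom = 0.0
    out = [0.0] * dim
    for local_max, exp_sum, acc in partials:
        scale = math.exp(local_max - global_max)
        denom += exp_sum * scale
        for d in range(dim):
            out[d] += acc[d] * scale
    return [x / denom for x in out]
```

On a GPU, each partition would typically map to its own thread block, which is how a long sequence gains parallelism along the sequence dimension; the reduction step is the extra cost that makes a single-pass kernel preferable for short contexts.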

## 4. System-Level Performance Analysis Framework

CacheFlow includes a built-in profiler that tracks the following metrics:

- **TTFT (Time to First Token):** Time from request arrival to the first output token
- **TPOT (Time per Output Token):** Average time to generate each token after the first
- **Throughput:** Number of tokens generated per second
- **KV Cache Utilization:** Share of the KV cache memory that is in use
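
These metrics follow directly from per-token timestamps. The helper names below are hypothetical, and defining KV cache utilization as used blocks over total blocks is an assumption of this sketch.

```python
def request_metrics(arrival, token_times):
    """Return (ttft, tpot) in seconds for one request.

    arrival: time the request was received.
    token_times: timestamp of each generated output token, in order.
    """
    ttft = token_times[0] - arrival
    if len(token_times) > 1:
        # Average gap between consecutive tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


def throughput(total_tokens, wall_seconds):
    """Generated tokens per second across all requests."""
    return total_tokens / wall_seconds


def kv_utilization(used_blocks, total_blocks):
    """Fraction of KV cache blocks currently assigned to sequences (assumed definition)."""
    return used_blocks / total_blocks
```

For a request arriving at t = 10.0 s with tokens at 10.5, 10.6, 10.7, and 10.8 s, TTFT is 0.5 s and TPOT is about 0.1 s.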

The profiler can export timeline data in Chrome Trace JSON format for detailed inspection. It also supports scalability sweeps from 1 to 16 concurrent requests to help users find the best configuration for their hardware.
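
For readers unfamiliar with the Chrome Trace format, the sketch below writes spans as complete events (`"ph": "X"`) with timestamps and durations in microseconds, which trace viewers such as `chrome://tracing` can load. The span tuple layout and the choice to use the sequence ID as the thread ID are assumptions for illustration, not CacheFlow's actual output schema.

```python
import json


def to_chrome_trace(spans):
    """Serialize spans to Chrome Trace JSON.

    spans: list of (name, seq_id, start_seconds, end_seconds).
    """
    events = []
    for name, seq_id, start_s, end_s in spans:
        events.append({
            "name": name,
            "ph": "X",                               # complete event: start plus duration
            "ts": round(start_s * 1e6),              # microseconds
            "dur": round((end_s - start_s) * 1e6),   # microseconds
            "pid": 0,
            "tid": seq_id,                           # one timeline row per sequence
        })
    return json.dumps({"traceEvents": events}, indent=2)
```

Writing the returned string to a `.json` file and loading it in a trace viewer shows one row per sequence, with prefill and decode spans laid out on a shared time axis.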

---

## Memory Management Strategy

CacheFlow uses a slab allocator to manage GPU memory, combined with a defragmentation mechanism that keeps memory usage stable during long runs. When it detects that fragmentation has accumulated, the system automatically runs a compaction pass that moves scattered blocks into contiguous physical locations.
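
A compaction pass of this kind can be sketched over a flat list of slots. The fragmentation measure used here (the share of free slots that sit below the highest live slot) is an assumption chosen for illustration, as are the function names; CacheFlow's actual trigger condition is not described in the source.

```python
def fragmentation(slots):
    """Fraction of free slots that are holes below the highest live slot.

    slots: list where None marks a free slot and any other value is the owner.
    """
    live = [i for i, owner in enumerate(slots) if owner is not None]
    free = len(slots) - len(live)
    if not live or free == 0:
        return 0.0
    holes = (live[-1] + 1) - len(live)
    return holes / free


def compact(slots):
    """Pack live blocks into the lowest slots.

    Returns (new_slots, relocation), where relocation maps old index -> new
    index for every block that moved, so block tables can be updated.
    """
    new_slots = [None] * len(slots)
    relocation = {}
    dst = 0
    for src, owner in enumerate(slots):
        if owner is not None:
            new_slots[dst] = owner
            if src != dst:
                relocation[src] = dst
            dst += 1
    return new_slots, relocation
```

The relocation map matters because block tables hold physical indices: every table entry that pointed at a moved block has to be rewritten before decoding resumes.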
