Section 01
ScoutAttention: A Guide to the LLM Inference Acceleration Scheme for Efficient KV Cache Offloading
ScoutAttention is a KV cache offloading framework for long-context LLM inference. By combining a GPU-CPU collaborative block-level sparse attention mechanism with a pre-layer CPU precomputation algorithm, it achieves a 2.1x speedup over existing offloading methods with an accuracy loss of only 2.4%, effectively addressing the GPU memory bottleneck in long-context inference.
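To make the block-level sparse attention idea concrete, the following is a minimal NumPy sketch, not ScoutAttention's actual implementation: the KV cache is split into fixed-size blocks held in (simulated) CPU memory, while a small per-block summary (here, the mean key of each block) stays GPU-resident so a query can cheaply score blocks, fetch only the top-k candidates, and run exact attention over that subset. All names, the block size, and the summary/scoring choices are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only -- block sizes, summary function (mean key),
# and top-k selection are assumptions, not ScoutAttention's real design.
BLOCK, D, TOPK = 16, 64, 2

rng = np.random.default_rng(0)
n_tokens = 8 * BLOCK
keys = rng.standard_normal((n_tokens, D)).astype(np.float32)    # offloaded to CPU
values = rng.standard_normal((n_tokens, D)).astype(np.float32)  # offloaded to CPU

# Split the KV cache into blocks; keep only tiny per-block summaries "on GPU".
k_blocks = keys.reshape(-1, BLOCK, D)     # (n_blocks, BLOCK, D)
v_blocks = values.reshape(-1, BLOCK, D)
summaries = k_blocks.mean(axis=1)         # (n_blocks, D), small resident footprint

def sparse_attention(q):
    # 1. Score each block cheaply via its summary; select top-k blocks.
    block_scores = summaries @ q                      # (n_blocks,)
    top = np.argsort(block_scores)[-TOPK:]
    # 2. "Fetch" only the selected blocks from CPU to GPU.
    k = k_blocks[top].reshape(-1, D)
    v = v_blocks[top].reshape(-1, D)
    # 3. Exact softmax attention over the fetched subset only.
    logits = (k @ q) / np.sqrt(D)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v

q = rng.standard_normal(D).astype(np.float32)
out = sparse_attention(q)
print(out.shape)  # attention output for one query head
```

Because only TOPK of the blocks cross the CPU-GPU boundary per query, transfer volume and attention FLOPs shrink proportionally, which is the general mechanism by which block-sparse offloading schemes trade a small accuracy loss for speed.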