Section 01
DUAL-BLADE: Guide to the KV Cache Offloading Framework for Edge Devices
This article introduces DUAL-BLADE, a dual-path KV cache residency framework for edge AI systems, which targets the limited memory available for LLM inference on edge devices. By dynamically routing KV tensors to either the page cache path or the NVMe direct access path, the framework bypasses file system overhead on the direct path and achieves low-latency storage access, reducing latency by 33.1% in the prefill phase and 42.4% in the decoding phase.
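The core idea above, deciding per KV block whether it should live on the buffered (page cache) path or the NVMe direct access path, can be sketched as a simple routing policy. This is a minimal illustration, not DUAL-BLADE's actual algorithm: the `KVBlock` fields, the hit-count heuristic, and the `hot_threshold` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical sketch of dual-path KV residency routing.
# "page_cache" = buffered file I/O, where the kernel page cache absorbs reuse;
# "direct"     = O_DIRECT-style NVMe access that bypasses the page cache
#                (and its file system overhead) for streamed-once data.

@dataclass
class KVBlock:
    layer: int
    seq_id: int
    hits: int = 0  # how often this block was re-read during decoding

class DualPathRouter:
    def __init__(self, hot_threshold: int = 2):
        # Assumed policy knob: blocks reused at least this often are "hot".
        self.hot_threshold = hot_threshold

    def route(self, block: KVBlock) -> str:
        # Frequently reused blocks benefit from page-cache residency;
        # cold blocks take the direct path to avoid polluting the page
        # cache and paying buffered-copy overhead.
        if block.hits >= self.hot_threshold:
            return "page_cache"
        return "direct"

router = DualPathRouter()
cold = KVBlock(layer=0, seq_id=1, hits=0)
hot = KVBlock(layer=0, seq_id=2, hits=5)
print(router.route(cold))  # direct
print(router.route(hot))   # page_cache
```

In a real implementation the direct path would open the backing file with `O_DIRECT` and use page-aligned buffers, while the buffered path would rely on ordinary reads served from the kernel page cache.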