Section 01
Cascade: Breaking GPU Memory Limits, Extending Large Model Context Windows with Disk KV Caching
The Cascade project proposes a disk-backed KV caching technique. It spreads the KV cache across a storage hierarchy of GPU memory, system memory, and disk, which addresses the GPU memory bottleneck that arises because the KV cache in the Transformer architecture grows linearly with context length. This allows the context window of large language models to be extended well beyond what GPU memory alone can hold, supporting ultra-long-context scenarios such as long-document processing and codebase analysis.
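
To make the linear growth concrete, the following is a rough back-of-the-envelope sketch of KV cache size as a function of context length. It is not taken from Cascade itself: the function name `kv_cache_bytes` and the model configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, and a real model may differ, for example by using fewer KV heads or a quantized cache.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only Transformer.

    Each layer stores one key and one value vector per token and KV head,
    hence the leading factor of 2. The result scales linearly with seq_len.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)


# Hypothetical configuration (an assumption, not a Cascade parameter):
# 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes per element).
for n in (4096, 32768, 131072):
    gib = kv_cache_bytes(n, num_layers=32, num_kv_heads=32, head_dim=128) / 2**30
    print(f"{n:>7} tokens -> {gib:.1f} GiB")
# prints:
#    4096 tokens -> 2.0 GiB
#   32768 tokens -> 16.0 GiB
#  131072 tokens -> 64.0 GiB
```

Under these assumptions each token costs 0.5 MiB of cache, so a 128K-token context would need about 64 GiB for the KV cache alone, before counting model weights. That is the kind of gap that spilling colder parts of the cache to system memory and disk is meant to close.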