During inference, a large language model must attend over the Key-Value (KV) cache of all previous tokens at every generation step. As the conversation grows, the GPU memory occupied by the KV cache grows linearly with context length, which not only caps the maximum context the model can handle but also becomes a major performance bottleneck in multi-turn dialogue. When the KV cache must be transmitted between compute nodes (e.g., in distributed inference or prefix-cache sharing scenarios), its sheer size causes severe network latency.
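To make the linear growth concrete, the following back-of-envelope calculation estimates KV cache size for a LLaMA-7B-like configuration (32 layers, 32 heads, head dimension 128, fp16). These model dimensions are illustrative assumptions, not figures from the CacheGen paper:

```python
# Rough KV cache size for a LLaMA-7B-like model (illustrative
# assumptions: 32 layers, 32 heads, head_dim 128, fp16 weights).
def kv_cache_bytes(num_tokens: int, num_layers: int = 32,
                   num_heads: int = 32, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # Factor of 2: one Key tensor and one Value tensor per layer.
    return 2 * num_layers * num_heads * head_dim * bytes_per_elem * num_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**30:.2f} GiB")
```

Under these assumptions each token costs about 0.5 MiB, so a 100K-token context needs roughly 49 GiB of KV cache alone, which illustrates why both storage and transfer become bottlenecks.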
CacheGen is a solution designed to address this problem: it compresses the KV cache efficiently and streams it over the network, substantially reducing transmission overhead while preserving model output quality. This GitHub repository is an open-source reproduction of the CacheGen paper.
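CacheGen's actual codec is more sophisticated (the paper combines layer-wise delta encoding with arithmetic coding); as a minimal sketch of the underlying idea of lossy KV cache compression, a simple uniform 8-bit quantizer already shows the size/accuracy trade-off on a toy tensor:

```python
import numpy as np

# Minimal sketch of lossy KV compression via uniform 8-bit quantization.
# This is NOT CacheGen's codec; it only illustrates trading a small
# reconstruction error for a 4x reduction (fp32 -> uint8).
def quantize(x: np.ndarray, bits: int = 8):
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 64), dtype=np.float32)  # toy K/V slice
q, lo, scale = quantize(kv)
restored = dequantize(q, lo, scale)
print("compression ratio:", kv.nbytes / q.nbytes)
print("max abs error:", float(np.abs(kv - restored).max()))
```

The round-trip error is bounded by half the quantization step, which is why moderate quantization of KV tensors tends to have little effect on output quality; CacheGen pushes far beyond this baseline with entropy coding and streaming-aware bitrate adaptation.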