Section 01
FADE Technology Guide: Attention-Aware Hierarchical KV Cache Compression for LLM Long-Context Inference
FADE (Frequency-Adaptive Decay Encoding) is an attention-aware hierarchical KV cache compression technique for LLM inference. By storing different tokens' KV entries at different precisions according to their importance, it achieves a 3-8x KV cache compression ratio while maintaining near-baseline output quality, effectively addressing the memory bottleneck of long-context inference. Its core innovations are a hierarchical cache architecture and flexible eviction strategies that adapt to different application scenarios.
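The guide has not yet specified FADE's exact tiering policy, so the following is only a minimal sketch of the general idea of attention-aware precision assignment: rank tokens by an attention-based importance score, keep the most-attended tokens at full precision, and quantize the rest into progressively lower-precision tiers. All function names, tier fractions, and bit widths here are illustrative assumptions, not FADE's actual parameters.

```python
# Toy sketch of attention-aware hierarchical precision assignment.
# Assumption: each cached token has a scalar attention score; the exact
# tier fractions (hi_frac, mid_frac) and bit widths are hypothetical.

BITS = {"fp16": 16, "int8": 8, "int4": 4}  # bits per stored value in each tier


def assign_precision_tiers(attn_scores, hi_frac=0.25, mid_frac=0.25):
    """Rank tokens by attention score; top hi_frac stay fp16,
    next mid_frac drop to int8, the rest drop to int4."""
    n = len(attn_scores)
    order = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    hi_cut = max(1, int(n * hi_frac))
    mid_cut = hi_cut + max(1, int(n * mid_frac))
    tiers = [""] * n
    for rank, idx in enumerate(order):
        if rank < hi_cut:
            tiers[idx] = "fp16"
        elif rank < mid_cut:
            tiers[idx] = "int8"
        else:
            tiers[idx] = "int4"
    return tiers


def compression_ratio(tiers):
    """Ratio of a uniform fp16 cache to the tiered cache, per value."""
    used_bits = sum(BITS[t] for t in tiers)
    return (16 * len(tiers)) / used_bits


# Usage: 8 cached tokens with varying attention scores.
scores = [0.40, 0.02, 0.25, 0.01, 0.15, 0.05, 0.09, 0.03]
tiers = assign_precision_tiers(scores)
ratio = compression_ratio(tiers)
```

With these illustrative fractions the toy example yields a 2x ratio; higher ratios in the 3-8x range quoted above would correspond to smaller high-precision tiers or to evicting the least-attended tokens outright, which this sketch omits.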