Section 01
MoE-nD: Achieving 14x KV Cache Compression with a Hierarchical Mixture-of-Experts Strategy While Preserving Long-Text Inference Performance
MoE-nD breaks the bottleneck of traditional uniform compression methods by assigning each Transformer layer its own KV cache compression strategy rather than applying a single ratio across the whole model. Even at a 14x compression ratio it preserves the original model's performance, paving the way for practical long-text large language model inference.
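The article does not spell out how the per-layer strategies are implemented, so the following is only a minimal sketch of the general idea of layer-differentiated KV cache compression, using a heavy-hitter-style token selection as a stand-in for whatever criterion MoE-nD actually uses. The function name `compress_kv_per_layer`, the `layer_keep_ratios` schedule, and the attention-score proxy are all illustrative assumptions, not the paper's method.

```python
import torch

def compress_kv_per_layer(keys, values, attn_scores, keep_ratio):
    """Keep only the highest-scoring tokens in one layer's KV cache.

    keys/values: [batch, heads, seq_len, head_dim]
    attn_scores: [batch, heads, seq_len] cumulative attention each cached
                 token has received (a common importance proxy, assumed here).
    keep_ratio:  fraction of tokens this layer retains (layer-specific).
    """
    seq_len = keys.shape[2]
    keep = max(1, int(seq_len * keep_ratio))
    # Select the `keep` most-attended cached positions per head.
    idx = attn_scores.topk(keep, dim=-1).indices              # [b, h, keep]
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, idx), values.gather(2, idx)

# Hypothetical per-layer budgets: early layers keep more context, deeper
# layers are compressed harder. The real schedule (and how it averages
# out to ~14x overall) is not given in the article.
layer_keep_ratios = [0.25, 0.25, 0.10, 0.05, 0.05, 0.02]

compressed_cache = []
for layer_id, ratio in enumerate(layer_keep_ratios):
    b, h, s, d = 1, 8, 4096, 64
    k = torch.randn(b, h, s, d)
    v = torch.randn(b, h, s, d)
    scores = torch.rand(b, h, s)   # stand-in for accumulated attention weights
    compressed_cache.append(compress_kv_per_layer(k, v, scores, ratio))
```

The only point the sketch is meant to convey is the structural one from the summary: the retention budget is a per-layer parameter rather than a single global setting, which is what distinguishes a differentiated scheme from uniform compression.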