Section 01
Introduction: EKVA, Expert-Aware KV Cache Budget Allocation for Sparse MoE Large Language Models
EKVA is a project that allocates the KV cache budget in an expert-aware way during inference of sparse Mixture-of-Experts (MoE) large language models. It relies on Triton kernel optimization guided by the Roofline model and significantly improves inference efficiency.
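For readers unfamiliar with the Roofline model mentioned above, the sketch below shows the general bound it expresses: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. This is the textbook formulation, not EKVA's own code; the function names and hardware figures are hypothetical and chosen only for illustration.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * intensity).

    peak_flops           -- peak compute throughput in FLOP/s
    mem_bandwidth        -- memory bandwidth in bytes/s
    arithmetic_intensity -- FLOPs performed per byte moved
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


def is_memory_bound(peak_flops: float, mem_bandwidth: float,
                    arithmetic_intensity: float) -> bool:
    """True when the kernel sits on the bandwidth-limited slope."""
    return arithmetic_intensity < ridge_point(peak_flops, mem_bandwidth)


# Hypothetical accelerator figures (assumptions, not taken from EKVA).
PEAK_FLOPS = 100e12     # 100 TFLOP/s
MEM_BANDWIDTH = 1e12    # 1 TB/s

# A low-intensity kernel (2 FLOP/byte) is limited by memory bandwidth.
low_intensity = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 2.0)

# A high-intensity kernel (500 FLOP/byte) is limited by peak compute.
high_intensity = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 500.0)
```

Kernels that read a large KV cache while doing little arithmetic per byte tend to fall on the bandwidth-limited side of the ridge point, which is why the size of the cache a kernel must read matters for throughput.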