Section 01
HybridGen: A Hybrid Computing Architecture Breaking Through Long Context Inference Bottlenecks of Large Models
HybridGen addresses the KV cache bottleneck in long-context LLM inference through a CPU-GPU collaborative attention mechanism combined with CXL memory expansion. It achieves speedups of 1.41x to 3.2x and offers a new direction for AI system optimization in heterogeneous computing environments.
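The core idea of CPU-GPU collaborative attention is that the KV cache can be partitioned, with each device computing attention over its own partition and the partial results merged exactly afterward. The sketch below illustrates that merge step with the standard log-sum-exp trick; it is a minimal NumPy illustration under assumed semantics, not HybridGen's actual implementation, and the `merged_attention` / partition names are hypothetical.

```python
import numpy as np

def partial_attention(q, k, v):
    """Softmax attention over one KV partition.

    Returns the partition's (normalized) attention output and the
    log-sum-exp of its scores, which is enough information to merge
    partitions exactly later.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (queries, keys)
    m = scores.max(axis=-1, keepdims=True)    # numerical stability
    p = np.exp(scores - m)
    s = p.sum(axis=-1, keepdims=True)
    return (p / s) @ v, m + np.log(s)         # output, lse

def merged_attention(q, k_gpu, v_gpu, k_cpu, v_cpu):
    """Merge per-partition results as if attention ran over all keys.

    Conceptually, the "GPU" partition holds recent tokens in device
    memory while the "CPU" partition holds the offloaded prefix
    (e.g. in CXL-attached memory); each side computes locally.
    """
    o_g, l_g = partial_attention(q, k_gpu, v_gpu)
    o_c, l_c = partial_attention(q, k_cpu, v_cpu)
    m = np.maximum(l_g, l_c)
    w_g, w_c = np.exp(l_g - m), np.exp(l_c - m)
    # Weighted combination recovers softmax over the full key set.
    return (w_g * o_g + w_c * o_c) / (w_g + w_c)
```

Because the merge is exact, the split is purely a systems decision: the output matches attention over the undivided KV cache regardless of where the partition boundary falls.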