Section 01
[Introduction] Mooncake: Core Analysis of the High-Performance LLM Inference Service Architecture Behind Kimi
Mooncake is the inference serving platform built by Moonshot AI for its flagship large language model service, Kimi. At its core is a KVCache-centric disaggregated architecture: the Prefill and Decode clusters are separated, and KVCache is moved between them by the Transfer Engine, which supports multiple transport protocols including RDMA, CXL, and NVMe-oF. The platform integrates with mainstream inference frameworks such as vLLM, SGLang, and TensorRT-LLM. Its key components have been open-sourced, and the accompanying paper won the Best Paper Award at the FAST conference, making Mooncake an important reference for LLM inference infrastructure.
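To make the disaggregated design concrete, the sketch below models the handoff: a prefill worker produces KVCache for a request, a shared store stands in for the Transfer Engine's job of moving that cache across nodes, and a decode worker pulls it to generate tokens. This is a toy illustration under assumed names (`PrefillWorker`, `DecodeWorker`, `KVCacheStore`, `KVBlock` are all hypothetical, not Mooncake's actual API), with arithmetic standing in for real attention and sampling.

```python
from dataclasses import dataclass

# Illustrative sketch only. All class names here are hypothetical and do
# not correspond to Mooncake's real interfaces; the point is the shape of
# the prefill -> transfer -> decode pipeline, not the implementation.

@dataclass
class KVBlock:
    """One layer's key/value tensors, stood in for by plain lists."""
    keys: list
    values: list

class KVCacheStore:
    """Stands in for the Transfer Engine's role: moving KVCache from the
    prefill cluster to the decode cluster (over RDMA/CXL/NVMe-oF in the
    real system)."""
    def __init__(self):
        self._blocks = {}

    def put(self, request_id, blocks):
        self._blocks[request_id] = blocks

    def get(self, request_id):
        # Ownership transfers to the decode side, so pop rather than read.
        return self._blocks.pop(request_id)

class PrefillWorker:
    """Processes the full prompt once, materializing its KVCache."""
    def __init__(self, store):
        self.store = store

    def prefill(self, request_id, prompt_tokens):
        # Fake "attention": derive KV entries directly from the tokens.
        blocks = [KVBlock(keys=list(prompt_tokens),
                          values=[t * 2 for t in prompt_tokens])]
        self.store.put(request_id, blocks)
        return len(prompt_tokens)

class DecodeWorker:
    """Pulls the transferred KVCache and generates tokens step by step."""
    def __init__(self, store):
        self.store = store

    def decode(self, request_id, max_new_tokens):
        blocks = self.store.get(request_id)  # handoff from prefill cluster
        out = []
        for _ in range(max_new_tokens):
            nxt = sum(blocks[0].values) % 100  # stand-in for sampling
            out.append(nxt)
            # Extend the cache with the newly generated token.
            blocks[0].keys.append(nxt)
            blocks[0].values.append(nxt * 2)
        return out

store = KVCacheStore()
prefill = PrefillWorker(store)
decode = DecodeWorker(store)
prefill.prefill("req-1", [1, 2, 3])
tokens = decode.decode("req-1", max_new_tokens=2)
print(tokens)  # → [12, 36]
```

The separation matters because prefill is compute-bound (one large batched pass over the prompt) while decode is memory-bandwidth-bound (one token at a time against a growing cache), so the two phases scale best on independently sized clusters.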