Zing Forum

Reading

Mooncake: In-Depth Analysis of the High-Performance LLM Inference Service Architecture Behind Kimi

Mooncake, Moonshot AI's open-source serving platform behind the Kimi service, adopts a KVCache-centric decoupled architecture. It decouples the Prefill and Decode clusters via the Transfer Engine, supports multiple transport protocols such as RDMA, CXL, and NVMe-oF, and has been integrated with mainstream inference frameworks such as vLLM, SGLang, and TensorRT-LLM.

Tags: LLM Inference, KVCache, Mooncake, Disaggregated Architecture, Transfer Engine, RDMA, Prefill-Decode, vLLM, SGLang, Moonshot AI
Published 2026-04-30 16:12 · Recent activity 2026-04-30 16:25 · Estimated read: 6 min

Section 01

[Introduction] Mooncake: Core Analysis of the High-Performance LLM Inference Service Architecture Behind Kimi

Mooncake is the inference service platform built by Moonshot AI for its flagship large language model service Kimi. At its core it adopts a KVCache-centric decoupled architecture: Prefill and Decode clusters are decoupled via the Transfer Engine, which supports multiple transport protocols such as RDMA, CXL, and NVMe-oF, and mainstream inference frameworks such as vLLM, SGLang, and TensorRT-LLM have been integrated. The platform has open-sourced its key components and won the Best Paper Award at the FAST conference, making it an important reference for LLM inference infrastructure.


Section 02

Project Background and Open-Source Significance

Mooncake is the inference platform Moonshot AI built for Kimi. The team released a technical report in June 2024, open-sourced the core Transfer Engine component in November 2024, and open-sourced Mooncake Store in March 2025. The Mooncake paper won the Best Paper Award at the FAST conference in February 2025, and the project joined the PyTorch ecosystem the same year as an officially supported inference acceleration component.


Section 03

Core Architecture: KVCache-Centric Decoupled Design

Mooncake innovatively adopts a KVCache-centric decoupled architecture:

  1. Prefill-Decode decoupling: prompt processing (compute-intensive) and token generation (memory-bandwidth-bound) are deployed on separate GPU clusters, enabling resource specialization, independent scaling, and flexible scheduling;
  2. Decoupled KVCache pool: CPU DRAM and SSD are used to build tiered storage (hot data in GPU memory, warm data in DRAM, cold data on SSD), supporting cache reuse across requests and elastic scaling; a minimal sketch of the tiered lookup follows this list.
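
To make the tiered design concrete, below is a minimal sketch assuming a simple prefix-hash keyed pool; the class and method names are hypothetical and do not come from the Mooncake codebase. It only illustrates the probe-fastest-first, promote-on-hit behavior described above.

```python
# Minimal sketch (not Mooncake's actual code) of a tiered KVCache pool:
# hot blocks in GPU memory, warm in DRAM, cold on SSD, keyed by a prefix
# hash so later requests can reuse KVCache produced by earlier prefills.
import hashlib
from typing import Optional


class TieredKVCachePool:
    def __init__(self):
        # Each tier maps prefix_hash -> opaque KV block handle.
        self.gpu: dict[str, bytes] = {}   # hot
        self.dram: dict[str, bytes] = {}  # warm
        self.ssd: dict[str, bytes] = {}   # cold (stands in for blocks on disk)

    @staticmethod
    def prefix_hash(token_ids: list[int]) -> str:
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def get(self, token_ids: list[int]) -> Optional[bytes]:
        key = self.prefix_hash(token_ids)
        # Probe tiers from fastest to slowest; promote on hit.
        for tier in (self.gpu, self.dram, self.ssd):
            if key in tier:
                block = tier[key]
                self.gpu[key] = block  # promote to the hot tier
                return block
        return None  # cache miss: the Prefill cluster must recompute

    def put(self, token_ids: list[int], kv_block: bytes) -> None:
        self.gpu[self.prefix_hash(token_ids)] = kv_block
```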

Section 04

Key Components: Transfer Engine and Distributed Storage

  • Transfer Engine: a unified transfer interface supporting multiple protocols, including RDMA (up to 190 GB/s over 8×400 Gbps NICs), NVMe-oF, and CXL, with topology-aware path selection and multi-NIC bandwidth aggregation (an illustrative sketch follows this list);
  • Mooncake Store: a distributed KVCache storage engine supporting multi-replica placement, striped parallel transfers, and hierarchical storage; it has been integrated with frameworks such as SGLang HiCache and vLLM;
  • P2P Store: decentralized peer-to-peer object sharing for checkpoint transfer, supporting updates of a 1T-parameter model across thousands of GPUs in 20 seconds.
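
The sketch below illustrates the two capabilities called out for Transfer Engine above, topology-aware path selection and multi-NIC striping; the class, fields, and method names are invented for illustration and do not reflect the real Transfer Engine API.

```python
# Illustrative sketch only: shows picking transport paths near the source
# buffer's NUMA node and striping one KVCache block across several NICs
# to aggregate bandwidth. Not the real Transfer Engine interface.
from dataclasses import dataclass


@dataclass
class Path:
    protocol: str   # "rdma", "nvmeof", or "cxl"
    nic: str        # e.g. "mlx5_0" (hypothetical device name)
    numa_node: int  # used for topology-aware selection


class TransferEngineSketch:
    def __init__(self, paths: list[Path]):
        self.paths = paths

    def select_paths(self, src_numa: int) -> list[Path]:
        # Topology awareness: prefer NICs on the same NUMA node as the buffer.
        local = [p for p in self.paths if p.numa_node == src_numa]
        return local or self.paths

    def transfer(self, block: bytes, src_numa: int) -> None:
        paths = self.select_paths(src_numa)
        # Bandwidth aggregation: stripe the block across the chosen NICs.
        stripe = max(1, len(block) // len(paths))
        for i, path in enumerate(paths):
            start = i * stripe
            end = len(block) if i == len(paths) - 1 else (i + 1) * stripe
            chunk = block[start:end]
            print(f"send {len(chunk)} bytes via {path.protocol} on {path.nic}")
```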

Section 05

Performance Verification and Production Results

  • Throughput increased by up to 525% in simulated long-context scenarios;
  • Request volume under Kimi's real load was 75% higher than the baseline;
  • In July 2025, it supported the deployment of the K2 model on 128 H200 GPUs, achieving a Prefill throughput of 224k tokens/sec and a Decode throughput of 288k tokens/sec.

Section 06

Ecosystem Integration and Industry Applications

Mooncake has been integrated with mainstream inference frameworks: vLLM (the v1 engine supports PD decoupling), SGLang (HiCache backend and EPD decoupling), TensorRT-LLM (as a KVCache transmission backend), and others. It also supports elastic expert parallelism (fault detection, dynamic token routing) and a tensor-centric ecosystem (full-stack tensor processing); a rough sketch of the serving-side flow is shown below.
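
As an illustration of what PD decoupling means at the serving layer, the sketch below routes a request to a Prefill endpoint and then hands the returned KVCache reference to a Decode endpoint. The URLs and payload fields are assumptions; real integrations (vLLM v1, SGLang, TensorRT-LLM) wire this up through their own KVCache-transfer backends rather than this toy HTTP flow.

```python
# Hypothetical sketch of a PD-decoupled serving flow: a thin router sends
# each request to a Prefill worker first, then passes the KVCache reference
# to a Decode worker. Endpoints and JSON fields are made up for illustration.
import json
import urllib.request

PREFILL_URL = "http://prefill-cluster:8000/prefill"  # assumed endpoint
DECODE_URL = "http://decode-cluster:8001/decode"     # assumed endpoint


def post(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def generate(prompt: str, max_tokens: int = 256) -> str:
    # Step 1: compute-heavy prompt processing on the Prefill cluster.
    prefill = post(PREFILL_URL, {"prompt": prompt})
    # Step 2: bandwidth-bound token generation on the Decode cluster,
    # reusing the KVCache produced (and transferred) by the Prefill side.
    decode = post(DECODE_URL, {
        "kv_cache_ref": prefill["kv_cache_ref"],
        "max_tokens": max_tokens,
    })
    return decode["text"]
```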


Section 07

Summary and Future Outlook

Mooncake represents the evolution direction of LLM inference architecture. It optimizes throughput and resource utilization through decoupled design and efficient transmission components. With the integration of mainstream frameworks, it is becoming one of the de facto standards for LLM inference infrastructure, providing production-verified architectural references and open-source components for large-scale LLM services.