# Mooncake: In-Depth Analysis of the High-Performance LLM Inference Service Architecture Behind Kimi

> Moonshot AI's open-source underlying platform for the Kimi service adopts a KVCache-centric decoupled architecture. It decouples Prefill and Decode clusters via Transfer Engine, supports multiple transmission protocols such as RDMA/CXL/NVMe-oF, and has integrated mainstream inference frameworks like vLLM, SGLang, and TensorRT-LLM.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-30T08:12:42.000Z
- Last activity: 2026-04-30T08:25:36.843Z
- Popularity: 154.8
- Keywords: LLM inference, KVCache, Mooncake, disaggregated architecture, Transfer Engine, RDMA, Prefill-Decode, vLLM, SGLang, Moonshot AI
- Page URL: https://www.zingnex.cn/en/forum/thread/mooncake-kimillm
- Canonical: https://www.zingnex.cn/forum/thread/mooncake-kimillm
- Markdown source: floors_fallback

---

## [Introduction] Mooncake: Core Analysis of the High-Performance LLM Inference Service Architecture Behind Kimi

Mooncake is the inference service platform Moonshot AI built for its flagship large language model service, Kimi. At its core is a KVCache-centric decoupled architecture: Prefill and Decode clusters are decoupled via the Transfer Engine, which supports multiple transport protocols such as RDMA, CXL, and NVMe-oF, and the platform integrates mainstream inference frameworks including vLLM, SGLang, and TensorRT-LLM. Its key components have been open-sourced, and its design won the Best Paper Award at the FAST conference, making it an important reference for LLM inference infrastructure.

## Project Background and Open-Source Significance

Mooncake is the inference platform Moonshot AI built for Kimi. The team released a technical report in June 2024, open-sourced the core components of Transfer Engine in November 2024, and open-sourced Mooncake Store in March 2025. The work won the Best Paper Award at the FAST conference in February 2025, and the project joined the PyTorch ecosystem the same year as an officially supported inference acceleration component.

## Core Architecture: KVCache-Centric Decoupled Design

Mooncake innovatively adopts a KVCache-centric decoupled architecture:
1. Prefill-Decode decoupling: prompt processing (compute-bound) and token generation (memory-bandwidth-bound) run on separate GPU clusters, enabling specialized resource optimization, independent scaling, and flexible scheduling;
2. Decoupled KVCache pool: a cross-tier store built from GPU memory, CPU DRAM, and SSD (hot data in GPU memory, warm data in DRAM, cold data on SSD), supporting cache reuse across requests and elastic scaling.
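The tiered KVCache pool described above can be sketched as a simple LRU hierarchy. Everything below — the class name, tier sizes, and eviction policy — is an illustrative assumption, not Mooncake's actual API:

```python
# Hypothetical sketch of a KVCache-centric tiered pool: hot blocks in GPU
# memory, warm blocks in DRAM, cold blocks in SSD-backed storage.
from collections import OrderedDict

class TieredKVCachePool:
    """Looks up a prefix hash across tiers; promotes hits back toward GPU."""

    def __init__(self, gpu_slots=2, dram_slots=4):
        self.gpu = OrderedDict()   # hottest tier (HBM)
        self.dram = OrderedDict()  # warm tier (host memory)
        self.ssd = {}              # cold tier (e.g. NVMe-backed store)
        self.gpu_slots, self.dram_slots = gpu_slots, dram_slots

    def put(self, prefix_hash, kv_block):
        self.gpu[prefix_hash] = kv_block
        self._evict()

    def get(self, prefix_hash):
        for tier in (self.gpu, self.dram, self.ssd):
            if prefix_hash in tier:
                block = tier.pop(prefix_hash)
                self.gpu[prefix_hash] = block  # promote on reuse
                self._evict()
                return block
        return None  # cache miss: Prefill must recompute this prefix

    def _evict(self):
        while len(self.gpu) > self.gpu_slots:    # demote LRU: GPU -> DRAM
            k, v = self.gpu.popitem(last=False)
            self.dram[k] = v
        while len(self.dram) > self.dram_slots:  # demote LRU: DRAM -> SSD
            k, v = self.dram.popitem(last=False)
            self.ssd[k] = v
```

On a cache hit the block is promoted back to the GPU tier, so frequently reused prefixes stay hot while one-off prompts drain toward SSD.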

## Key Components: Transfer Engine and Distributed Storage

- **Transfer Engine**: a unified transfer interface supporting multiple protocols, including RDMA (190 GB/s over 8×400 Gbps NICs), NVMe-oF, and CXL, with topology-aware path selection and multi-NIC bandwidth aggregation;
- **Mooncake Store**: a distributed KVCache storage engine supporting multi-replica, striped parallel transfer, and hierarchical storage; it has been integrated with frameworks such as SGLang HiCache and vLLM;
- **P2P Store**: decentralized peer-to-peer object sharing for checkpoint transfer, supporting 1T-parameter model updates across thousands of GPUs in 20 seconds.
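As a rough illustration of striped parallel transmission, the sketch below splits a buffer into fixed-size chunks and assigns them round-robin across NICs, so each NIC carries a fraction of the payload concurrently. The function names and chunk size are assumptions for illustration, not Transfer Engine's real interface:

```python
# Illustrative striping plan across multiple NICs, mimicking the idea
# behind multi-NIC bandwidth aggregation. All names are hypothetical.
def stripe(buffer: bytes, num_nics: int, chunk: int = 4):
    """Split the buffer into chunks and assign them round-robin to NICs."""
    plan = {nic: [] for nic in range(num_nics)}
    for i in range(0, len(buffer), chunk):
        # each entry keeps its original offset so the receiver can place it
        plan[(i // chunk) % num_nics].append((i, buffer[i:i + chunk]))
    return plan

def reassemble(plan, total_len: int) -> bytes:
    """Receiver side: write every chunk back at its recorded offset."""
    out = bytearray(total_len)
    for chunks in plan.values():
        for offset, data in chunks:
            out[offset:offset + len(data)] = data
    return bytes(out)
```

Because every chunk carries its offset, the per-NIC streams can arrive in any order and still reconstruct the original buffer, which is what makes parallel, out-of-order delivery safe.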

## Performance Verification and Production Performance

- Throughput increased by up to 525% in simulated long-context scenarios;
- Under Kimi's real production load, it handled 75% more requests than the baseline;
- In July 2025, it supported the deployment of the K2 model on 128 H200 GPUs, achieving a Prefill throughput of 224k tokens/sec and a Decode throughput of 288k tokens/sec.
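A quick sanity check on the bandwidth figure cited earlier: 8 NICs at 400 Gbps give a 400 GB/s aggregate line-rate ceiling, so the reported 190 GB/s corresponds to roughly 48% utilization. This is a back-of-the-envelope calculation that ignores protocol overhead and duplex effects:

```python
# Sanity-check the Transfer Engine figure: 190 GB/s over 8 x 400 Gbps NICs.
line_rate_gbps = 8 * 400              # aggregate line rate in gigabits/s
line_rate_gBps = line_rate_gbps / 8   # bits -> bytes: 400 GB/s ceiling
reported_gBps = 190
efficiency = reported_gBps / line_rate_gBps
print(f"{line_rate_gBps:.0f} GB/s ceiling, {efficiency:.0%} utilized")
```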

## Ecosystem Integration and Industry Applications

Mooncake integrates with mainstream inference frameworks: vLLM (v1 supports PD decoupling), SGLang (HiCache backend and EPD decoupling), and TensorRT-LLM (as a KVCache transfer backend). It also supports elastic expert parallelism (fault detection, dynamic token routing) and a tensor-centric ecosystem (full-stack tensor processing).
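To make "dynamic token routing" under elastic expert parallelism concrete, here is a minimal hedged sketch: when an expert rank is marked failed, its tokens are re-routed to the surviving ranks. The function and its modulo-based policy are hypothetical illustrations, not Mooncake's actual implementation:

```python
# Hypothetical dynamic token routing: skip failed expert ranks and
# redistribute their tokens across the healthy ones.
def route_tokens(token_ids, num_experts, failed=frozenset()):
    """Map each token id to a healthy expert rank, skipping failed ranks."""
    alive = [e for e in range(num_experts) if e not in failed]
    if not alive:
        raise RuntimeError("no healthy experts available")
    # deterministic modulo assignment over the surviving ranks
    return {t: alive[t % len(alive)] for t in token_ids}
```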

## Summary and Future Outlook

Mooncake represents the evolution direction of LLM inference architecture. It optimizes throughput and resource utilization through decoupled design and efficient transmission components. With the integration of mainstream frameworks, it is becoming one of the de facto standards for LLM inference infrastructure, providing production-verified architectural references and open-source components for large-scale LLM services.
