Section 01
Hands-On Decoupled Inference Architecture: Guide to llm-d's 70% Throughput Improvement on AWS EKS
This article introduces an open-source project that deploys the llm-d decoupled (prefill/decode-disaggregated) inference framework on AWS EKS. By separating the prefill and decode stages into different Pods and using EFA RDMA for millisecond-level KV cache transfer between them, it achieves up to a 70% improvement in LLM inference throughput. The architecture resolves the conflicting resource profiles of the two stages in traditional co-located deployments, where prefill is compute-intensive and decode is memory-bandwidth-intensive, providing an efficient foundation for large-scale LLM inference services.
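To make the decoupled layout concrete, the sketch below shows how prefill and decode might be declared as separate Kubernetes Deployments, each sized for its own bottleneck and each requesting an EFA device for RDMA-based KV cache transfer. This is an illustrative fragment under stated assumptions, not the project's actual manifests: the names (`llmd-prefill`, `llmd-decode`), image tag, and replica counts are hypothetical placeholders, while `vpc.amazonaws.com/efa` is the resource name exposed by the AWS EFA device plugin on EKS.

```yaml
# Hypothetical sketch: separate Deployments for the two inference stages.
# Prefill scales on GPU compute; decode scales on memory bandwidth and
# batch concurrency. EFA devices enable RDMA KV-cache transfer between them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llmd-prefill          # placeholder name
spec:
  replicas: 2                 # fewer, compute-heavy workers
  selector:
    matchLabels: { app: llmd, stage: prefill }
  template:
    metadata:
      labels: { app: llmd, stage: prefill }
    spec:
      containers:
        - name: worker
          image: example.com/llm-d/worker:latest   # placeholder image
          args: ["--role=prefill"]                 # hypothetical flag
          resources:
            limits:
              nvidia.com/gpu: "1"
              vpc.amazonaws.com/efa: "1"           # EFA device for RDMA
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llmd-decode           # placeholder name
spec:
  replicas: 4                 # more replicas for bandwidth-bound decoding
  selector:
    matchLabels: { app: llmd, stage: decode }
  template:
    metadata:
      labels: { app: llmd, stage: decode }
    spec:
      containers:
        - name: worker
          image: example.com/llm-d/worker:latest   # placeholder image
          args: ["--role=decode"]                  # hypothetical flag
          resources:
            limits:
              nvidia.com/gpu: "1"
              vpc.amazonaws.com/efa: "1"           # EFA device for RDMA
```

Keeping the two stages in separate Deployments lets each scale independently, which is the core idea behind the throughput gain described above.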