# Hands-On Decoupled Inference Architecture: Deploying llm-d on AWS EKS for 70% Throughput Improvement

> This open-source project demonstrates how to deploy the llm-d decoupled inference framework on Amazon EKS. By separating the prefill and decode stages into different Pods and using EFA RDMA for millisecond-level KV cache transfer, it achieves up to a 70% improvement in LLM inference throughput.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-22T08:18:00.000Z
- Last activity: 2026-04-22T08:25:17.827Z
- Popularity: 159.9
- Keywords: decoupled inference, LLM inference optimization, llm-d, Kubernetes, EFA RDMA, KV cache, prefill-decode disaggregation, AWS EKS
- Page link: https://www.zingnex.cn/en/forum/thread/aws-eksllm-d70
- Canonical: https://www.zingnex.cn/forum/thread/aws-eksllm-d70
- Markdown source: floors_fallback

---

## Hands-On Decoupled Inference Architecture: Guide to llm-d's 70% Throughput Improvement on AWS EKS

This article introduces an open-source project that deploys the llm-d decoupled inference framework on AWS EKS. By separating the prefill and decode stages into different Pods and using EFA RDMA for millisecond-level KV cache transfer, it achieves up to a 70% improvement in LLM inference throughput. This architecture resolves the conflicting resource requirements of the prefill (compute-intensive) and decode (memory-bandwidth-intensive) stages in traditional colocated deployments, providing an efficient foundation for large-scale LLM inference services.

## Background and Core Principles of Decoupled Inference

LLM inference consists of two core stages: the prefill stage (compute-intensive, processing input prompts to generate KV cache) and the decode stage (memory bandwidth-intensive, generating output tokens one by one). Traditional deployments place both stages on the same GPU, leading to inefficiency due to different hardware requirements. The core of decoupled inference is to allocate the two stages to different hardware nodes, allowing each node to focus on its specialized tasks and improve overall efficiency.
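The contrast between the two stages can be sketched with a toy single-head attention in NumPy. The dimensions, weights, and "model" below are illustrative inventions, not llm-d internals: prefill is one large batched matmul over the whole prompt, while each decode step re-reads the entire growing KV cache to produce a single token.

```python
import numpy as np

D = 64          # head dimension (arbitrary for this toy)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def prefill(prompt_embeds):
    """Compute-bound: one large matmul over the whole prompt,
    producing the KV cache in a single pass."""
    K = prompt_embeds @ Wk
    V = prompt_embeds @ Wv
    return K, V  # this cache is what a decoupled setup ships to the decode node

def decode_step(x, K, V):
    """Memory-bandwidth-bound: one new token attends over the entire
    cached K/V, so the whole cache is re-read every step."""
    q = x @ Wq
    scores = q @ K.T / np.sqrt(D)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    out = probs @ V
    # append this token's K/V: the cache grows by one row per step
    K = np.vstack([K, x @ Wk])
    V = np.vstack([V, x @ Wv])
    return out, K, V

prompt = rng.standard_normal((128, D))   # 128-token prompt
K, V = prefill(prompt)                   # one big pass over 128 tokens
x = rng.standard_normal(D)
for _ in range(4):                       # decode 4 tokens, one at a time
    x, K, V = decode_step(x, K, V)
print(K.shape)  # -> (132, 64): cache grew from 128 to 132 rows
```

The arithmetic intensity of the two functions differs sharply: `prefill` is a dense 128-row matmul, while `decode_step` does tiny matmuls against an ever-larger cache, which is exactly why the two stages favor different hardware.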

## Design and Architecture of the llm-d Framework

llm-d is a Kubernetes-native distributed LLM inference framework whose core idea is to decouple the inference pipeline into independent microservices. In the reference architecture, 2 prefill Pods (tensor parallelism TP=4) and 1 decode Pod (TP=4) are deployed, and the KV cache is transferred over a high-speed network. The flexibility of this architecture lies in the ability to scale prefill and decode resources independently based on request characteristics (e.g., long prompts with short responses, or vice versa).
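The independent-scaling argument can be made concrete with a back-of-the-envelope sizing function. The per-Pod token rates below are invented placeholders, not measured llm-d numbers; the point is only that the prefill:decode Pod ratio follows from the prompt/response length mix.

```python
import math

def pod_ratio(avg_prompt_tokens, avg_output_tokens,
              prefill_tok_per_s=50_000, decode_tok_per_s=2_000):
    """Return (prefill_pods, decode_pods) sized so neither stage is the
    bottleneck. Rates are hypothetical per-Pod throughputs."""
    prefill_load = avg_prompt_tokens / prefill_tok_per_s   # GPU-seconds/request
    decode_load = avg_output_tokens / decode_tok_per_s
    base = min(prefill_load, decode_load)                  # lighter stage = 1 Pod
    return (math.ceil(prefill_load / base), math.ceil(decode_load / base))

# Long prompts, short answers -> prefill-heavy, like the 2:1 reference split
print(pod_ratio(8_000, 200))    # -> (2, 1)
# Short prompts, longer answers -> decode-heavy
print(pod_ratio(1_000, 100))    # -> (1, 3)
```

In a colocated deployment this knob does not exist: every GPU carries both workloads regardless of the traffic mix.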

## EFA RDMA: Key to High-Speed KV Cache Transfer

The main challenge of decoupled inference is the latency of transferring the KV cache across nodes. The project uses the RDMA capability of AWS EFA, allowing GPUs to write directly to remote memory, bypassing the OS kernel and the TCP/IP stack. Technically, it uses the NIXL library over the libfabric protocol. Measurements show KV transfer latency of about 2 ms and throughput above 1 GB/s. Using p5.48xlarge instances (with 32 EFA interfaces set to efa-only mode) and placing them in a cluster placement group further reduces network latency.
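Some rough arithmetic shows why the aggregate EFA bandwidth matters. The model shape below is a hypothetical 70B-class configuration with grouped-query attention, not a spec from the project; the 3200 Gbps figure is the p5.48xlarge aggregate network bandwidth across its 32 EFA interfaces.

```python
# Back-of-the-envelope KV-cache transfer math (hypothetical model config).
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # fp16/bf16
    # K and V (factor 2), per layer, per token: n_kv_heads * head_dim elements
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

size = kv_cache_bytes(seq_len=4096)           # cache for a 4k-token prompt
print(f"{size / 2**20:.0f} MiB")              # -> 1280 MiB
print(f"{size / 1e9 * 1e3:.0f} ms at 1 GB/s (single stream)")   # -> 1342 ms
print(f"{size / 400e9 * 1e3:.1f} ms at 3200 Gbps aggregate")    # -> 3.4 ms
```

A full 4k-token cache for a model of this size is over a gigabyte, so millisecond-level transfers are only plausible when the transfer is spread across the aggregate EFA bandwidth (and pipelined per layer) rather than pushed through a single stream.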

## Detailed Explanation of the Complete Infrastructure Architecture

The project provides Terraform configurations for one-click infrastructure deployment:
- Network layer: A VPC across 4 availability zones, equipped with NAT gateways to provide network access for private GPU nodes; EFA-specific security groups are configured with self-referencing inbound and outbound rules to ensure RDMA traffic communication.
- Compute layer: System nodes use m5.2xlarge instances to run the Istio gateway, monitoring components, and the EPP router; GPU nodes use p5.48xlarge instances, each with 8 H100 GPUs and 32 EFA interfaces, with a custom launch template configuring a 500 GB gp3 storage volume.
- Service mesh: Istio acts as the service mesh to handle traffic routing and load balancing; Gateway API in conjunction with Inference Extension CRDs implements inference-aware traffic management.
- Intelligent routing: The EPP (Endpoint Picker) component implements cache-aware request routing, identifying decode Pods holding KV cache for specific requests to maximize cache reuse rate.
- Observability: Prometheus and Grafana provide full vLLM metric monitoring capabilities.
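The cache-aware routing idea behind EPP can be sketched in a few lines. This is a deliberate simplification, not the actual EPP algorithm: requests are keyed by a hash of their leading prompt chunk, so requests sharing a system prompt land on the same decode Pod and hit its KV cache.

```python
import hashlib

# Hypothetical Pod names; a real deployment would discover these endpoints.
DECODE_PODS = ["decode-0", "decode-1", "decode-2"]

def pick_pod(prompt, prefix_tokens=64):
    """Route by a hash of the leading chunk only, so requests that share
    a system prompt map to the same Pod even when their user turns differ."""
    prefix = prompt.split()[:prefix_tokens]
    h = hashlib.sha256(" ".join(prefix).encode()).hexdigest()
    return DECODE_PODS[int(h, 16) % len(DECODE_PODS)]

system = "You are a helpful assistant. " * 20   # long shared system prompt
a = pick_pod(system + "Summarise this report.")
b = pick_pod(system + "Translate this paragraph.")
print(a == b)  # -> True: shared prefix, same Pod, cache reuse
```

Real inference-aware routers also weigh queue depth and actual cache occupancy, but prefix affinity is the property that drives cache reuse.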

## Performance Results and Reasons for Improvement

According to project documentation and AWS blog data, in tests with 128 concurrent requests, the decoupled architecture achieves approximately a 70% throughput improvement over standard vLLM deployments. Reasons for the improvement:
1. Resources for the prefill and decode stages no longer interfere with each other (in traditional deployments, long prefill requests may block the decoding of other requests on the same GPU);
2. EPP's cache-aware routing reduces unnecessary KV cache recomputation (requests sharing a system prompt are routed to the same decode Pod to reuse its cache).

Actual deployment logs show a clear division of labor: the prefill Pod's generation throughput is close to 0 (about 0.1 tokens/s) because it focuses on prompt processing, while the decode Pod focuses on token generation.
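Reason 1 is essentially head-of-line blocking, which a toy timeline makes visible. All durations below are invented for illustration, not benchmark numbers.

```python
# Toy timeline: one long prefill (P) arrives while a decode stream (D)
# of 100 steps is in flight. Durations are hypothetical.
prefill_ms = 400                 # long-prompt prefill occupies the GPU
decode_steps, step_ms = 100, 10  # 100 output tokens at 10 ms each

# Colocated: the prefill runs first and delays every decode step behind it.
colocated_finish = prefill_ms + decode_steps * step_ms

# Disaggregated: decode runs on its own GPU, unaffected by the prefill.
disaggregated_finish = decode_steps * step_ms

print(colocated_finish, disaggregated_finish)  # -> 1400 1000
```

The same blocking effect compounds under high concurrency, where many long prefills interleave with many decode streams, which is consistent with the gap widening at 128 concurrent requests.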

## Deployment Key Points and Applicable Scenarios

The deployment process has three stages:
1. Terraform creates the EKS cluster and infrastructure (about 20-25 minutes);
2. Configure the HuggingFace access token and namespaces;
3. Helmfile deploys the llm-d components.

Key details: EFA interfaces must be set to efa-only mode (required for p5.48xlarge instances); GPU nodes are placed in a cluster placement group to ensure physical proximity; security groups need self-referencing rules to allow RDMA traffic.

Applicable scenarios: high-concurrency online inference services, workloads with large variation in prompt length, production environments needing fine-grained resource management, and high-throughput batch inference.

Limitations: the architecture adds system complexity, requires additional EFA infrastructure, and demands solid Kubernetes operations experience; for low-concurrency or simple request patterns, a standard deployment is more economical.

## Value and Summary of Decoupled Inference

Decoupled inference represents an important direction for LLM inference optimization. By splitting compute-intensive and memory-bandwidth-intensive workloads and using high-speed interconnects to eliminate communication bottlenecks, it delivers significant performance improvements for large-scale inference services. This open-source project provides a complete reference implementation with detailed deployment documentation, making it a valuable practical reference for teams building or optimizing LLM inference infrastructure.
