Section 01
Hands-On Decoupled Inference Architecture: Guide to llm-d's 70% Throughput Improvement on AWS EKS
This article introduces an open-source project that deploys the llm-d decoupled (prefill/decode-disaggregated) inference framework on AWS EKS. By separating the prefill and decode stages into different Pods and using EFA RDMA for millisecond-level KV cache transfer between them, it achieves up to a 70% improvement in LLM inference throughput. The architecture resolves the conflicting resource profiles of the two stages in traditional co-located deployments, where prefill is compute-intensive and decode is memory-bandwidth-intensive, providing an efficient foundation for large-scale LLM inference services.
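To make the decoupled layout concrete, the sketch below shows how prefill and decode might be declared as separate Kubernetes Deployments, each sized for its own bottleneck and each requesting an EFA device for RDMA-based KV cache transfer. This is an illustrative fragment under stated assumptions, not the project's actual manifests: the names (`llmd-prefill`, `llmd-decode`), image tag, and replica counts are hypothetical placeholders, while `vpc.amazonaws.com/efa` is the resource name exposed by the AWS EFA device plugin on EKS.

```yaml
# Hypothetical sketch: separate Deployments for the two inference stages.
# Prefill scales on GPU compute; decode scales on memory bandwidth and
# batch concurrency. EFA devices enable RDMA KV-cache transfer between them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llmd-prefill          # placeholder name
spec:
  replicas: 2                 # fewer, compute-heavy workers
  selector:
    matchLabels: { app: llmd, stage: prefill }
  template:
    metadata:
      labels: { app: llmd, stage: prefill }
    spec:
      containers:
        - name: worker
          image: example.com/llm-d/worker:latest   # placeholder image
          args: ["--role=prefill"]                 # hypothetical flag
          resources:
            limits:
              nvidia.com/gpu: "1"
              vpc.amazonaws.com/efa: "1"           # EFA device for RDMA
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llmd-decode           # placeholder name
spec:
  replicas: 4                 # more replicas for bandwidth-bound decoding
  selector:
    matchLabels: { app: llmd, stage: decode }
  template:
    metadata:
      labels: { app: llmd, stage: decode }
    spec:
      containers:
        - name: worker
          image: example.com/llm-d/worker:latest   # placeholder image
          args: ["--role=decode"]                  # hypothetical flag
          resources:
            limits:
              nvidia.com/gpu: "1"
              vpc.amazonaws.com/efa: "1"           # EFA device for RDMA
```

Keeping the two stages in separate Deployments lets each scale independently, which is the core idea behind the throughput gain described above.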