# Practical Guide to Deploying Streaming LLM Inference Services on AWS EKS

> A complete Terraform infrastructure project demonstrating how to deploy vLLM inference services on Amazon EKS to achieve production-grade streaming LLM inference capabilities.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-26T20:40:32.000Z
- Last activity: 2026-04-26T20:52:11.977Z
- Heat: 161.8
- Keywords: vLLM, EKS, AWS, Kubernetes, LLM inference, streaming generation, Terraform, cloud-native, deployment practice
- Page URL: https://www.zingnex.cn/en/forum/thread/aws-eksllm
- Canonical: https://www.zingnex.cn/forum/thread/aws-eksllm

---


## Project Overview

As Large Language Models (LLMs) see wider adoption in enterprise applications, deploying inference services efficiently in cloud-native environments has become a key challenge. The open-source `vllm-on-eks` project by Nicolas-Richard provides a complete solution, demonstrating how to deploy vLLM on Amazon Elastic Kubernetes Service (EKS) to achieve production-grade streaming LLM inference capabilities.

As a supporting code repository, this project complements the blog post *Streaming LLM inference on EKS* and offers readers a complete practical path from infrastructure setup to application deployment.

## vLLM: High-Performance Inference Engine

vLLM is an open-source LLM inference and serving engine developed by researchers at the University of California, Berkeley. Its core innovations include:

- **PagedAttention Algorithm**: Manages the attention key-value (KV) cache in fixed-size pages, significantly reducing GPU memory fragmentation and improving throughput
- **Continuous Batching**: Dynamically schedules incoming requests into in-flight batches to maximize GPU utilization
- **Streaming Generation**: Supports streaming output in Server-Sent Events (SSE) format to improve perceived latency (a sample of the wire format follows this list)
- **Multi-Model Support**: Compatible with mainstream model architectures in the Hugging Face ecosystem
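
Concretely, vLLM's OpenAI-compatible server streams each completion as a sequence of `data:` lines carrying JSON deltas, terminated by `data: [DONE]`. An abridged illustration of the wire format (field values shortened for readability):

```text
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"}}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo"}}]}

data: [DONE]
```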

## Amazon EKS: Managed Kubernetes Service

Amazon EKS is a managed Kubernetes service provided by AWS, offering enterprise-grade container orchestration with:

- **Highly Available Control Plane**: Automatically distributed across multiple Availability Zones
- **Security Integration**: Deeply integrated with AWS IAM, VPC, and security groups
- **Elastic Scaling**: Supports Cluster Autoscaler and Karpenter for automatic node scaling
- **GPU Instance Support**: Provides high-performance GPU instance types like P4d and P5

## Layered Infrastructure Architecture

The project uses a clear layered architecture, dividing the infrastructure into two Terraform subprojects:

### 1. EKS Foundation Layer (infra/eks-foundation)

This layer is responsible for building and configuring the EKS cluster itself, including:

- **VPC Network**: Configures private subnets, public subnets, and NAT gateways
- **EKS Control Plane**: Creates and manages the Kubernetes cluster
- **Node Group Configuration**: Sets up managed node groups with GPU instance types (sketched after this list)
- **Core Add-ons**: Deploys essential components like CoreDNS, kube-proxy, and VPC CNI
- **IAM Roles and Permissions**: Configures IAM roles for the cluster and nodes
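
As a rough sketch of what the node-group portion of this layer can look like in Terraform (instance types, sizes, and the referenced cluster and IAM resources below are illustrative assumptions, not the project's actual configuration):

```hcl
# Illustrative sketch only: a managed node group backed by GPU instances.
# aws_eks_cluster.this, aws_iam_role.node, and module.vpc are assumed to exist.
resource "aws_eks_node_group" "gpu" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "gpu-inference"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = module.vpc.private_subnets

  # GPU-enabled EKS-optimized AMI; pick an instance type that fits the model.
  ami_type       = "AL2_x86_64_GPU"
  instance_types = ["g5.2xlarge"] # placeholder; P4d/P5 for larger models

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 3
  }

  # Taint GPU nodes so only pods that tolerate it (e.g. vLLM) are scheduled here.
  taint {
    key    = "nvidia.com/gpu"
    value  = "present"
    effect = "NO_SCHEDULE"
  }
}
```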

### 2. Platform Application Layer (infra/platform-apps)

This layer deploys specific application components on top of the EKS cluster:

- **vLLM Service Deployment**: Deploys vLLM inference services via Helm charts (see the sketch after this list)
- **FastAPI Gateway**: Builds and deploys a custom API gateway service
- **Load Balancer**: Configures AWS Application Load Balancer
- **Auto-Scaling**: Sets up Horizontal Pod Autoscaler (HPA) for pod-level elasticity
- **Monitoring and Logging**: Integrates CloudWatch or Prometheus for observability
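
A minimal sketch of how this layer can consume the foundation layer's outputs and install vLLM through Terraform's Helm provider. The remote-state backend, output names, and chart coordinates are assumptions for illustration; the project may wire these differently:

```hcl
# Illustrative sketch only: read cluster identifiers exported by infra/eks-foundation.
data "terraform_remote_state" "eks" {
  backend = "local" # assumption; an S3 backend works the same way
  config = {
    path = "../eks-foundation/terraform.tfstate"
  }
}

data "aws_eks_cluster" "this" {
  name = data.terraform_remote_state.eks.outputs.cluster_name
}

data "aws_eks_cluster_auth" "this" {
  name = data.terraform_remote_state.eks.outputs.cluster_name
}

# Helm provider 2.x block syntax.
provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.this.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.this.token
  }
}

# Repository, chart name, and values are placeholders, not a confirmed upstream chart.
resource "helm_release" "vllm" {
  name             = "vllm"
  repository       = "https://example.org/helm-charts"
  chart            = "vllm"
  namespace        = "inference"
  create_namespace = true

  set {
    name  = "model" # hypothetical chart value
    value = "mistralai/Mistral-7B-Instruct-v0.2"
  }
}
```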

## Container Image Management Strategy

The project uses a change-driven image build-and-push strategy:

1. **ECR Repository**: Creates a private image repository in AWS Elastic Container Registry
2. **Content Hash Trigger**: Tracks content-hash changes in the FastAPI gateway code via the `terraform_data` resource
3. **Automatic Build and Push**: Automatically triggers image rebuild and push when code changes

This design ensures consistency between image versions and code versions while avoiding unnecessary repeated builds.
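
A minimal Terraform sketch of this pattern (the directory layout, image tag, and ECR resource name are placeholders, and it assumes Docker is already authenticated against ECR):

```hcl
# Illustrative sketch only: rebuild and push the gateway image when sources change.
locals {
  # Fold every file under the gateway source tree into one content hash.
  gateway_hash = sha1(join("", [
    for f in sort(fileset("${path.module}/gateway", "**")) :
    filesha1("${path.module}/gateway/${f}")
  ]))
}

resource "terraform_data" "gateway_image" {
  # Replacing this resource re-runs the provisioner whenever the hash changes.
  triggers_replace = local.gateway_hash

  provisioner "local-exec" {
    command = <<-EOT
      docker build -t ${aws_ecr_repository.gateway.repository_url}:latest ${path.module}/gateway
      docker push ${aws_ecr_repository.gateway.repository_url}:latest
    EOT
  }
}
```

Because the trigger is derived from `filesha1` over the sources, an unchanged tree yields an identical hash and `terraform apply` skips the build entirely.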

## Detailed Deployment Workflow

The project encapsulates day-to-day operations in the `Makefile` at the repository root, providing a concise command-line interface.

### Core Commands

| Command | Description |
|---------|-------------|
| `make deploy` | Full deployment: bootstrap the ECR repository, then run the complete `platform-apps` deployment |
| `make ecr-bootstrap` | Create the ECR repository only; required before the first deployment |
| `make terraform-apply` | Execute Terraform apply in `infra/platform-apps` |
| `make destroy` | Destroy `platform-apps` resources (retains EKS cluster) |
| `make gateway-url` | Output the public NLB URL of the gateway |
| `make gateway-token` | Output the access token |
| `make gateway-info` | Output both URL and token |
| `make gateway-test` | Streaming chat completion test; output raw SSE blocks (for debugging) |
| `make gateway-chat` | Streaming chat completion; output only assistant text to stdout |
