Zing Forum


Practical Guide to Deploying Streaming LLM Inference Services on AWS EKS

A complete Terraform infrastructure project demonstrating how to deploy vLLM inference services on Amazon EKS to achieve production-grade streaming LLM inference capabilities.

Tags: vLLM · EKS · AWS · Kubernetes · LLM Inference · Streaming Generation · Terraform · Cloud Native · Deployment Practice
Published 2026-04-27 04:40 · Recent activity 2026-04-27 04:52 · Estimated read: 8 min

Section 01

Introduction

A complete Terraform infrastructure project demonstrating how to deploy vLLM inference services on Amazon EKS to achieve production-grade streaming LLM inference capabilities.


Section 02

Project Overview

As Large Language Models (LLMs) see wider adoption in enterprise applications, deploying inference services efficiently in cloud-native environments has become a key challenge. The open-source vllm-on-eks project by Nicolas-Richard provides a complete solution, demonstrating how to deploy vLLM on Amazon Elastic Kubernetes Service (EKS) to achieve production-grade streaming LLM inference capabilities.

As a supporting code repository, this project complements the blog post Streaming LLM inference on EKS and offers readers a complete practical path from infrastructure setup to application deployment.


Section 03

vLLM: High-Performance Inference Engine

vLLM is an open-source LLM inference and serving engine developed by the research team at the University of California, Berkeley. Its core innovations include:

  • PagedAttention Algorithm: By managing attention key-value caches with paging, it significantly reduces GPU memory fragmentation and improves throughput
  • Continuous Batching: Dynamically schedules requests to maximize GPU utilization
  • Streaming Generation: Supports streaming output in Server-Sent Events (SSE) format to enhance user experience
  • Multi-Model Support: Compatible with mainstream model architectures in the Hugging Face ecosystem
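The streaming output mentioned above arrives as Server-Sent Events: each event is a `data:` line carrying a JSON chunk, terminated by a `[DONE]` sentinel in the OpenAI-compatible format. A minimal sketch of parsing such a stream (the sample payloads are illustrative, not captured from a real server):

```python
import json

def parse_sse_stream(lines):
    """Yield the JSON payload of each SSE `data:` event, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI-style end-of-stream sentinel
            break
        yield json.loads(payload)

# Sample chunks shaped like OpenAI-compatible streaming responses (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
text = "".join(c["choices"][0]["delta"].get("content", "")
               for c in parse_sse_stream(sample))
print(text)  # Hello
```

In a real client the same loop would consume lines from an HTTP response read with streaming enabled, rendering tokens as they arrive instead of waiting for the full completion.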

Section 04

Amazon EKS: Managed Kubernetes Service

Amazon EKS is a managed Kubernetes service provided by AWS, offering enterprise-grade container orchestration with:

  • Highly Available Control Plane: Automatically distributed across multiple Availability Zones
  • Security Integration: Deeply integrated with AWS IAM, VPC, and security groups
  • Elastic Scaling: Supports Cluster Autoscaler and Karpenter for automatic node scaling
  • GPU Instance Support: Provides high-performance GPU instance types like P4d and P5

Section 05

Layered Infrastructure Architecture

The project uses a clear layered architecture, dividing the infrastructure into two Terraform subprojects:

1. EKS Foundation Layer (infra/eks-foundation)

This layer is responsible for building and configuring the EKS cluster itself, including:

  • VPC Network: Configures private subnets, public subnets, and NAT gateways
  • EKS Control Plane: Creates and manages the Kubernetes cluster
  • Node Group Configuration: Sets up managed node groups with GPU instance types
  • Core Add-ons: Deploys essential components like CoreDNS, kube-proxy, and VPC CNI
  • IAM Roles and Permissions: Configures IAM roles for the cluster and nodes

2. Platform Application Layer (infra/platform-apps)

This layer deploys specific application components on top of the EKS cluster:

  • vLLM Service Deployment: Deploys vLLM inference services via Helm charts
  • FastAPI Gateway: Builds and deploys a custom API gateway service
  • Load Balancer: Configures AWS Application Load Balancer
  • Auto-Scaling: Sets up Horizontal Pod Autoscaler (HPA) for pod-level elasticity
  • Monitoring and Logging: Integrates CloudWatch or Prometheus for observability
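The HPA mentioned above scales pods with a simple proportional rule: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. A sketch with illustrative numbers (the utilization values and bounds are hypothetical, not taken from this project's manifests):

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10):
    """Kubernetes HPA core formula: scale proportionally to the ratio of the
    observed metric to its target, then clamp to [min, max] replicas."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# E.g. 3 vLLM pods averaging 90% utilization against a 60% target:
print(hpa_desired_replicas(3, 90, 60))  # 5 -> scale out
# Later, 5 pods averaging 30% against the same target:
print(hpa_desired_replicas(5, 30, 60))  # 3 -> scale back in
```

Because the rule is purely ratio-based, choosing the target metric (CPU, GPU utilization, or a custom request-rate metric) is the main tuning decision for inference workloads.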

Section 06

Container Image Management Strategy

The project adopts an intelligent image building and pushing strategy:

  1. ECR Repository: Creates a private image repository in AWS Elastic Container Registry
  2. Content Hash Trigger: Listens for content hash changes in FastAPI gateway code via the terraform_data resource
  3. Automatic Build and Push: Automatically triggers image rebuild and push when code changes

This design ensures consistency between image versions and code versions while avoiding unnecessary repeated builds.
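The hash-trigger idea can be sketched in Python as a simplified stand-in for what the `terraform_data` resource does (the function names are ours, not the project's):

```python
import hashlib
from pathlib import Path

def content_hash(directory):
    """Stable SHA-256 over a source tree: hash relative paths plus file
    contents in sorted order, so the digest changes iff any file changes."""
    h = hashlib.sha256()
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(directory)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()

def needs_rebuild(directory, last_hash):
    """Rebuild and push the image only when the digest has changed."""
    return content_hash(directory) != last_hash
```

Hashing paths as well as contents means renames also trigger a rebuild, while an unchanged tree always produces the same digest and is skipped.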


Section 07

Detailed Deployment Workflow

The project encapsulates daily operations through the Makefile in the root directory, providing a concise command-line interface:


Section 08

Core Commands

  • make deploy: Full deployment; bootstraps the ECR repository, then applies the complete platform-apps deployment
  • make ecr-bootstrap: Creates the ECR repository only; required before the first deployment
  • make terraform-apply: Runs Terraform apply in infra/platform-apps
  • make destroy: Destroys platform-apps resources (the EKS cluster is retained)
  • make gateway-url: Prints the public NLB URL of the gateway
  • make gateway-token: Prints the access token
  • make gateway-info: Prints both the URL and the token
  • make gateway-test: Runs a streaming chat-completion test, printing raw SSE blocks (for debugging)
  • make gateway-chat: Runs a streaming chat completion, printing only the assistant text to stdout