Deploying Phi-3 Mini on AWS: Building a Scalable LLM Inference Service with ECS and Terraform

A complete cloud-native solution demonstrating how to deploy the Microsoft Phi-3 Mini 3.8B model on AWS using ECS, Terraform, and HuggingFace TGI, enabling auto-scaling and zero-cost idle mode.

Phi-3 · AWS ECS · Terraform · HuggingFace TGI · Cloud-Native · Auto-Scaling · AWQ Quantization · Server-Sent Events · LLM Inference Service
Published 2026-05-15 16:14 · Recent activity 2026-05-15 16:19 · Estimated read 6 min

Section 01

[Main Floor] Deploying Phi-3 Mini on AWS: A Guide to a Cloud-Native, Scalable LLM Inference Service

phi3-cloud-deployment is an open-source, cloud-native LLM inference deployment solution focused on running the Microsoft Phi-3 Mini 3.8B model on AWS at low cost and with high scalability. Built on the Infrastructure as Code (IaC) approach, it automates deployment with Terraform. Core features include the HuggingFace TGI inference framework, AWQ 4-bit quantization (about 2.3 GB of VRAM), Server-Sent Events (SSE) streaming responses, ECS auto-scaling (0-3 instances), and a zero-cost idle mode, giving developers and enterprises a production-grade LLM service architecture template.

Section 02

Background: Project Objectives and Design Philosophy

This project addresses the need to run LLM inference services efficiently in the cloud, with the goal of providing a low-cost, highly scalable deployment solution. Following the Infrastructure as Code (IaC) approach, it automates deployment via Terraform, avoiding the tedium and errors of manual configuration and letting users quickly stand up a production-ready LLM service architecture.

Section 03

Technical Architecture: Core Components and Layered Design

Frontend Layer

The static website is deployed on S3 + CloudFront CDN, supporting Server-Sent Events (SSE) streaming responses, allowing users to view model-generated tokens in real time.
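
As a rough illustration of this layer, a minimal Terraform sketch of the S3 + CloudFront hosting might look like the following; the bucket and resource names are assumptions, not values taken from the project:

```hcl
# Hedged sketch of the frontend hosting layer; names are illustrative.
resource "aws_s3_bucket" "site" {
  bucket = "phi3-frontend-site" # assumed bucket name
}

resource "aws_cloudfront_origin_access_identity" "site" {}

resource "aws_cloudfront_distribution" "site" {
  enabled             = true
  default_root_object = "index.html"

  origin {
    domain_name = aws_s3_bucket.site.bucket_regional_domain_name
    origin_id   = "s3-site"

    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.site.cloudfront_access_identity_path
    }
  }

  default_cache_behavior {
    target_origin_id       = "s3-site"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}
```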

Inference Service Layer

Based on the HuggingFace TGI framework, it runs the AWQ 4-bit quantized Phi-3 Mini 3.8B model (about 2.3 GB of VRAM), supporting continuous batching and streaming generation to improve throughput and user experience.
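
A hedged sketch of what the corresponding ECS task definition could look like; the image tag, model id, memory figure, and GPU requirement are assumptions rather than the project's exact values:

```hcl
# Hedged sketch of the TGI task definition; values are illustrative.
resource "aws_ecs_task_definition" "tgi" {
  family                   = "phi3-tgi"
  requires_compatibilities = ["EC2"]
  network_mode             = "awsvpc"

  container_definitions = jsonencode([{
    name   = "tgi"
    image  = "ghcr.io/huggingface/text-generation-inference:latest"
    memory = 12288 # assumed container memory reservation (MiB)

    # TGI loads an AWQ 4-bit checkpoint, keeping VRAM at roughly 2.3 GB.
    command = [
      "--model-id", "microsoft/Phi-3-mini-4k-instruct", # assumed id; an AWQ-quantized variant would be used in practice
      "--quantize", "awq",
      "--port", "8080"
    ]

    portMappings         = [{ containerPort = 8080 }]
    resourceRequirements = [{ type = "GPU", value = "1" }]
  }])
}
```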

Network and Load Balancing

Uses ALB to distribute traffic to the ECS cluster; all components are deployed in private subnets, accessing AWS services via VPC Endpoints to reduce network costs.
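
For illustration, one such interface endpoint (here, ECR's Docker registry) might be declared as follows; all names are assumptions:

```hcl
# Hedged sketch: an interface endpoint so tasks in private subnets can
# pull images from ECR without a NAT gateway. Names are illustrative.
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```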

Security Mechanisms

An nginx reverse proxy implements API Key authentication and CORS support; AWS WAF protects against common web attacks; all communication is HTTPS-encrypted; and deployment in private subnets keeps compute resources from being directly exposed.
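
A minimal sketch of the WAF piece, attaching an AWS-managed common rule set to the ALB; the resource names and rule selection are assumptions:

```hcl
# Hedged sketch: a WAFv2 web ACL with one AWS-managed rule group,
# associated with the ALB. Names are illustrative.
resource "aws_wafv2_web_acl" "api" {
  name  = "phi3-api-acl"
  scope = "REGIONAL"

  default_action {
    allow {}
  }

  rule {
    name     = "aws-common-rules"
    priority = 1

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "aws-common-rules"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "phi3-api-acl"
    sampled_requests_enabled   = true
  }
}

resource "aws_wafv2_web_acl_association" "alb" {
  resource_arn = aws_lb.api.arn
  web_acl_arn  = aws_wafv2_web_acl.api.arn
}
```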

Section 04

Cost Optimization: Auto-Scaling and Pay-as-You-Go Mechanism

An ECS Capacity Provider scales the cluster between 0 and 3 instances; when idle, the service scales down to zero and incurs no compute cost. Cost estimate: about $17 for 20 hours of active testing on On-Demand instances, about $9 on Spot instances, and nothing while idle, making the setup well suited to budget-sensitive projects.
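
A hedged sketch of that scaling setup: an Auto Scaling group bounded at 0-3 instances behind an ECS Capacity Provider with managed scaling. The names and the launch template reference are illustrative:

```hcl
# Hedged sketch: the ASG floor of 0 is what makes the idle state free.
resource "aws_autoscaling_group" "gpu" {
  name                = "phi3-gpu-asg"
  min_size            = 0 # scale to zero when idle: no compute cost
  max_size            = 3
  vpc_zone_identifier = aws_subnet.private[*].id

  launch_template {
    id      = aws_launch_template.gpu.id # assumed GPU launch template
    version = "$Latest"
  }
}

resource "aws_ecs_capacity_provider" "gpu" {
  name = "phi3-gpu"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.gpu.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 100 # keep instances tightly packed with tasks
    }
  }
}
```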

Section 05

Deployment and Usage: Process Experience and Notes

Deployment Process

  1. Clone the repository.
  2. Configure Terraform variables (see the example after this list).
  3. Initialize and apply the Terraform configuration (deploy ECR → build and push the image → deploy the application stack).
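
For step 2, a hypothetical terraform.tfvars might look like this; the variable names are illustrative assumptions, not the project's documented inputs:

```hcl
# Hypothetical variable values; names are assumptions for illustration.
aws_region   = "us-east-1"
project_name = "phi3-cloud-deployment"
api_key      = "replace-with-a-long-random-secret"
use_spot     = true # Spot roughly halves the tested compute cost
```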

Usage Experience

Enter the API Key in the frontend to interact; SSE streaming responses provide a real-time generation experience. Note that when the service has scaled down to 0, the first request triggers a cold start (about 3-5 minutes); the frontend implements an automatic retry mechanism to bridge it.

Section 06

Conclusion and Value: Highlights, Applicable Scenarios, and Recommendations

Technical Highlights

  • Combines TGI framework (production-grade stability), AWQ quantization (reduces VRAM usage), Terraform modular IaC (simplifies management), and zero-cost scaling control;
  • Clear code structure with separated modules (network, image repository, etc.), as sketched after this list; the MIT open-source license supports community improvements.
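
A minimal sketch of how a root module might wire such separated modules together; the module paths and outputs are assumptions:

```hcl
# Hedged sketch of root-module composition; paths/outputs are illustrative.
module "network" {
  source = "./modules/network"
}

module "ecr" {
  source = "./modules/ecr"
}

module "app" {
  source          = "./modules/app"
  vpc_id          = module.network.vpc_id
  private_subnets = module.network.private_subnet_ids
  repository_url  = module.ecr.repository_url
}
```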

Applicable Scenarios

  • Startups quickly building LLM services;
  • Enterprises reducing AI operation costs;
  • Development teams needing scalable architecture;
  • Developers learning cloud-native AI deployment.

Value

Provides a ready-to-use deployment solution, demonstrates a cost-effective, cloud-native way to run LLMs, and offers a solid reference implementation for AI application deployment.