Zing Forum


Building a Production-Grade Asynchronous LLM Inference Service: Architecture Design and Engineering Practice

An asynchronous ML inference platform based on FastAPI, SQS, and ECS Fargate, supporting over 500 concurrent users, achieving full decoupling between the API layer and inference layer, with complete Terraform infrastructure-as-code and observability solutions.

LLM · FastAPI · AWS · ECS Fargate · SQS · async architecture · MLOps · Terraform
Published 2026-04-08 16:12 · Recent activity 2026-04-08 16:24 · Estimated read 7 min

Section 01

Introduction / Main Floor

An asynchronous ML inference platform based on FastAPI, SQS, and ECS Fargate, supporting over 500 concurrent users, achieving full decoupling between the API layer and inference layer, with complete Terraform infrastructure-as-code and observability solutions.


Section 02

Problem Background: Why Synchronous Inference Is Not Scalable

Take a typical text classification scenario as an example: a single forward pass through DistilBERT takes roughly 100-500 milliseconds. Under a synchronous architecture, the API server must hold each request open until inference completes before it can respond. This means:

  • Each concurrent request occupies a worker thread
  • 500 concurrent users require 500 threads to be in a waiting state
  • Thread context switching overhead increases sharply
  • Memory usage grows linearly with concurrency
  • Any inference failure may cause the entire request to fail

This model hits a sharp performance cliff as load increases: rather than degrading gracefully, the system fails outright.


Section 03

Core Idea of Asynchronous Architecture

This project's solution draws on mature patterns from production systems such as AWS SageMaker Asynchronous Inference, using a two-layer architecture: a lightweight, stateless API layer and an independently scalable Worker layer.


Section 04

API Layer: Lightweight and Stateless

The API service is responsible for three things: validating requests, writing to the database, and sending queue messages. The response time is in milliseconds, completely independent of inference time. The client receives a job_id instead of the final result.
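The submit path can be sketched as follows, with DynamoDB and SQS replaced by in-memory stand-ins. The names `db` and `queue`, the status values, and the validation limit are illustrative assumptions, not the project's actual code:

```python
import json
import time
import uuid


def submit_job(text: str, db: dict, queue: list) -> str:
    """Validate the request, persist a PENDING record, enqueue the task.

    `db` and `queue` are in-memory stand-ins for DynamoDB and SQS;
    in the real service these would be boto3 clients.
    """
    if not text or len(text) > 10_000:  # basic request validation (limit assumed)
        raise ValueError("text must be 1-10000 characters")

    job_id = str(uuid.uuid4())
    # 1. Write the job record first, so a status poll never misses the job.
    db[job_id] = {"status": "PENDING", "submitted_at": time.time()}
    # 2. Enqueue the work item for the Worker layer.
    queue.append(json.dumps({"job_id": job_id, "text": text}))
    # 3. Return immediately -- the client polls with this job_id.
    return job_id
```

In the FastAPI service, this body would sit inside a POST handler that returns `{"job_id": ...}` with a 202 Accepted status, so the response time stays in milliseconds regardless of how long inference takes.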


Section 05

Worker Layer: Independently Scalable Inference Cluster

Independent Worker services pull tasks from the queue, perform actual model inference, then write the results back to the database. The number of Workers can be dynamically adjusted based on queue depth, fully decoupled from the API layer's load.

Key advantages of this architecture:

  • API latency remains stable, unaffected by fluctuations in inference time
  • Workers can scale horizontally independently
  • A single Worker failure does not block the entire system
  • Supports longer inference timeout periods
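A minimal sketch of the Worker loop, using the same in-memory stand-ins for SQS and DynamoDB; the `infer` callable and the status field names are assumptions for illustration:

```python
import json
import time


def run_worker(db: dict, queue: list, infer, max_batches: int = 1) -> None:
    """Poll the queue, run inference, write results back.

    `queue`/`db` stand in for SQS/DynamoDB; `infer` is the model's
    forward pass (hypothetical signature: text -> label).
    """
    for _ in range(max_batches):
        if not queue:                  # real code: SQS long polling instead
            time.sleep(0.1)
            continue
        msg = queue.pop(0)             # real code: ReceiveMessage + DeleteMessage
        task = json.loads(msg)
        try:
            result = infer(task["text"])
            db[task["job_id"]] = {"status": "COMPLETED", "result": result}
        except Exception as exc:
            # In the real system a failed message returns to the queue after
            # the visibility timeout; after three receives SQS moves it to
            # the DLQ rather than retrying forever.
            db[task["job_id"]] = {"status": "FAILED", "error": str(exc)}
```

Because Workers only ever touch the queue and the database, a crashed Worker loses nothing: its in-flight message simply becomes visible again for another Worker to pick up.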

Section 06

Infrastructure Components

The project uses AWS managed services to build the complete infrastructure:

Compute Layer: ECS Fargate provides a serverless container runtime environment. The API service is configured with 2 tasks (512 CPU / 1GB memory), and the Worker service supports auto-scaling between 1-10 tasks.

Queue System: SQS standard queues handle task distribution, with a Dead Letter Queue (DLQ) capturing tasks that fail three delivery attempts. Long polling spares Workers from wasted empty receives, while a 30-second visibility timeout returns unacknowledged messages to the queue for redelivery.
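The retry behaviour described above maps onto a handful of SQS queue attributes. The values below are a sketch: the ARN is made up, and in the project these settings live in Terraform rather than application code:

```python
import json

# Sketch of the SQS attributes implementing the retry behaviour described
# above.  The DLQ ARN is illustrative; the project defines it in Terraform.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:inference-dlq"  # hypothetical

task_queue_attributes = {
    # A message received 3 times without being deleted moves to the DLQ.
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": DLQ_ARN,
        "maxReceiveCount": "3",
    }),
    # Long polling: ReceiveMessage waits up to 20 s for a message to arrive.
    "ReceiveMessageWaitTimeSeconds": "20",
    # A Worker has 30 s to finish and delete the message before it reappears.
    "VisibilityTimeout": "30",
}
```

This dict is the shape boto3's `create_queue(Attributes=...)` expects, so the same values can be verified against what Terraform provisions.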

Data Storage: DynamoDB uses the on-demand billing mode (PAY_PER_REQUEST), with a 24-hour TTL to automatically clean up completed tasks, eliminating the need for capacity planning.

Load Balancing: An Application Load Balancer (ALB) distributes traffic to the API service, supporting health checks and automatic failover.

Auto-Scaling: Ladder-style step scaling (1→3→6→10) is triggered by CloudWatch queue-depth alarms, responding faster than traditional percentage-based scaling policies.
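The ladder itself reduces to a small step function from queue depth to desired Worker count. The thresholds below are illustrative assumptions, since the source does not state the alarm levels:

```python
def desired_workers(queue_depth: int) -> int:
    """Map SQS queue depth to a Worker task count on the 1-3-6-10 ladder.

    Thresholds are hypothetical; in the project they are CloudWatch alarm
    levels defined in Terraform.
    """
    if queue_depth >= 300:
        return 10
    if queue_depth >= 100:
        return 6
    if queue_depth >= 20:
        return 3
    return 1
```

Jumping straight between rungs, instead of adding a percentage of the current fleet, is what lets the system absorb a sudden burst before the queue backlog grows unbounded.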


Section 07

Observability Solution

A production system needs comprehensive monitoring. This project pairs Prometheus with Grafana, and runs Grafana Alloy as a sidecar to solve metric collection in Fargate's dynamic-IP environment:

  • The API layer exposes key metrics such as request count and latency histogram
  • The Worker layer exposes a Prometheus metrics endpoint on port 9090
  • The Alloy sidecar pushes metrics to Grafana Cloud
  • Metrics from all components are centrally stored via remote_write
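In practice the latency histogram would come from a Prometheus client library; below is a stdlib-only sketch of the cumulative-bucket (`le`) shape the API layer exposes. The bucket boundaries are illustrative, not the project's configuration:

```python
import bisect

# Upper bounds (seconds) of the histogram buckets; values are illustrative.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, float("inf")]


class LatencyHistogram:
    """Cumulative-bucket histogram matching Prometheus `le` semantics."""

    def __init__(self):
        self.counts = [0] * len(BUCKETS)   # counts[i] = observations <= BUCKETS[i]
        self.total = 0.0                   # running sum of all observations

    def observe(self, seconds: float) -> None:
        # Increment every bucket whose upper bound covers this observation:
        # this is what makes the buckets cumulative, as Prometheus expects.
        for i in range(bisect.bisect_left(BUCKETS, seconds), len(BUCKETS)):
            self.counts[i] += 1
        self.total += seconds
```

A scrape of this structure yields the `_bucket`, `_sum`, and `_count` series that Grafana dashboards and latency-percentile queries are built on.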

Section 08

Security Considerations

The project considers security at multiple levels:

  • API authentication uses HMAC signature verification, with constant-time comparison to prevent timing attacks
  • Secrets Manager centrally manages sensitive information such as API keys
  • IAM roles follow the principle of least privilege
  • Container images are hosted via ECR, with a lifecycle policy configured to retain the latest 5 versions
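The constant-time comparison mentioned above maps directly onto Python's standard library. This sketch assumes an HMAC-SHA256-over-body scheme, which the source does not spell out in detail:

```python
import hashlib
import hmac

SECRET = b"example-secret"  # in production, loaded from Secrets Manager


def sign(body: bytes) -> str:
    """HMAC-SHA256 signature over the raw request body (scheme assumed)."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


def verify(body: bytes, signature: str) -> bool:
    # compare_digest runs in time independent of where the strings differ,
    # defeating timing attacks that measure how many leading characters
    # of a guessed signature match.
    return hmac.compare_digest(sign(body), signature)
```

A naive `==` comparison short-circuits at the first mismatched byte, which is exactly the timing side-channel `hmac.compare_digest` exists to close.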