Zing Forum


Building a Production-Grade Asynchronous LLM Inference Service: Architecture Design and Engineering Practice

An asynchronous ML inference platform based on FastAPI, SQS, and ECS Fargate, supporting over 500 concurrent users, achieving full decoupling between the API layer and inference layer, with complete Terraform infrastructure-as-code and observability solutions.

LLM · FastAPI · AWS · ECS Fargate · SQS · async architecture · MLOps · Terraform
Published 2026-04-08 16:12 · Recent activity 2026-04-08 16:24 · Estimated read 7 min

Section 01

Introduction / Main Floor

An asynchronous ML inference platform based on FastAPI, SQS, and ECS Fargate, supporting over 500 concurrent users, achieving full decoupling between the API layer and inference layer, with complete Terraform infrastructure-as-code and observability solutions.


Section 02

Problem Background: Why Synchronous Inference Is Not Scalable

Take a typical text classification scenario as an example: a single forward pass through DistilBERT takes roughly 100-500 milliseconds. Under a synchronous architecture, the API server must hold each request open until inference completes before it can respond. This means:

  • Each concurrent request occupies a worker thread
  • 500 concurrent users require 500 threads to be in a waiting state
  • Thread context switching overhead increases sharply
  • Memory usage grows linearly with concurrency
  • Any inference failure may cause the entire request to fail

This model hits a sharp performance cliff as load increases: rather than degrading gracefully, the system fails outright.


Section 03

Core Idea of Asynchronous Architecture

This project's solution draws on mature patterns from production systems such as AWS SageMaker Asynchronous Inference, using a two-layer architecture: a lightweight, stateless API layer and an independently scalable Worker layer.


Section 04

API Layer: Lightweight and Stateless

The API service is responsible for three things: validating requests, writing to the database, and sending queue messages. The response time is in milliseconds, completely independent of inference time. The client receives a job_id instead of the final result.
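The submit path can be sketched as follows, with DynamoDB and SQS replaced by in-memory stand-ins. The names `db` and `queue`, the status values, and the validation limit are illustrative assumptions, not the project's actual code:

```python
import json
import time
import uuid


def submit_job(text: str, db: dict, queue: list) -> str:
    """Validate the request, persist a PENDING record, enqueue the task.

    `db` and `queue` are in-memory stand-ins for DynamoDB and SQS;
    in the real service these would be boto3 clients.
    """
    if not text or len(text) > 10_000:  # basic request validation (limit assumed)
        raise ValueError("text must be 1-10000 characters")

    job_id = str(uuid.uuid4())
    # 1. Write the job record first, so a status poll never misses the job.
    db[job_id] = {"status": "PENDING", "submitted_at": time.time()}
    # 2. Enqueue the work item for the Worker layer.
    queue.append(json.dumps({"job_id": job_id, "text": text}))
    # 3. Return immediately -- the client polls with this job_id.
    return job_id
```

In the FastAPI service, this body would sit inside a POST handler that returns `{"job_id": ...}` with a 202 Accepted status, so the response time stays in milliseconds regardless of how long inference takes.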


Section 05

Worker Layer: Independently Scalable Inference Cluster

Independent Worker services pull tasks from the queue, perform actual model inference, then write the results back to the database. The number of Workers can be dynamically adjusted based on queue depth, fully decoupled from the API layer's load.

Key advantages of this architecture:

  • API latency remains stable, unaffected by fluctuations in inference time
  • Workers can scale horizontally independently
  • A single Worker failure does not block the entire system
  • Supports longer inference timeout periods
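A minimal sketch of the Worker loop, using the same in-memory stand-ins for SQS and DynamoDB; the `infer` callable and the status field names are assumptions for illustration:

```python
import json
import time


def run_worker(db: dict, queue: list, infer, max_batches: int = 1) -> None:
    """Poll the queue, run inference, write results back.

    `queue`/`db` stand in for SQS/DynamoDB; `infer` is the model's
    forward pass (hypothetical signature: text -> label).
    """
    for _ in range(max_batches):
        if not queue:                  # real code: SQS long polling instead
            time.sleep(0.1)
            continue
        msg = queue.pop(0)             # real code: ReceiveMessage + DeleteMessage
        task = json.loads(msg)
        try:
            result = infer(task["text"])
            db[task["job_id"]] = {"status": "COMPLETED", "result": result}
        except Exception as exc:
            # In the real system a failed message returns to the queue after
            # the visibility timeout; after three receives SQS moves it to
            # the DLQ rather than retrying forever.
            db[task["job_id"]] = {"status": "FAILED", "error": str(exc)}
```

Because Workers only ever touch the queue and the database, a crashed Worker loses nothing: its in-flight message simply becomes visible again for another Worker to pick up.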

Section 06

Infrastructure Components

The project uses AWS managed services to build the complete infrastructure:

Compute Layer: ECS Fargate provides a serverless container runtime environment. The API service is configured with 2 tasks (512 CPU / 1GB memory), and the Worker service supports auto-scaling between 1-10 tasks.

Queue System: SQS standard queues handle task distribution, with a Dead Letter Queue (DLQ) capturing tasks that fail three delivery attempts. Long polling spares Workers from wasted empty receives, while a 30-second visibility timeout returns unacknowledged messages to the queue for redelivery.
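The retry behaviour described above maps onto a handful of SQS queue attributes. The values below are a sketch: the ARN is made up, and in the project these settings live in Terraform rather than application code:

```python
import json

# Sketch of the SQS attributes implementing the retry behaviour described
# above.  The DLQ ARN is illustrative; the project defines it in Terraform.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:inference-dlq"  # hypothetical

task_queue_attributes = {
    # A message received 3 times without being deleted moves to the DLQ.
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": DLQ_ARN,
        "maxReceiveCount": "3",
    }),
    # Long polling: ReceiveMessage waits up to 20 s for a message to arrive.
    "ReceiveMessageWaitTimeSeconds": "20",
    # A Worker has 30 s to finish and delete the message before it reappears.
    "VisibilityTimeout": "30",
}
```

This dict is the shape boto3's `create_queue(Attributes=...)` expects, so the same values can be verified against what Terraform provisions.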

Data Storage: DynamoDB uses the on-demand billing mode (PAY_PER_REQUEST), with a 24-hour TTL to automatically clean up completed tasks, eliminating the need for capacity planning.

Load Balancing: An Application Load Balancer (ALB) distributes traffic to the API service, supporting health checks and automatic failover.

Auto-Scaling: Ladder-style step scaling (1→3→6→10) is triggered by CloudWatch queue-depth alarms, responding faster than traditional percentage-based scaling policies.
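The ladder itself reduces to a small step function from queue depth to desired Worker count. The thresholds below are illustrative assumptions, since the source does not state the alarm levels:

```python
def desired_workers(queue_depth: int) -> int:
    """Map SQS queue depth to a Worker task count on the 1-3-6-10 ladder.

    Thresholds are hypothetical; in the project they are CloudWatch alarm
    levels defined in Terraform.
    """
    if queue_depth >= 300:
        return 10
    if queue_depth >= 100:
        return 6
    if queue_depth >= 20:
        return 3
    return 1
```

Jumping straight between rungs, instead of adding a percentage of the current fleet, is what lets the system absorb a sudden burst before the queue backlog grows unbounded.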


Section 07

Observability Solution

A production system needs comprehensive monitoring. This project pairs Prometheus with Grafana, and runs Grafana Alloy as a sidecar to solve metric collection in Fargate's dynamic-IP environment:

  • The API layer exposes key metrics such as request count and latency histogram
  • The Worker layer exposes a Prometheus metrics endpoint on port 9090
  • The Alloy sidecar pushes metrics to Grafana Cloud
  • Metrics from all components are centrally stored via remote_write
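In practice the latency histogram would come from a Prometheus client library; below is a stdlib-only sketch of the cumulative-bucket (`le`) shape the API layer exposes. The bucket boundaries are illustrative, not the project's configuration:

```python
import bisect

# Upper bounds (seconds) of the histogram buckets; values are illustrative.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, float("inf")]


class LatencyHistogram:
    """Cumulative-bucket histogram matching Prometheus `le` semantics."""

    def __init__(self):
        self.counts = [0] * len(BUCKETS)   # counts[i] = observations <= BUCKETS[i]
        self.total = 0.0                   # running sum of all observations

    def observe(self, seconds: float) -> None:
        # Increment every bucket whose upper bound covers this observation:
        # this is what makes the buckets cumulative, as Prometheus expects.
        for i in range(bisect.bisect_left(BUCKETS, seconds), len(BUCKETS)):
            self.counts[i] += 1
        self.total += seconds
```

A scrape of this structure yields the `_bucket`, `_sum`, and `_count` series that Grafana dashboards and latency-percentile queries are built on.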

Section 08

Security Considerations

The project considers security at multiple levels:

  • API authentication uses HMAC signature verification, with constant-time comparison to prevent timing attacks
  • Secrets Manager centrally manages sensitive information such as API keys
  • IAM roles follow the principle of least privilege
  • Container images are hosted via ECR, with a lifecycle policy configured to retain the latest 5 versions
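The constant-time comparison mentioned above maps directly onto Python's standard library. This sketch assumes an HMAC-SHA256-over-body scheme, which the source does not spell out in detail:

```python
import hashlib
import hmac

SECRET = b"example-secret"  # in production, loaded from Secrets Manager


def sign(body: bytes) -> str:
    """HMAC-SHA256 signature over the raw request body (scheme assumed)."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


def verify(body: bytes, signature: str) -> bool:
    # compare_digest runs in time independent of where the strings differ,
    # defeating timing attacks that measure how many leading characters
    # of a guessed signature match.
    return hmac.compare_digest(sign(body), signature)
```

A naive `==` comparison short-circuits at the first mismatched byte, which is exactly the timing side-channel `hmac.compare_digest` exists to close.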