Building a Production-Grade LLM Inference Engine: A Practical Guide to Dynamic Batching and Semantic Caching

Explore how to build a high-performance, low-latency large language model (LLM) inference service using dynamic batching, asynchronous queues, and Redis-backed semantic caching.

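Tags: LLM, inference engine, dynamic batching, semantic caching, Redis, FastAPI, production deployment, GPU optimization, vLLM, large language models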
Published 2026-06-17 01:15 | Recent activity 2026-06-17 01:20 | Estimated read 6 min

Section 01

Building a Production-Grade LLM Inference Engine: Core Solutions and Value

This article introduces an open-source project that shows how to build a high-performance, low-latency LLM inference service with dynamic batching, asynchronous queues, and Redis-backed semantic caching. The architecture draws on ideas from vLLM and TensorRT-LLM and balances latency, throughput, and resource utilization, which makes it a useful reference implementation for production-grade LLM serving.

Section 02

Background of Requirements for Production-Grade LLM Inference Engines

As LLMs are deployed more widely, handling requests one at a time quickly becomes a bottleneck under high concurrency. Production environments must serve many concurrent requests while keeping latency low, throughput high, and hardware well utilized. This project offers a complete solution with readable, extensible code, and its design borrows from mature systems such as vLLM and TensorRT-LLM.

Section 03

Layered Architecture Design of the Inference Engine

The engine adopts a three-layer architecture:

  1. FastAPI Gateway and Semantic Cache: Each request first reaches the FastAPI gateway, which embeds the input with all-MiniLM-L6-v2 and queries Redis for a semantically similar earlier request. If the similarity exceeds 0.8, the cached result is returned directly;
  2. Asynchronous Queue and Dynamic Batching: Requests that miss the cache are placed on an asyncio queue, and a batch is dispatched after a 50 ms wait or once 8 requests have been collected, whichever comes first;
  3. Model Inference and Response Routing: Each batch is sent to the model thread (GPT-Neo 1.3B or GPT-2, with automatic CUDA detection), and each result is routed back to the user who sent the request.
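
The cache lookup in step 1 can be sketched roughly as follows. This is a minimal illustration, not the project's code: `SemanticCache`, `toy_embed`, and the in-memory list standing in for Redis are all hypothetical, and only the 0.8 threshold and the use of cosine similarity come from the article.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.8  # threshold quoted in the article


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory stand-in for the Redis-backed cache (hypothetical class)."""

    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = SIMILARITY_THRESHOLD) -> None:
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def lookup(self, prompt: str) -> Optional[str]:
        """Return the response stored for the most similar prompt, or None on a miss."""
        query = self._embed(prompt)
        best_score = 0.0
        best_response: Optional[str] = None
        for vector, response in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self._threshold else None

    def store(self, prompt: str, response: str) -> None:
        """Remember the embedding of a prompt together with its response."""
        self._entries.append((list(self._embed(prompt)), response))


def toy_embed(text: str) -> List[float]:
    """Toy letter-count embedding, used only so the sketch runs without a model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts
```

In a real deployment the `embed` callable would presumably wrap the all-MiniLM-L6-v2 model, and the linear scan over `_entries` would be replaced by a query against the vectors stored in Redis. The scan is kept here only to make the threshold logic visible.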

Section 04

Key Technical Implementation Details

Semantic Cache: all-MiniLM-L6-v2 encodes each input as a 384-dimensional vector. Redis stores the vectors of earlier requests, and a new request is compared against them by cosine similarity. The model was chosen for its balance between semantic understanding and efficiency.

Dynamic Batching: A dual-threshold strategy of 50 ms or 8 requests. A background thread monitors the queue and runs inference as soon as either condition is met, so requests are not left waiting for a full batch during low traffic.
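
The dual-threshold rule can be sketched with plain asyncio. This is an illustration under assumptions rather than the project's implementation: the names `collect_batch`, `batch_worker`, `submit`, and `run_model` are hypothetical, the worker runs as an asyncio task instead of the background thread the article mentions, and the 50 ms timer is assumed to start when the first request of a batch arrives.

```python
import asyncio
from typing import Callable, List, Tuple

MAX_BATCH_SIZE = 8   # size threshold quoted in the article
BATCH_WAIT_MS = 50   # time threshold quoted in the article

# A queued request: the prompt plus the future its caller is waiting on.
Request = Tuple[str, "asyncio.Future[str]"]


async def collect_batch(queue: "asyncio.Queue[Request]",
                        max_size: int = MAX_BATCH_SIZE,
                        wait_ms: int = BATCH_WAIT_MS) -> List[Request]:
    """Wait for one request, then gather more until either threshold is hit."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + wait_ms / 1000
    while len(batch) < max_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # time threshold reached with a partial batch
    return batch


async def batch_worker(queue: "asyncio.Queue[Request]",
                       run_model: Callable[[List[str]], List[str]]) -> None:
    """Loop forever: collect a batch, run inference off the event loop, route results."""
    while True:
        batch = await collect_batch(queue)
        prompts = [prompt for prompt, _ in batch]
        outputs = await asyncio.to_thread(run_model, prompts)
        for (_, future), output in zip(batch, outputs):
            if not future.done():
                future.set_result(output)


async def submit(queue: "asyncio.Queue[Request]", prompt: str) -> str:
    """Enqueue one prompt and wait for the result routed back to it."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future
```

With ten requests already queued, `collect_batch` returns eight immediately and the remaining two after the 50 ms wait expires, which is the low-traffic behavior described above. `asyncio.to_thread` requires Python 3.9 or later.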

Section 05

Performance Benchmark Test Results

Tested with k6 simulating 50 concurrent users:

  • Fully cached scenario: p95 latency is 385 ms, just over 100x faster than CPU mode without the cache (39 s);
  • Mixed load (30% repeated queries): the cache hit rate is 51% and the average response time is 32 seconds (CPU mode);
  • Dynamic batching is designed for GPU deployment, so its throughput gain is expected to be more significant on a GPU than in CPU mode.
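
As a quick check of the "over 100x" figure, the ratio of the two quoted latencies can be computed directly. The helper name `speedup` is hypothetical, and the article does not say which statistic the 39 s figure represents, so the comparison is only approximate.

```python
def speedup(baseline_s: float, improved_s: float) -> float:
    """Ratio of a baseline latency to an improved latency, both in seconds."""
    return baseline_s / improved_s


uncached_cpu_s = 39.0   # CPU mode without cache, as quoted in the article
cached_p95_s = 0.385    # p95 latency in the fully cached scenario

print(f"speedup: {speedup(uncached_cpu_s, cached_p95_s):.0f}x")  # prints "speedup: 101x"
```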

Section 06

Deployment and Operation Practices

A Docker Compose setup is provided that starts the Redis and FastAPI services with a single command (the first startup takes 5-7 minutes because the model must be downloaded). Production tuning centers on three settings: MAX_BATCH_SIZE (adjust to the available GPU memory), BATCH_WAIT_MS (adjust to traffic), and MODEL_NAME (can be set to another Hugging Face model). A React + Recharts dashboard shows real-time metrics such as throughput and latency.
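
One plausible way to wire up the three tuning settings is to read them from environment variables at startup. The sketch below assumes that is how they are supplied, which the article does not confirm. `EngineSettings` and `load_settings` are hypothetical names, the defaults of 8 and 50 mirror the thresholds quoted earlier, and the `gpt2` default is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EngineSettings:
    max_batch_size: int  # upper bound on requests per batch
    batch_wait_ms: int   # longest wait before a partial batch is dispatched
    model_name: str      # Hugging Face model identifier


def load_settings(env: Optional[Mapping[str, str]] = None) -> EngineSettings:
    """Read the tuning settings from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    settings = EngineSettings(
        max_batch_size=int(env.get("MAX_BATCH_SIZE", "8")),
        batch_wait_ms=int(env.get("BATCH_WAIT_MS", "50")),
        model_name=env.get("MODEL_NAME", "gpt2"),  # assumed default
    )
    if settings.max_batch_size < 1:
        raise ValueError("MAX_BATCH_SIZE must be at least 1")
    if settings.batch_wait_ms < 0:
        raise ValueError("BATCH_WAIT_MS must not be negative")
    return settings
```

Validating the values at startup means a misconfigured container fails immediately rather than stalling the batching loop later.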

Section 07

Application Scenarios and Expansion Directions

Application Scenarios: High-concurrency chat services, Model-as-a-Service (MaaS) backends, and edge deployments. Future Expansions: Multi-model support, streaming responses, INT8/INT4 quantization, and distributed deployment.

Section 08

Summary and Insights

This project demonstrates the key elements of a production-grade LLM inference service: dynamic batching balances latency against throughput, while semantic caching reduces computational cost. For teams planning or optimizing an LLM service, it offers a validated architectural reference to build on.