Section 01
Hands-On Distributed LLM Inference System: Guide to Thousand-Level Concurrency Architecture Design
This open-source project, built for the CSE354 Distributed Computing course, implements a distributed LLM inference system that supports over 1,000 concurrent users while balancing low latency and high availability. The system integrates RAG enhancement, three load-balancing strategies, and a complete fault-tolerance mechanism. It has been validated with the Llama 3.2 1B model on an RTX A6000 GPU on the Thunder Compute platform, providing a practical architectural reference for deploying LLM services in production.
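To make the load-balancing idea concrete, here is a minimal sketch of three common strategies a request router might choose from. The strategy names (round-robin, least-connections, weighted random) and the backend labels are illustrative assumptions, not necessarily the exact strategies this project ships:

```python
import itertools
import random

# Hypothetical sketch of three load-balancing strategies.
# Backend names ("gpu-0", ...) and the strategy set are assumptions
# for illustration, not the project's confirmed implementation.

class RoundRobin:
    """Cycle through backends in a fixed order."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Send each request to the backend with the fewest active requests."""
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when a request finishes so counts stay accurate.
        self.active[backend] -= 1

class WeightedRandom:
    """Pick backends at random, proportionally to a capacity weight."""
    def __init__(self, weights):  # e.g. {"gpu-0": 2, "gpu-1": 1}
        self.backends = list(weights)
        self.weights = list(weights.values())

    def pick(self):
        return random.choices(self.backends, weights=self.weights, k=1)[0]

backends = ["gpu-0", "gpu-1", "gpu-2"]
rr = RoundRobin(backends)
print([rr.pick() for _ in range(4)])  # cycles: gpu-0, gpu-1, gpu-2, gpu-0
```

In a real router, the choice between these trades simplicity (round-robin) against responsiveness to uneven request cost (least-connections) and heterogeneous GPU capacity (weighted).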