Section 01
Introduction to a Practical Local LLM Inference Server: Efficient Deployment of MoE Models on the RTX 5080
This article introduces llm-server, a local LLM inference server project optimized for consumer-grade GPUs. Built on the llama.cpp framework, it optimizes inference for Mixture of Experts (MoE) models and runs the Qwen3.5-35B-A3B model on an NVIDIA RTX 5080 16GB GPU at 75 tokens per second. It serves as a practical reference for local AI deployment in scenarios where privacy, offline use, or cost control matter.
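To put the headline figures in perspective, the following back-of-the-envelope sketch estimates how much memory the model weights alone might need and how long a response takes at the reported throughput. It is an illustration under stated assumptions, not a description of how llm-server is configured: the bits-per-weight values are hypothetical (the article does not state which quantization is used), the parameter counts are read from the model name (roughly 35B total, roughly 3B active per token), the 16GB card is treated as 16 GiB, and KV cache and activation memory are ignored.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decoding throughput."""
    return num_tokens / tokens_per_second


# Figures taken from the article; parameter counts inferred from the model name.
TOTAL_PARAMS = 35e9      # all experts combined (assumption from "35B")
ACTIVE_PARAMS = 3e9      # parameters used per token (assumption from "A3B")
VRAM_GIB = 16            # RTX 5080 memory, treated as GiB for simplicity
THROUGHPUT = 75.0        # tokens per second, as reported

if __name__ == "__main__":
    # Bit-widths below are illustrative assumptions, not the project's settings.
    for bits in (16, 8, 4):
        total = weight_memory_gib(TOTAL_PARAMS, bits)
        active = weight_memory_gib(ACTIVE_PARAMS, bits)
        fits = "fits" if total <= VRAM_GIB else "does not fit"
        print(f"{bits:>2}-bit: all weights {total:6.1f} GiB ({fits} in VRAM), "
              f"active per token {active:5.1f} GiB")

    secs = generation_seconds(500, THROUGHPUT)
    print(f"500-token reply at {THROUGHPUT:.0f} tok/s: {secs:.1f} s")
```

Under these assumptions, even a 4-bit copy of all 35B weights slightly exceeds 16 GiB, while the weights touched per token are a small fraction of that. This gap between total and active parameters is what makes MoE models an attractive target for memory-constrained consumer GPUs.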