Section 01
Wukong-Serve: Introduction to the Production-Grade LLM Inference Service Framework
Wukong-Serve is a production-grade LLM inference service layer built on FastAPI. It addresses the core challenges of running LLM deployments as stable, secure, and scalable services by wrapping underlying inference engines such as Ollama with enterprise-grade encapsulation and governance: Bearer authentication, Redis-backed token-bucket rate limiting, a circuit breaker in front of the Ollama backend, SSE streaming responses, stateful session management, and observability via Prometheus and Grafana.
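To make the gateway pattern concrete, here is a minimal sketch of how such a layer can sit between clients and Ollama. This is an illustrative assumption, not Wukong-Serve's actual API: the WUKONG_API_KEY variable, the /v1/chat route, the token-bucket parameters, and the Ollama URL are all hypothetical placeholders. It combines three of the features named above: Bearer authentication, a Redis token bucket, and SSE streaming.

```python
# Illustrative sketch only: route names, env vars, and limits are assumptions.
import os
import time

import httpx
import redis.asyncio as aioredis
from fastapi import Depends, FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()
rdb = aioredis.from_url("redis://localhost:6379/0")

OLLAMA_URL = "http://localhost:11434/api/chat"        # assumed local Ollama endpoint
API_KEY = os.environ.get("WUKONG_API_KEY", "change-me")  # hypothetical env var

# Atomic refill-and-take token bucket, run inside Redis to avoid races.
TOKEN_BUCKET = """
local key, rate, burst, now = KEYS[1], tonumber(ARGV[1]), tonumber(ARGV[2]), tonumber(ARGV[3])
local data = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(data[1]) or burst
local ts = tonumber(data[2]) or now
tokens = math.min(burst, tokens + (now - ts) * rate)
if tokens < 1 then return 0 end
redis.call('HMSET', key, 'tokens', tokens - 1, 'ts', now)
redis.call('EXPIRE', key, 60)
return 1
"""

async def authenticate(cred: HTTPAuthorizationCredentials = Depends(bearer)) -> str:
    # Bearer auth: a single static key here; a real deployment would
    # look up per-tenant keys instead.
    if cred.credentials != API_KEY:
        raise HTTPException(status_code=401, detail="invalid token")
    return cred.credentials

async def rate_limit(token: str = Depends(authenticate)) -> None:
    # Assumed limits: 5 requests/second with a burst of 10, per token.
    allowed = await rdb.eval(TOKEN_BUCKET, 1, f"bucket:{token}", 5, 10, time.time())
    if not allowed:
        raise HTTPException(status_code=429, detail="rate limit exceeded")

@app.post("/v1/chat", dependencies=[Depends(rate_limit)])
async def chat(request: Request):
    payload = await request.json()

    async def stream():
        # Proxy Ollama's line-delimited streaming output as SSE frames.
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", OLLAMA_URL, json=payload) as resp:
                async for line in resp.aiter_lines():
                    if line:
                        yield f"data: {line}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(stream(), media_type="text/event-stream")
```

The dependency chain (authenticate, then rate_limit) shows how FastAPI's dependency injection lets each governance concern compose independently; the circuit breaker and session management described above would slot in as further dependencies or middleware in the same way.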