Section 01
InferenceGateway Introduction: Core Design and Value of a High-Performance LLM Inference Gateway
InferenceGateway is a high-performance LLM inference request routing layer written in C++. It focuses on intelligently distributing client requests across backend LLM service replicas; it does not handle model loading or inference computation itself. Its core mechanisms include asynchronous batch scheduling, load-aware routing (e.g., the Power of Two Choices strategy), and Prometheus metrics collection. The gateway sustains sub-10ms scheduling latency at a throughput of 8,000 requests per second and can be deployed directly in front of mainstream LLM serving engines such as vLLM and SGLang.