InferenceGateway: Design and Implementation of a High-Performance LLM Inference Gateway

An in-depth analysis of a C++-based high-throughput LLM service frontend, covering its core mechanisms (asynchronous batch scheduling, load-aware routing, Prometheus metric collection) and how it maintains sub-10ms scheduling latency at 8000 requests per second.

LLM inference · Load balancing · C++ · High performance · Prometheus monitoring · Request routing · vLLM · Asynchronous batching · Power of Two Choices
Published 2026-04-28 16:43 · Recent activity 2026-04-28 16:48 · Estimated read 6 min

Section 01

InferenceGateway Introduction: Core Design and Value of a High-Performance LLM Inference Gateway

InferenceGateway is a C++-based high-performance LLM inference request routing layer. It focuses on intelligently distributing client requests across backend LLM service replicas and handles neither model loading nor inference computation. Its core mechanisms include asynchronous batch scheduling, load-aware routing (e.g., the Power of Two Choices strategy), and Prometheus metric collection. It maintains sub-10ms scheduling latency at a throughput of 8000 requests per second and can be deployed directly in front of mainstream LLM servers such as vLLM and sglang.


Section 02

Project Background and Design Goals

InferenceGateway is positioned as a pure request routing layer and does not undertake inference computation. During design, an OpenAI-compatible HTTP/JSON interface was chosen because mainstream LLM service stacks (vLLM, sglang, llama.cpp server mode, TGI) all natively support this format, allowing integration without modifying backend code.
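As a rough illustration of what the gateway forwards, the sketch below builds an OpenAI-style /v1/completions request body with nlohmann/json (the JSON library the project vendors). The model name and parameter values are made up for the example; the gateway treats the payload as opaque and only routes it.

```cpp
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    // Illustrative OpenAI-compatible /v1/completions payload. Any backend that
    // speaks this schema (vLLM, sglang, llama.cpp server, TGI) can accept it,
    // which is why the gateway needs no backend-specific adapters.
    nlohmann::json body = {
        {"model", "example-model"},   // hypothetical model name
        {"prompt", "Hello, world"},
        {"max_tokens", 64},
        {"temperature", 0.7}
    };
    std::cout << body.dump(2) << "\n";
}
```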


Section 03

Architecture and Core Scheduling Strategies

Overall Architecture: Client requests enter an MPSC (multi-producer, single-consumer) queue via the HTTP listener (cpp-httplib). The scheduler thread pulls requests from the queue, selects a backend using the configured load-balancing strategy, and forwards them. A single scheduler thread keeps backend-state updates atomic and predictable and avoids lock contention.
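A minimal sketch of that pipeline, assuming hypothetical Request/Backend types: the queue here is a mutex-protected std::queue standing in for the project's hand-rolled MPSC queue, and the upstream forwarding call is left as a comment.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <stop_token>
#include <string>
#include <thread>
#include <vector>

// Hypothetical minimal types; the real project's definitions will differ.
struct Request { std::string body; };
struct Backend { std::string url; int inflight = 0; };

class Scheduler {
public:
    explicit Scheduler(std::vector<Backend> backends)
        : backends_(std::move(backends)) {}

    // Multi-producer side: any HTTP listener thread may enqueue a request.
    void submit(Request r) {
        {
            std::lock_guard lk(mu_);
            queue_.push(std::move(r));
        }
        cv_.notify_one();
    }

private:
    // Single-consumer side: one scheduler thread owns all backend state, so
    // selection and in-flight bookkeeping need no further locking.
    void run(std::stop_token st) {
        while (true) {
            std::unique_lock lk(mu_);
            if (!cv_.wait(lk, st, [this] { return !queue_.empty(); }))
                return;                      // stop requested while idle
            Request r = std::move(queue_.front());
            queue_.pop();
            lk.unlock();

            Backend& b = pick_backend();
            ++b.inflight;
            // forward(b, r): issue the upstream HTTP call asynchronously and
            // decrement b.inflight when the response completes (omitted).
        }
    }

    // Placeholder selection; the strategies described below plug in here.
    Backend& pick_backend() { return backends_.front(); }

    std::vector<Backend> backends_;
    std::queue<Request> queue_;
    std::mutex mu_;
    std::condition_variable_any cv_;
    std::jthread worker_{[this](std::stop_token st) { run(st); }};
};

int main() {
    Scheduler s({{"http://127.0.0.1:8000"}, {"http://127.0.0.1:8001"}});
    s.submit({R"({"prompt": "hi"})"});
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // let it drain
}
```

Keeping all backend state on the single scheduler thread trades a little parallelism for the atomicity and predictability the article emphasizes.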

Scheduling Strategies:

  • Round Robin: simple, but cannot compensate for load imbalance;
  • Power of Two Choices (default): randomly sample two backends and pick the one with fewer in-flight requests; O(1) per decision with load-balancing quality close to optimal (see the sketch after this list);
  • Least Load: scan all backends and pick the one with the globally fewest in-flight requests; O(N) per decision, which may become a bottleneck under high concurrency.

The strategy is selected via a command-line flag at startup and cannot be switched at runtime.
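The two load-aware strategies are small enough to sketch directly. The snippet below uses a hypothetical Backend record carrying only an in-flight counter and shows the O(1) two-sample pick next to the O(N) full scan for contrast; the project's actual selection code will differ in detail.

```cpp
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Hypothetical backend record; only the in-flight counter matters here.
struct Backend { int inflight = 0; };

// Power of Two Choices: sample two distinct backends uniformly at random and
// keep the one with fewer in-flight requests. O(1) per decision, and far
// better balanced in expectation than a single random pick.
std::size_t pick_p2c(const std::vector<Backend>& backends, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> dist(0, backends.size() - 1);
    std::size_t a = dist(rng);
    std::size_t b = dist(rng);
    while (backends.size() > 1 && b == a)   // resample until the two samples differ
        b = dist(rng);
    return backends[a].inflight <= backends[b].inflight ? a : b;
}

// Least Load, for contrast: a full O(N) scan over every backend.
std::size_t pick_least_load(const std::vector<Backend>& backends) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < backends.size(); ++i)
        if (backends[i].inflight < backends[best].inflight) best = i;
    return best;
}

int main() {
    std::mt19937 rng{std::random_device{}()};
    std::vector<Backend> backends{{3}, {0}, {7}, {1}};
    std::cout << "P2C picked backend "        << pick_p2c(backends, rng) << "\n";
    std::cout << "least-load picked backend " << pick_least_load(backends) << "\n";
}
```

At 8000 requests per second the per-decision cost matters, which is why the constant-cost two-sample pick is the default rather than the full scan.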

Section 04

Key Mechanisms: Batching and Health Checks

Asynchronous Batching: For backends that support batched requests (e.g., vLLM's /v1/completions), multiple small requests arriving within a 500-microsecond window are collected, merged into a single upstream call, and the results split and returned to their original callers, improving throughput (an opt-in feature).
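A rough sketch of the idea, under the assumption that the backend accepts a list of prompts in a single /v1/completions body (vLLM does): requests arriving inside the ~500µs window are merged into one upstream call, whose choices are then split back out by index. The names and helper signatures here are invented for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <optional>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

using namespace std::chrono;

// Hypothetical pending request: just the prompt; the reply handle is omitted.
struct Pending { std::string prompt; };

// Drain whatever arrives inside a ~500 microsecond window (or until the batch
// is full). A real implementation would block briefly instead of spinning.
template <typename TryPop>
std::vector<Pending> collect_window(TryPop try_pop, std::size_t max_batch) {
    std::vector<Pending> batch;
    const auto deadline = steady_clock::now() + microseconds(500);
    while (batch.size() < max_batch && steady_clock::now() < deadline) {
        if (std::optional<Pending> req = try_pop())
            batch.push_back(std::move(*req));
    }
    return batch;
}

// Merge the batch into one /v1/completions body; the returned choices carry an
// "index" field that maps each completion back to its original caller.
nlohmann::json merge_batch(const std::vector<Pending>& batch) {
    nlohmann::json prompts = nlohmann::json::array();
    for (const auto& p : batch) prompts.push_back(p.prompt);
    return {{"model", "example-model"},   // illustrative model name
            {"prompt", prompts},
            {"max_tokens", 64}};
}

int main() {
    std::vector<Pending> supply{{"first prompt"}, {"second prompt"}};
    std::size_t i = 0;
    auto batch = collect_window(
        [&]() -> std::optional<Pending> {
            if (i < supply.size()) return supply[i++];
            return std::nullopt;
        },
        /*max_batch=*/8);
    std::cout << merge_batch(batch).dump(2) << "\n";
}
```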

Health Checks: The backend's /v1/models endpoint is probed every 5 seconds. Two consecutive failures mark the backend unhealthy and exclude it from routing; once it recovers, it is immediately re-admitted. This prevents requests from continuing to be routed to, and stalling on, a failed node.
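A minimal sketch of such a probe loop using cpp-httplib (the HTTP library the gateway already vendors). The struct fields and the two-second connect timeout are assumptions; only the 5-second interval and the two-strike rule come from the article.

```cpp
#include <atomic>
#include <chrono>
#include <stop_token>
#include <string>
#include <thread>
#include <httplib.h>

// Hypothetical per-backend health record; the project's real structure differs.
struct BackendHealth {
    std::string base_url;              // e.g. "http://127.0.0.1:8000"
    int consecutive_failures = 0;
    std::atomic<bool> healthy{true};
};

// Probe /v1/models every 5 seconds: two consecutive failures mark the backend
// unhealthy and exclude it from routing; a single success re-admits it.
void health_check_loop(BackendHealth& b, std::stop_token st) {
    while (!st.stop_requested()) {
        httplib::Client cli(b.base_url);
        cli.set_connection_timeout(2);   // assumed timeout so a dead node can't stall the probe
        auto res = cli.Get("/v1/models");
        if (res && res->status == 200) {
            b.consecutive_failures = 0;
            b.healthy = true;            // recovered backends rejoin immediately
        } else if (++b.consecutive_failures >= 2) {
            b.healthy = false;
        }
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
}

int main() {
    BackendHealth backend{"http://127.0.0.1:8000"};
    std::jthread probe([&](std::stop_token st) { health_check_loop(backend, st); });
    std::this_thread::sleep_for(std::chrono::seconds(12));   // observe a couple of probes
}
```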


Section 05

Observability and Performance Verification

Metric Collection: A built-in Prometheus /metrics endpoint (hand-implemented, no external dependencies) exposes metrics including total requests, a latency histogram, in-flight request count, backend health status, and scheduler queue depth.
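The Prometheus exposition format is plain text, so a hand-rolled /metrics handler is essentially string assembly. A minimal sketch, with illustrative metric names (not necessarily those the project uses) and the latency histogram omitted for brevity:

```cpp
#include <atomic>
#include <cstdint>
#include <sstream>
#include <string>
#include <httplib.h>

// Illustrative counters/gauges; the project exposes more (including a latency
// histogram, which needs per-bucket counters and is omitted here).
struct Metrics {
    std::atomic<std::uint64_t> requests_total{0};
    std::atomic<std::int64_t>  inflight{0};
    std::atomic<std::uint64_t> queue_depth{0};

    std::string render() const {
        std::ostringstream out;
        out << "# HELP gateway_requests_total Total requests accepted.\n"
               "# TYPE gateway_requests_total counter\n"
            << "gateway_requests_total " << requests_total.load() << "\n"
            << "# HELP gateway_inflight_requests Requests currently being proxied.\n"
               "# TYPE gateway_inflight_requests gauge\n"
            << "gateway_inflight_requests " << inflight.load() << "\n"
            << "# HELP gateway_queue_depth Scheduler queue depth.\n"
               "# TYPE gateway_queue_depth gauge\n"
            << "gateway_queue_depth " << queue_depth.load() << "\n";
        return out.str();
    }
};

int main() {
    Metrics metrics;
    metrics.requests_total = 42;   // pretend some traffic has been served

    httplib::Server srv;
    srv.Get("/metrics", [&](const httplib::Request&, httplib::Response& res) {
        res.set_content(metrics.render(), "text/plain; version=0.0.4");
    });
    // srv.listen("0.0.0.0", 9102);   // uncomment to actually serve the endpoint
}
```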

Performance Testing: Sustained throughput of 8000 requests per second on 4 simulated backends, with P99 scheduling overhead below 10ms. The test suite covers unit tests, integration tests, and performance benchmarks.


Section 06

Engineering Implementation and Limitations

Engineering Details: Written in C++20 and built with CMake, with minimal dependencies (cpp-httplib and nlohmann/json are both vendored); the MPSC queue is implemented by hand to avoid pulling in an external library.
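The article does not describe how that hand-rolled MPSC queue works internally. One common dependency-free shape for such a queue is Vyukov's non-blocking MPSC list, sketched below under that assumption; the project's actual queue may be structured quite differently (e.g. mutex-based or bounded).

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <utility>

// Sketch of an unbounded MPSC queue in the style of Vyukov's algorithm:
// producers only touch head_ with an atomic exchange, the single consumer
// only touches tail_, so the two sides never contend on the same field.
template <typename T>
class MpscQueue {
    struct Node {
        std::atomic<Node*> next{nullptr};
        T value{};
    };
    std::atomic<Node*> head_;   // most recently pushed node
    Node* tail_;                // consumer-owned; oldest (stub) node

public:
    MpscQueue() {
        Node* stub = new Node{};
        head_.store(stub, std::memory_order_relaxed);
        tail_ = stub;
    }

    ~MpscQueue() {
        T drop;
        while (pop(drop)) {}
        delete tail_;
    }

    // Callable from any number of producer threads.
    void push(T v) {
        Node* n = new Node{};
        n->value = std::move(v);
        Node* prev = head_.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release);   // publish the link
    }

    // Callable from exactly one consumer thread.
    bool pop(T& out) {
        Node* next = tail_->next.load(std::memory_order_acquire);
        if (!next) return false;             // empty (or a push is mid-flight)
        out = std::move(next->value);
        delete std::exchange(tail_, next);   // old stub retires, next becomes stub
        return true;
    }
};

int main() {
    MpscQueue<int> q;
    std::thread p1([&] { for (int i = 0;    i < 1000; ++i) q.push(i); });
    std::thread p2([&] { for (int i = 1000; i < 2000; ++i) q.push(i); });
    p1.join();
    p2.join();
    int v, count = 0;
    while (q.pop(v)) ++count;                // single consumer drains afterwards
    std::cout << "popped " << count << " items\n";   // expect 2000
}
```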

Limitations: No model loading or management; streaming responses are passed through, but without session affinity; no mTLS, authentication, or rate limiting (a reverse proxy placed in front is expected to provide these).

Future Directions: gRPC support is planned; proto files have been drafted, and adding a gRPC listener is estimated at about 300 lines of code.


Section 07

Practical Significance and Insights

InferenceGateway demonstrates an approach to building high-performance LLM infrastructure under tight resource constraints: a clearly bounded scope (routing only), pragmatic technical choices (HTTP/JSON instead of gRPC), measurable performance goals, and complete observability. For LLM serving platform teams, it can serve as a lightweight, production-ready reference implementation, valuable both for studying scheduling algorithms and for direct deployment.