RouteLLM-rs's architecture reflects Rust's strengths in systems programming:
Request Reception Layer
The system exposes an interface compatible with the OpenAI API, receives client inference requests, and can be seamlessly integrated into the existing LLM application ecosystem as a transparent proxy.
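As a rough illustration, the request body the proxy inspects can be modeled as a small struct mirroring a subset of an OpenAI-style chat completion request (the real body is JSON; the struct, field choice, and `routing_key` helper here are hypothetical simplifications, not RouteLLM-rs's actual types):

```rust
/// Simplified subset of an OpenAI-style chat completion request.
/// Real requests arrive as JSON; only the fields the router cares
/// about are shown here.
struct ChatRequest {
    model: String,
    /// (role, content) pairs of the conversation.
    messages: Vec<(String, String)>,
    temperature: f32,
}

/// Derive the key the router later hashes: model plus prompt content,
/// so identical prompts tend to land on the same backend.
fn routing_key(req: &ChatRequest) -> String {
    let mut key = req.model.clone();
    for (_, content) in &req.messages {
        key.push(':');
        key.push_str(content);
    }
    key
}
```

Because the interface is OpenAI-compatible, existing clients only need their base URL pointed at the proxy.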
Routing Decision Layer
After receiving a request, the router extracts key features (model name, prompt content, parameter configuration, and so on), computes a hash over them, and locates the target backend node on the consistent hash ring. The decision also weighs:
- Node Health Status: Regular health checks to automatically exclude faulty nodes;
- Current Load: Real-time monitoring of the number of concurrent requests and processing delays of each node;
- Cache Affinity: Prefer nodes that may have relevant caches.
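The core lookup above can be sketched as a minimal consistent hash ring with virtual nodes. This is a self-contained illustration using the standard library's SipHash-based `DefaultHasher` in place of MurmurHash3/CityHash; the `HashRing` type and its methods are assumptions for this sketch, not RouteLLM-rs's actual API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// A minimal consistent hash ring with virtual nodes.
struct HashRing {
    /// Maps a point on the ring to the owning backend node.
    ring: BTreeMap<u64, String>,
    /// Number of virtual replicas per physical node.
    vnodes: usize,
}

impl HashRing {
    fn new(vnodes: usize) -> Self {
        HashRing { ring: BTreeMap::new(), vnodes }
    }

    fn hash<T: Hash>(item: &T) -> u64 {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        h.finish()
    }

    /// Place `vnodes` virtual replicas of `node` on the ring.
    fn add_node(&mut self, node: &str) {
        for i in 0..self.vnodes {
            let point = Self::hash(&format!("{node}#{i}"));
            self.ring.insert(point, node.to_string());
        }
    }

    /// Remove a node, e.g. after its health check fails.
    fn remove_node(&mut self, node: &str) {
        self.ring.retain(|_, v| v.as_str() != node);
    }

    /// Route a request key to the first node clockwise from its hash,
    /// wrapping around to the start of the ring if necessary.
    fn route(&self, key: &str) -> Option<&str> {
        let h = Self::hash(&key);
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, v)| v.as_str())
    }
}
```

Virtual nodes smooth out the key distribution, and removing a failed node only remaps the keys that hashed to its replicas, leaving the rest of the traffic undisturbed.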
Backend Connection Pool
Maintains a pool of persistent connections to each backend inference node, avoiding the overhead of establishing a new connection for every request; HTTP/2 multiplexing is supported to improve throughput.
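The reuse pattern behind such a pool can be sketched with a simple checkout/check-in structure. In the real proxy the pooled values would be HTTP/2 connection handles (e.g. from hyper or reqwest); this std-only sketch uses a generic `Conn` placeholder, and the `Pool` type is an assumption for illustration:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// A minimal checkout/check-in pool of reusable connections.
/// `Conn` stands in for a real HTTP/2 connection handle.
struct Pool<Conn> {
    idle: Mutex<VecDeque<Conn>>,
    max_idle: usize,
}

impl<Conn> Pool<Conn> {
    fn new(max_idle: usize) -> Self {
        Pool { idle: Mutex::new(VecDeque::new()), max_idle }
    }

    /// Reuse an idle connection if one exists, otherwise dial a new one.
    fn checkout(&self, dial: impl FnOnce() -> Conn) -> Conn {
        let reused = self.idle.lock().unwrap().pop_front();
        reused.unwrap_or_else(dial)
    }

    /// Return a connection for reuse; drop it if the pool is full.
    fn checkin(&self, conn: Conn) {
        let mut idle = self.idle.lock().unwrap();
        if idle.len() < self.max_idle {
            idle.push_back(conn);
        }
    }
}
```

A production pool would additionally evict connections that have gone stale or exceeded an idle timeout, but the checkout/check-in cycle is the part that eliminates per-request connection setup.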
Response Handling and Monitoring
Responses are streamed back to the client, while detailed metrics (routing decision time, backend processing latency, cache hit status, etc.) are recorded to support operational monitoring and tuning.
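The metrics side can be sketched as a handful of atomic counters rendered in the Prometheus text exposition format. The `Metrics` struct and metric names below are hypothetical, chosen only to illustrate the recording pattern:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Counters exported on the metrics endpoint (Prometheus-style).
#[derive(Default)]
struct Metrics {
    requests_total: AtomicU64,
    cache_hits_total: AtomicU64,
    routing_micros_total: AtomicU64,
}

impl Metrics {
    /// Record one completed request: count it, note any cache hit,
    /// and accumulate how long the routing decision took.
    fn record(&self, routing_started: Instant, cache_hit: bool) {
        self.requests_total.fetch_add(1, Ordering::Relaxed);
        if cache_hit {
            self.cache_hits_total.fetch_add(1, Ordering::Relaxed);
        }
        let micros = routing_started.elapsed().as_micros() as u64;
        self.routing_micros_total.fetch_add(micros, Ordering::Relaxed);
    }

    /// Render in the Prometheus text exposition format.
    fn render(&self) -> String {
        format!(
            "router_requests_total {}\nrouter_cache_hits_total {}\nrouter_routing_micros_total {}\n",
            self.requests_total.load(Ordering::Relaxed),
            self.cache_hits_total.load(Ordering::Relaxed),
            self.routing_micros_total.load(Ordering::Relaxed),
        )
    }
}
```

Because the counters are atomics, request handlers can record metrics without taking a lock on the hot path; a scrape simply reads the current values.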
Deployment configuration uses the TOML format; a typical configuration includes:
- Backend Node List: Specify available inference service addresses and weights;
- Hash Strategy: Select hash algorithms (such as MurmurHash3, CityHash) and the number of virtual nodes;
- Health Check Parameters: Define check intervals, timeout periods, and failure thresholds;
- Cache Configuration: Enable/disable request/response caching, set cache size and expiration policies;
- Monitoring Endpoint: Configure the Prometheus metrics exposure port.
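Putting these together, a configuration file along these lines could cover all five areas (the table and key names here are illustrative assumptions, not RouteLLM-rs's documented schema):

```toml
# Illustrative routellm-rs configuration; key names are hypothetical.
[[backend]]
url = "http://10.0.0.1:8000"
weight = 2

[[backend]]
url = "http://10.0.0.2:8000"
weight = 1

[hash]
algorithm = "murmur3"   # or "cityhash"
virtual_nodes = 160

[health_check]
interval_secs = 5
timeout_secs = 2
failure_threshold = 3

[cache]
enabled = true
max_size_mb = 512
ttl_secs = 300

[metrics]
prometheus_port = 9090
```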