Zing Forum

InferRouter: A Self-Hosted Multi-Provider LLM Inference Proxy for .NET

InferRouter is a self-hosted LLM inference proxy designed for .NET projects, offering a unified OpenAI-compatible interface, supporting multi-provider failover, rate limit tracking, and structured operation logs to enable seamless model switching and local GGUF fallback.

.NET, LLM proxy, OpenAI compatible, multi-provider, failover, GGUF, LlamaSharp, rate limiting
Published 2026-05-27 02:42 | Recent activity 2026-05-27 02:49 | Estimated read 7 min

Section 01

InferRouter: Core Introduction to the Self-Hosted Multi-Provider LLM Inference Proxy for .NET

InferRouter is a self-hosted LLM inference proxy developed and maintained by vvidman and designed specifically for .NET projects. It was released on GitHub on May 26, 2026 (original link: https://github.com/vvidman/InferRouter). Its core features are a unified OpenAI-compatible interface, multi-provider failover, rate limit tracking, structured operation logs, and local GGUF model fallback (based on LlamaSharp). Together these let developers switch models without changing client code and keep serving requests when a provider is unavailable.

Section 02

Challenges of LLM Multi-Provider Integration and Limitations of Traditional Solutions

As the LLM ecosystem grows, developers increasingly need to switch between multiple providers: any single provider may suffer outages, impose rate limits, or be a poor fit for a particular task. The traditional approach is to hard-code several SDKs, handle failover by hand, and manage API keys scattered across the codebase, which makes the code complex and hard to extend. InferRouter addresses this with a unified interface and configurable routing, so callers get multi-provider resilience without having to know which provider served a request.

Section 03

Analysis of Core Architecture and Key Mechanisms

InferRouter adopts a layered architecture, with core components including:

  1. Unified API Layer: Exposes an OpenAI-compatible /v1/chat/completions endpoint, so existing OpenAI clients can be pointed at the proxy with minimal changes.
  2. Failover Executor: Tries providers in the configured order and automatically switches to the next one when it encounters a recoverable error (e.g., a 429 rate limit response).
  3. Rate Limit Tracker: Maintains local quota counts, resetting daily counts at UTC midnight and tracking requests per minute (RPM) over a 60-second sliding window, so that requests expected to exceed a known limit are not sent.
  4. Error Normalizer: Converts errors from different providers into unified categories (RateLimit, AuthError, etc.) so that failover logic behaves consistently.
  5. Operation Logs: Writes structured logs in JSONL format, including the request ID, provider, model, and token consumption, for monitoring and debugging.
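
The interaction between the Failover Executor and the Error Normalizer can be sketched roughly as follows. This is an illustrative Python sketch, not InferRouter's actual .NET code: the names `ProviderError`, `Provider`, `normalize`, and `run_with_failover` are hypothetical, the status-code mapping is an assumption, and so is the choice to treat only rate-limit errors as recoverable (the source names 429 only as an example).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class ErrorCategory(Enum):
    RATE_LIMIT = "RateLimit"
    AUTH_ERROR = "AuthError"
    UNKNOWN = "Unknown"  # hypothetical catch-all, not named in the source


class ProviderError(Exception):
    """Hypothetical error raised by a provider call, carrying an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def normalize(err: ProviderError) -> ErrorCategory:
    """Map a provider-specific failure onto a shared category (assumed mapping)."""
    if err.status == 429:
        return ErrorCategory.RATE_LIMIT
    if err.status in (401, 403):
        return ErrorCategory.AUTH_ERROR
    return ErrorCategory.UNKNOWN


# Assumption: only rate-limit errors trigger a switch to the next provider.
RECOVERABLE = {ErrorCategory.RATE_LIMIT}


@dataclass
class Provider:
    name: str
    call: Callable[[str], str]  # takes a prompt, returns a completion


def run_with_failover(
    providers: List[Provider], prompt: str
) -> Tuple[str, str, List[Tuple[str, ErrorCategory]]]:
    """Try providers in configured order; return (provider name, reply, failed attempts)."""
    attempts: List[Tuple[str, ErrorCategory]] = []
    for provider in providers:
        try:
            return provider.name, provider.call(prompt), attempts
        except ProviderError as err:
            category = normalize(err)
            attempts.append((provider.name, category))
            if category not in RECOVERABLE:
                raise  # non-recoverable errors surface to the caller
    raise RuntimeError(f"all providers failed: {attempts}")
```

Because the executor only looks at the normalized category, adding a provider with a different error format would only require extending `normalize`, not the failover loop.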

Section 04

Flexible Configuration and Local GGUF Model Support

The provider chain is defined in configuration files, so it can be adjusted without modifying code. Two provider types are supported: openai_compatible (cloud providers that expose an OpenAI-compatible interface) and local_gguf (local models). The sample configuration includes quota control (a daily request limit and a per-minute limit) and error mapping rules. Local GGUF models are integrated via LlamaSharp and run in-process as the final fallback, which suits offline or privacy-sensitive scenarios.
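
To make the quota settings concrete, here is a minimal Python sketch of a provider chain with a daily limit and a per-minute limit, plus a tracker that enforces both. The key names (`daily_limit`, `rpm_limit`), the provider names, and the `QuotaTracker` class are hypothetical and do not reflect InferRouter's real configuration schema; only the two provider types, the UTC midnight reset, and the 60-second sliding window come from the description above.

```python
import time
from collections import deque
from datetime import datetime, timezone
from typing import Optional

# Hypothetical provider chain; key names are illustrative only.
PROVIDER_CHAIN = [
    {"name": "cloud-a", "type": "openai_compatible", "daily_limit": 1000, "rpm_limit": 30},
    {"name": "local", "type": "local_gguf", "daily_limit": None, "rpm_limit": None},
]


class QuotaTracker:
    """Local quota counter: daily count resets at UTC midnight, RPM uses a 60 s sliding window."""

    def __init__(self, daily_limit: Optional[int], rpm_limit: Optional[int]):
        self.daily_limit = daily_limit
        self.rpm_limit = rpm_limit
        self._day = None          # UTC date the daily count belongs to
        self._daily_count = 0
        self._recent = deque()    # timestamps of requests in the last 60 seconds

    def try_acquire(self, now: Optional[float] = None) -> bool:
        """Return True and record the request if both limits allow it."""
        now = time.time() if now is None else now

        # Reset the daily counter when the UTC date changes.
        day = datetime.fromtimestamp(now, tz=timezone.utc).date()
        if day != self._day:
            self._day = day
            self._daily_count = 0

        # Drop timestamps that have left the 60-second window.
        while self._recent and now - self._recent[0] >= 60:
            self._recent.popleft()

        if self.daily_limit is not None and self._daily_count >= self.daily_limit:
            return False
        if self.rpm_limit is not None and len(self._recent) >= self.rpm_limit:
            return False

        self._daily_count += 1
        self._recent.append(now)
        return True
```

A router could call `try_acquire()` before contacting a provider and skip straight to the next entry in the chain when it returns `False`, which avoids spending a round trip on a request that is expected to be rejected.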

Section 05

Security Design and Observability Assurance

Security: API keys are managed with Docker Secrets and mounted as files under /run/secrets/. This avoids leaking keys through environment variables and allows keys to be rotated without restarting the service.

Observability: Operation logs are written in JSONL format and include event types such as infer_started, infer_completed, and infer_fallback. They can be shipped to platforms such as ELK or Grafana Loki for real-time monitoring, alerting, and cost analysis.
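
Both ideas are simple enough to sketch in a few lines of Python. The helper names `read_api_key` and `log_event` and the field names `ts` and `event` are hypothetical; only the /run/secrets/ mount point, the JSONL format, and the event type names come from the description above. Re-reading the secret file on each call is one plausible way to pick up a rotated key without a restart, not necessarily how InferRouter does it.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def read_api_key(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Read a key from its mounted secret file.

    Reading on every call (rather than caching at startup) means a rotated
    file is picked up without restarting the process. This is an assumption.
    """
    return Path(secrets_dir, name).read_text(encoding="utf-8").strip()


def log_event(log_path: str, event: str, **fields) -> dict:
    """Append one JSON object per line (JSONL) describing an operation event."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,  # e.g. infer_started, infer_completed, infer_fallback
        **fields,        # e.g. request_id, provider, model, token counts
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Since each line is a standalone JSON object, log shippers can tail the file and parse it line by line without any custom multi-line handling.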

Section 06

Deployment Methods and Applicable Scenarios

Tech Stack: Built on .NET 10 and ASP.NET Core Minimal API; local inference relies on LlamaSharp 0.20.0.

Deployment: Deployed via Docker Compose with a concise configuration that covers key mounting, the model directory, and log directory mapping.

Applicable Scenarios: High availability requirements (multi-provider redundancy), cost optimization (prioritizing low-cost providers), model diversity (matching different models to different tasks), and data privacy compliance (local models keep data from leaving the host).

Section 07

Summary: The Value and Significance of InferRouter

InferRouter moves LLM application architecture away from tight coupling to a single provider and toward a flexible, configurable multi-provider proxy, addressing production requirements for security, observability, high availability, and cost-effectiveness. For .NET developers, it offers a ready-made abstraction layer that removes the need to handle provider API differences or write complex failover logic by hand.