Reading

InferRouter: A Self-Hosted Multi-Provider LLM Inference Proxy for .NET

InferRouter is a self-hosted LLM inference proxy designed for .NET projects, offering a unified OpenAI-compatible interface, supporting multi-provider failover, rate limit tracking, and structured operation logs to enable seamless model switching and local GGUF fallback.

.NETLLM proxyOpenAI compatiblemulti-providerfailoverGGUFLlamaSharprate limiting

Published 2026-05-27 02:42Recent activity 2026-05-27 02:49Estimated read 7 min

InferRouter: A Self-Hosted Multi-Provider LLM Inference Proxy for .NET

Section 01

InferRouter: Core Introduction to the Self-Hosted Multi-Provider LLM Inference Proxy for .NET

InferRouter is a self-hosted LLM inference proxy developed and maintained by vvidman, designed specifically for .NET projects. It was released on GitHub on May 26, 2026 (original link: https://github.com/vvidman/InferRouter). Its core features include: providing a unified OpenAI-compatible interface, supporting multi-provider failover, rate limit tracking, structured operation logs, and local GGUF model fallback (based on LlamaSharp), helping developers achieve seamless model switching and high availability.

Section 02

Challenges of LLM Multi-Provider Integration and Limitations of Traditional Solutions

With the development of the LLM ecosystem, developers face challenges in flexible switching between multiple providers: a single provider may have service outages, rate limits, or task adaptability issues. Traditional solutions require hard-coding multiple SDKs, manually handling failover, and managing API keys in a decentralized way, leading to high code complexity and difficulty in expansion. InferRouter aims to solve these problems by providing a unified interface and intelligent routing, allowing callers to enjoy multi-provider elasticity without awareness.

Section 03

Analysis of Core Architecture and Key Mechanisms

InferRouter adopts a layered architecture, with core components including:

Unified API Layer: Exposes an OpenAI-compatible /v1/chat/completions endpoint externally, supporting seamless migration of all OpenAI clients.
Failover Executor: Tries providers in the configured order, automatically switching to the next one when encountering recoverable errors (e.g., 429 rate limit).
Rate Limit Tracker: Maintains local quota counts, supports UTC midnight reset and 60-second sliding window RPM tracking to avoid invalid requests.
Error Normalizer: Converts errors from different providers into unified categories (RateLimit, AuthError, etc.) to ensure consistent failover logic.
Operation Logs: Generates structured logs in JSONL format, including information such as request ID, provider, model, token consumption, etc., for easy monitoring and debugging.

Section 04

Flexible Configuration and Local GGUF Model Support

The provider chain is defined via configuration files, which can be adjusted without modifying the code. The configuration supports two types: openai_compatible (cloud providers compatible with OpenAI interface) and local_gguf (local models). The sample configuration includes quota control (daily request limit, per-minute limit) and error mapping rules. Local GGUF models are integrated via LlamaSharp, serving as the final fallback, running in-process, suitable for offline or privacy-sensitive scenarios.

Section 05

Security Design and Observability Assurance

Security: Uses Docker Secrets to manage API keys, which are mounted as files (/run/secrets/), avoiding environment variable leaks, supporting rotation without restarting the service. Observability: Operation logs are in JSONL format, including event types such as infer_started, infer_completed, infer_fallback, etc. They can be integrated with platforms like ELK and Grafana Loki to achieve real-time monitoring, alerting, and cost analysis.

Section 06

Deployment Methods and Applicable Scenarios

Tech Stack: Based on .NET 10 and ASP.NET Core Minimal API, local inference relies on LlamaSharp 0.20.0. Deployment: Deployed via Docker Compose with concise configuration, supporting key mounting, model directory, and log directory mapping. Applicable Scenarios: High availability requirements (multi-provider redundancy), cost optimization (prioritizing low-cost providers), model diversity (adapting different models for tasks), data privacy compliance (local models avoid data outflow).

Section 07

Summary: The Value and Significance of InferRouter

InferRouter promotes the evolution of LLM application architecture from tightly coupled single-provider to flexible, configurable multi-provider proxy, meeting the needs of production environments for security, observability, high availability, and cost-effectiveness. For .NET developers, it provides an out-of-the-box solution, eliminating the need to handle provider API differences or complex failover logic, and serves as an important abstraction layer in the evolution of the LLM ecosystem.

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

Folkering OS: When the Operating System Itself Is AI—A Self-Evolving Bare-Metal Rust System

Folkering OS is the world's first AI-native bare-metal operating system, entirely written in Rust no_std without relying on Linux, POSIX, or libc. It can generate commands from scratch, compile them into WASM, and run them in 10 seconds, achieving true self-evolution.

Recent activity 2026-04-09 16:15