InferHub: A Self-Hosted LLM Inference Grid System Based on .NET

This article introduces InferHub, a self-hosted large language model (LLM) inference grid system built with .NET, which enables flexible distributed inference deployment through an Ollama-compatible API frontend and a GPU worker node pool.

Tags: LLM inference, distributed systems, Ollama, GPU cluster, load balancing, self-hosted, microservice architecture, API gateway

Published 2026-06-12 05:44 | Recent activity 2026-06-12 05:51 | Estimated read 7 min

Section 01

InferHub: Introduction to the .NET-Based Self-Hosted LLM Inference Grid System

InferHub is a self-hosted LLM inference grid system developed by Dev-Art-Solutions and built on .NET. It decouples an Ollama-compatible API gateway from a pool of GPU worker nodes, so that inference can be deployed in a distributed fashion. Its core purpose is to remove the tight coupling between inference services and GPU resources found in traditional LLM deployments. The stated advantages are flexible resource reuse and cost optimization, with support for both self-hosted and hybrid deployment scenarios.

Section 02

Project Background and Core Concepts

Traditional LLM deployments tightly couple the inference service to GPU resources. Applications running in environments without GPUs must call a remote inference service, which adds latency and complexity. InferHub uses a grid architecture that decouples the API gateway layer from the inference computing layer, so each can be deployed where it fits best: gateways run on low-cost CPU servers, while the inference layer runs on GPU machines. Because the gateway exposes an Ollama-compatible API, it integrates with the existing Ollama ecosystem, and users can migrate without modifying client code.
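To illustrate what "migrating without modifying client code" can look like, here is a minimal Python sketch that builds the same Ollama-style request for a direct Ollama instance and for a gateway. The `/api/generate` path and the `model`, `prompt`, and `stream` fields come from Ollama's public API. The gateway hostname, its port, and the model name are hypothetical placeholders, not values taken from InferHub.

```python
import json
import urllib.request

# Hypothetical addresses: only the base URL differs between the two setups.
OLLAMA_BASE_URL = "http://localhost:11434"                     # direct Ollama instance
GATEWAY_BASE_URL = "http://inferhub.example.internal:11434"    # assumed gateway address


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an Ollama-style /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


direct = build_generate_request(OLLAMA_BASE_URL, "llama3", "Why is the sky blue?")
via_gateway = build_generate_request(GATEWAY_BASE_URL, "llama3", "Why is the sky blue?")

# The request body is identical; only the target host changes.
print(direct.data == via_gateway.data)
print(via_gateway.full_url)
```

In this sketch the client logic stays the same and only a configuration value (the base URL) changes, which is the kind of migration the compatibility claim describes.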

Section 03

Architecture Design and Working Principles

InferHub uses a three-tier architecture:

1. API Gateway Layer (Hub): Receives requests and handles routing, load balancing, and failover. It is stateless and can be scaled horizontally.
2. Inference Node Layer (Nodes): GPU servers running Ollama, which register with the gateway and report their status.
3. Backend Adaptation Layer: A pluggable design that currently supports Ollama and is planned to expand to vLLM and other backends.

Workflow: The client sends an Ollama-compatible request → the gateway selects the optimal node → the gateway forwards the request → the result is returned to the client. The process is transparent to the client.
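The register, select, forward, and failover steps above can be sketched as follows. This is an illustrative Python model, not InferHub's .NET code: the `Node` fields, the `Hub` class, and the "fewest active requests" selection rule are assumptions made for the example, since the article does not specify how the optimal node is chosen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    """A GPU worker as a gateway might track it (all fields are assumptions)."""
    name: str
    models: List[str]
    active_requests: int = 0
    healthy: bool = True


class Hub:
    """Toy gateway: keeps a node registry and routes requests."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def register(self, node: Node) -> None:
        # Nodes announce themselves to the gateway.
        self.nodes[node.name] = node

    def candidates(self, model: str) -> List[Node]:
        # Healthy nodes serving the model, least loaded first.
        eligible = [n for n in self.nodes.values() if n.healthy and model in n.models]
        return sorted(eligible, key=lambda n: n.active_requests)

    def dispatch(self, model: str, prompt: str, send: Callable[[Node, str], str]) -> str:
        # Try nodes in order of load; on a connection error, mark the node
        # unhealthy and fail over to the next candidate.
        for node in self.candidates(model):
            node.active_requests += 1
            try:
                return send(node, prompt)
            except ConnectionError:
                node.healthy = False
            finally:
                node.active_requests -= 1
        raise RuntimeError(f"no healthy node serves model {model!r}")


def fake_send(node: Node, prompt: str) -> str:
    """Stand-in for forwarding the request over the network."""
    if node.name == "gpu-a":
        raise ConnectionError("node unreachable")
    return f"{node.name} answered: {prompt}"


hub = Hub()
hub.register(Node("gpu-a", ["llama3"], active_requests=0))
hub.register(Node("gpu-b", ["llama3"], active_requests=2))
result = hub.dispatch("llama3", "hello", fake_send)
print(result)  # gpu-b answered: hello
```

Here `gpu-a` is tried first because it has the lowest load, fails, and is marked unhealthy, after which the request is retried on `gpu-b`. The caller only sees the final result, which mirrors the claim that the process is transparent to the client.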

Section 04

Technology Selection: Why Choose .NET

InferHub chose .NET for three reasons:

1. Performance and Efficiency: Asynchronous programming (async/await) manages concurrent connections efficiently.
2. Ecosystem: Rich enterprise-level libraries and mature toolchains make it suitable for long-term maintenance.
3. Cross-Platform Support: It runs on Linux, Windows, and macOS, enabling flexible deployment.

Section 05

Application Scenarios and Core Advantages

Application scenarios include multi-tenant inference services (sharing GPU pools to improve ROI), hybrid cloud deployment (private GPU nodes behind public gateways), edge inference (edge gateways in front of central GPU clusters), and development and testing (local gateways connecting to shared GPUs). The core advantages are a self-hosting-first approach (data privacy and cost control), incremental adoption (Ollama compatibility means no client code has to be rewritten), and a pluggable architecture (room for more backends in the future).

Section 06

Key Deployment Considerations

Deployment considerations:

1. Network: Gateways and nodes need stable, low-latency connections; cross-region deployments need additional optimization.
2. Security: Node authentication, TLS encryption, API key/JWT authentication, access control, and auditing.
3. Monitoring: GPU utilization and memory, request latency and success rate, node health, and the number of failovers.
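As a rough illustration of the gateway-side monitoring signals listed above (request latency, success rate, and failover count), the following Python sketch aggregates them in memory. The `GatewayMetrics` class and its method names are hypothetical. A real deployment would more likely export these values to a monitoring system, and GPU utilization would be collected on the nodes themselves.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class GatewayMetrics:
    """In-memory counters for a single gateway (illustrative only)."""
    latencies_ms: List[float] = field(default_factory=list)
    successes: int = 0
    failures: int = 0
    failovers: int = 0

    def record_request(self, latency_ms: float, ok: bool, failed_over: bool = False) -> None:
        # Record one completed request and whether it needed a failover.
        self.latencies_ms.append(latency_ms)
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        if failed_over:
            self.failovers += 1

    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0

    def mean_latency_ms(self) -> float:
        return mean(self.latencies_ms) if self.latencies_ms else 0.0


metrics = GatewayMetrics()
metrics.record_request(120.0, ok=True)
metrics.record_request(480.0, ok=True, failed_over=True)
metrics.record_request(900.0, ok=False)
metrics.record_request(100.0, ok=True)

print(metrics.success_rate())     # 0.75
print(metrics.mean_latency_ms())  # 400.0
print(metrics.failovers)          # 1
```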

Section 07

Comparison with Similar Projects

How InferHub relates to similar projects:

1. Ollama: InferHub is not a replacement but an enhancement layer that turns standalone Ollama instances into a distributed system.
2. vLLM: vLLM focuses on single-node high performance, while InferHub focuses on multi-node coordination, so the two can complement each other.
3. OpenRouter: OpenRouter is a managed multi-model service, while InferHub is a self-hosted solution. The article positions the former as suitable for prototyping and the latter for production.

Section 08

Future Development Directions and Conclusion

Future directions: support for more backends (vLLM, TensorRT-LLM, etc.), advanced routing strategies (model caching, node selection based on request complexity), auto-scaling, and WebSocket support.

Conclusion: InferHub achieves flexibility and scalability through distributed coordination. It is suited to teams using the .NET tech stack and to enterprises that need self-hosted LLM services, and it gives them a viable option for deploying on their own infrastructure.
