Zing Forum

InferHub: A Self-Hosted LLM Inference Grid System Based on .NET

This article introduces InferHub, a self-hosted large language model (LLM) inference grid system built with .NET, which enables flexible distributed inference deployment through an Ollama-compatible API frontend and a GPU worker node pool.

LLM inference, distributed systems, Ollama, GPU cluster, load balancing, self-hosted, microservice architecture, API gateway
Published 2026-06-12 05:44 | Recent activity 2026-06-12 05:51 | Estimated read 7 min

Section 01

InferHub: Introduction to the .NET-Based Self-Hosted LLM Inference Grid System

InferHub is a self-hosted LLM inference grid system developed by Dev-Art-Solutions and built on .NET. It decouples the Ollama-compatible API gateway from the GPU worker node pool so that inference can be deployed in a distributed way. Its core purpose is to remove the tight coupling between inference services and GPU resources found in traditional LLM deployments. This allows flexible resource reuse and cost optimization, and supports both self-hosted and hybrid deployment scenarios.

Section 02

Project Background and Core Concepts

Traditional LLM deployments tightly couple the inference service to the GPU resources it runs on, so GPU-less environments have to make remote calls, which adds latency and complexity. InferHub uses a grid architecture to decouple the API gateway layer from the inference computing layer, which makes resource deployment flexible: gateways run on low-cost CPU servers, while the inference layer runs on GPU machines. Because InferHub exposes an Ollama-compatible API, it fits into the existing Ollama ecosystem, and users can migrate without modifying client code.
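To make the compatibility claim concrete, here is a minimal sketch of what "no client code changes" could look like in practice. It builds (but does not send) a request to Ollama's `/api/generate` endpoint twice, once against a local Ollama instance and once against a gateway. The gateway hostname, its port, and the model name `llama3` are assumptions for illustration; the article does not state InferHub's actual address or configuration.

```python
import json
import urllib.request

# 11434 is Ollama's default port. The gateway address is hypothetical.
OLLAMA_URL = "http://localhost:11434"
INFERHUB_URL = "http://inferhub.example.internal:11434"  # assumption


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build, without sending, a POST to the Ollama-style /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


direct = build_generate_request(OLLAMA_URL, "llama3", "Why is the sky blue?")
via_hub = build_generate_request(INFERHUB_URL, "llama3", "Why is the sky blue?")

# The request body is identical; only the base URL differs.
assert direct.data == via_hub.data
print(via_hub.full_url)
```

Under these assumptions the only thing a client changes is the base URL, which is usually a configuration value rather than code.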

Section 03

Architecture Design and Working Principles

InferHub uses a three-tier architecture:

1. API Gateway Layer (Hub): Receives requests, handles routing, load balancing, and failover; it is stateless and can be horizontally scaled.
2. Inference Node Layer (Nodes): GPU servers running Ollama, which register with the gateway and report their status.
3. Backend Adaptation Layer: A pluggable design that currently supports Ollama and will expand to vLLM and others in the future.

Workflow: The client sends an Ollama-compatible request → the gateway selects the optimal node → forwards the request → returns the result. The process is transparent to the client.
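The register, select, forward, and fail-over flow described above can be sketched in a few lines. This is an illustrative model only, not InferHub's actual implementation: the `Node`, `NodeRegistry`, and `dispatch` names are hypothetical, and "optimal" is assumed here to mean "healthy node with the fewest active requests".

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    models: set[str]
    active_requests: int = 0
    healthy: bool = True


class NodeRegistry:
    """Hypothetical in-memory view of the nodes that registered with the gateway."""

    def __init__(self) -> None:
        self._nodes: dict[str, Node] = {}

    def register(self, node: Node) -> None:
        self._nodes[node.name] = node

    def candidates(self, model: str) -> list[Node]:
        """Healthy nodes that serve `model`, least busy first."""
        usable = [n for n in self._nodes.values() if n.healthy and model in n.models]
        return sorted(usable, key=lambda n: (n.active_requests, n.name))


def dispatch(registry: NodeRegistry, model: str, send):
    """Try candidates in order; on a connection error, mark the node unhealthy and fail over."""
    for node in registry.candidates(model):
        node.active_requests += 1
        try:
            return send(node)
        except ConnectionError:
            node.healthy = False
        finally:
            node.active_requests -= 1
    raise RuntimeError(f"no healthy node available for model {model!r}")


registry = NodeRegistry()
registry.register(Node("gpu-a", {"llama3"}, active_requests=2))
registry.register(Node("gpu-b", {"llama3"}, active_requests=0))


def fake_send(node: Node) -> str:
    # Stand-in for the real HTTP forward; gpu-b is simulated as unreachable.
    if node.name == "gpu-b":
        raise ConnectionError("node unreachable")
    return f"handled by {node.name}"


result = dispatch(registry, "llama3", fake_send)
print(result)
```

In this run the least-loaded node (`gpu-b`) is tried first, fails, and is marked unhealthy, so the request falls over to `gpu-a`. The client only ever sees the final result.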

Section 04

Technology Selection: Why Choose .NET

InferHub chose .NET for three reasons:

1. Performance and Efficiency: Asynchronous programming (async/await) lets the gateway manage many concurrent connections efficiently.
2. Ecosystem: Rich enterprise-level libraries and mature toolchains make it suitable for long-term maintenance.
3. Cross-Platform Support: It runs on Linux, Windows, and macOS, enabling flexible deployment.
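The value of async/await for a gateway is that a request waiting on a GPU node does not block the others. The sketch below shows the same idea using Python's `asyncio` as an analogue of .NET's async/await; it is not InferHub code, and `forward` is a hypothetical stand-in for forwarding a request to a node.

```python
import asyncio


async def forward(request_id: int, release: asyncio.Event, stats: dict) -> str:
    """Stand-in for forwarding one request to a GPU node and awaiting its reply."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await release.wait()  # simulates waiting on network I/O without blocking
    stats["in_flight"] -= 1
    return f"response-{request_id}"


async def main(n: int):
    release = asyncio.Event()
    stats = {"in_flight": 0, "peak": 0}
    tasks = [asyncio.create_task(forward(i, release, stats)) for i in range(n)]
    await asyncio.sleep(0)  # yield so every task runs up to its first await
    release.set()           # all simulated nodes reply
    responses = await asyncio.gather(*tasks)
    return responses, stats["peak"]


responses, peak = asyncio.run(main(100))
print(len(responses), peak)
```

All 100 simulated requests are in flight at the same time on a single thread, which is the property that makes a stateless gateway cheap to run on CPU-only servers.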

Section 05

Application Scenarios and Core Advantages

Application scenarios include multi-tenant inference services (sharing a GPU pool to improve ROI), hybrid cloud deployment (private GPU nodes with public gateways), edge inference (edge gateways with central GPU clusters), and development and testing (local gateways connecting to shared GPUs).

Core advantages: self-hosting first (data privacy and cost control), incremental adoption (Ollama compatibility means existing client code does not need to be rewritten), and a pluggable architecture (room to support more backends in the future).

Section 06

Key Deployment Considerations

Deployment considerations:

1. Network: Stable and low-latency connections between gateways and nodes are required; cross-region deployments need optimization.
2. Security: Node authentication, TLS encryption, API key/JWT authentication, access control, and auditing.
3. Monitoring: GPU utilization/memory, request latency/success rate, node health, and number of failovers.
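As a rough illustration of the gateway-side monitoring items (request latency, success rate, failover count), the following sketch aggregates them in memory. The `GatewayMetrics` class and its fields are hypothetical; GPU utilization and memory would have to come from the nodes themselves and are not modelled here.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class GatewayMetrics:
    """Hypothetical in-memory counters for requests handled by one gateway."""

    latencies_ms: list[float] = field(default_factory=list)
    successes: int = 0
    failures: int = 0
    failovers: int = 0

    def record(self, latency_ms: float, ok: bool, failed_over: bool = False) -> None:
        self.latencies_ms.append(latency_ms)
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        if failed_over:
            self.failovers += 1

    def snapshot(self) -> dict:
        total = self.successes + self.failures
        return {
            "requests": total,
            "success_rate": self.successes / total if total else 0.0,
            "mean_latency_ms": mean(self.latencies_ms) if self.latencies_ms else 0.0,
            "failovers": self.failovers,
        }


metrics = GatewayMetrics()
metrics.record(120.0, ok=True)
metrics.record(300.0, ok=True, failed_over=True)  # succeeded after failing over
metrics.record(180.0, ok=False)
summary = metrics.snapshot()
print(summary)
```

A real deployment would more likely export such counters to an existing metrics system than keep them in process memory; the sketch only shows which quantities are worth tracking.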

Section 07

Comparison with Similar Projects

How InferHub relates to similar projects:

1. Ollama: InferHub is not a replacement but an enhancement layer that turns standalone Ollama instances into a distributed system.
2. vLLM: vLLM focuses on the performance of the inference engine itself, while InferHub focuses on coordinating multiple nodes; the two can complement each other.
3. OpenRouter: OpenRouter is a managed multi-model service, while InferHub is a self-hosted solution; the article considers the former better suited to prototyping and the latter to production.

Section 08

Future Development Directions and Conclusion

Future directions: Support for more backends (vLLM, TensorRT-LLM, etc.), advanced routing strategies (model caching, node selection based on complexity), auto-scaling, and WebSocket support.

Conclusion: InferHub achieves flexibility and scalability through distributed coordination. It is a viable option for teams using the .NET tech stack and for enterprises that need to run self-hosted LLM services on their own infrastructure.
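One possible reading of the "model caching" routing strategy listed among the future directions is to prefer nodes that already have the requested model loaded, and to use load only as a tie-breaker. The sketch below illustrates that reading; the node fields and the `pick_node` function are hypothetical and do not describe an existing InferHub feature.

```python
def pick_node(nodes: list[dict], model: str) -> dict:
    """Prefer nodes that already have `model` loaded; break ties by current load."""
    if not nodes:
        raise ValueError("no nodes available")
    return min(
        nodes,
        # False sorts before True, so nodes with the model loaded come first.
        key=lambda n: (model not in n["loaded_models"], n["active_requests"]),
    )


nodes = [
    {"name": "gpu-a", "loaded_models": {"mistral"}, "active_requests": 0},
    {"name": "gpu-b", "loaded_models": {"llama3"}, "active_requests": 3},
    {"name": "gpu-c", "loaded_models": {"llama3"}, "active_requests": 1},
]

chosen = pick_node(nodes, "llama3")
print(chosen["name"])
```

Here `gpu-c` is chosen over the idle `gpu-a` because `gpu-a` would first have to load the model. When no node has the model loaded, the function falls back to the least busy node.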