Zing Forum


Hoosh: An AI Inference Gateway Built with Rust, Unifying 14 LLM Providers

A feature-rich Rust AI inference gateway that supports unified routing for 14 LLM providers, local model services, speech-to-text, and token budget management, offering an OpenAI-compatible API and designed for production environments.

Tags: Rust · AI Gateway · LLM Routing · Multi-Provider · Ollama · OpenAI · Token Budget · Production · Load Balancing
Published 2026-03-29 15:05 · Recent activity 2026-03-29 15:25 · Estimated read: 8 min

Section 01

Hoosh: Introduction to the High-Performance AI Inference Gateway Built with Rust

Hoosh is a high-performance AI inference gateway written in Rust, designed to solve problems in AI application development such as switching between multiple LLM providers and balancing cost and performance between local inference and cloud APIs. It supports unified routing and scheduling for 14 LLM providers, covering both local (e.g., Ollama, llama.cpp) and cloud (e.g., OpenAI, Anthropic) resources, provides an OpenAI-compatible API, and has enterprise-grade features like security and observability required for production environments.


Section 02

Project Background and Positioning

In AI application development practice, developers often face challenges such as flexible switching between multiple LLM providers and balancing cost and performance between local and cloud resources. Hoosh is positioned as an infrastructure layer for AI applications; it does not handle model training or file management, but focuses on efficiently and reliably routing and scheduling LLM inference requests. Its design philosophy includes: local-first (prioritizing on-device inference with cloud as backup), hardware-aware (automatically detecting GPU/TPU/NPU to optimize model placement), and production-ready (built-in enterprise features like authentication and rate limiting).


Section 03

Core Capabilities and Supported LLM Providers

Hoosh supports 14 LLM providers, covering the full spectrum of local and cloud resources:

Local Backends

  • Ollama: A popular local LLM runtime solution
  • llama.cpp: High-performance C++ inference engine
  • Synapse: Self-developed inference backend by the author
  • LM Studio: User-friendly local model management tool
  • LocalAI: OpenAI-compatible local API server

Cloud APIs

  • OpenAI, Anthropic, DeepSeek, Mistral, Google
  • Groq, Grok, OpenRouter

Voice Capabilities

  • Whisper: Speech-to-text based on whisper.cpp
  • Piper: Text-to-speech (optional)

Developers can freely combine local and cloud resources via a unified interface, and flexibly schedule based on cost, latency, and privacy requirements.


Section 04

Architecture Design and Key Features

Hoosh adopts a layered and decoupled architecture:

  1. Authentication Layer: Bearer Token authentication (constant-time comparison to prevent timing attacks)
  2. Rate Limiter: Limits traffic by RPM
  3. Router: Selects providers based on priority, round-robin, or lowest latency strategies; supports model pattern matching (e.g., models starting with llama/mistral are routed to Ollama)
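
The model-pattern matching described above can be sketched in plain Rust; the function and provider names here are illustrative assumptions, not Hoosh's actual code:

```rust
// Illustrative prefix-based model routing: model names starting with a
// known local-runtime prefix go to Ollama, everything else falls through
// to a cloud default. A sketch only, not Hoosh's router.
fn route_model(model: &str) -> &'static str {
    const LOCAL_PREFIXES: [&str; 2] = ["llama", "mistral"];
    if LOCAL_PREFIXES.iter().any(|p| model.starts_with(p)) {
        "ollama"
    } else {
        "openai"
    }
}
```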

Token budget management is a core feature: token pools are allocated per proxy, with a reserve/commit/release lifecycle that ensures fair allocation among multiple tenants and prevents any single proxy from exhausting the shared quota.
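
The reserve/commit/release lifecycle can be sketched as follows; `TokenPool` and its method names are assumptions for illustration, not Hoosh's API:

```rust
// Minimal sketch of a per-proxy token pool with a
// reserve -> commit -> release lifecycle.
struct TokenPool {
    capacity: u64,
    reserved: u64,
    used: u64,
}

impl TokenPool {
    fn new(capacity: u64) -> Self {
        Self { capacity, reserved: 0, used: 0 }
    }

    /// Reserve tokens before sending a request; fails if the pool is exhausted.
    fn reserve(&mut self, n: u64) -> bool {
        if self.used + self.reserved + n <= self.capacity {
            self.reserved += n;
            true
        } else {
            false
        }
    }

    /// Commit the tokens actually consumed and drop the rest of the reservation.
    fn commit(&mut self, reserved: u64, actual: u64) {
        self.reserved -= reserved;
        self.used += actual.min(reserved);
    }

    /// Release a reservation without consuming tokens (e.g., the request failed).
    fn release(&mut self, n: u64) {
        self.reserved -= n;
    }
}
```

Reserving before the request and committing only the tokens actually consumed is what keeps one proxy from starving the others mid-flight.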


Section 05

Detailed Enterprise-Grade Features

Hoosh has rich enterprise-grade features:

Security and Authentication

  • Bearer Token authentication (prevents timing attacks)
  • TLS certificate pinning (prevents man-in-the-middle attacks)
  • Mutual TLS authentication for local backends
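
The constant-time comparison mentioned above can be sketched in plain Rust; production code would typically reach for a vetted crate such as `subtle`, and this is not Hoosh's implementation:

```rust
// Illustrative constant-time byte comparison for bearer tokens: XOR every
// byte pair and OR the results, so the loop always runs over the full
// length regardless of where a mismatch occurs.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```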

Observability

  • Prometheus metrics endpoint (exposes latency, throughput, etc.)
  • Optional OpenTelemetry distributed tracing
  • Encrypted audit logs (HMAC-SHA2 integrity protection)
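
The text exposition format behind such a metrics endpoint can be illustrated with a hand-rolled renderer; Hoosh uses the `prometheus` crate, so this is only a sketch of the wire format, not its code:

```rust
// Render a single counter in the Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample itself.
fn render_metric(name: &str, help: &str, value: f64) -> String {
    format!(
        "# HELP {n} {h}\n# TYPE {n} counter\n{n} {v}\n",
        n = name,
        h = help,
        v = value
    )
}
```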

High Availability

  • Periodic health checks to automatically detect provider status
  • Automatic failover to backup providers in case of failures
  • Heartbeat tracking to ensure service continuity
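
Priority-based failover can be sketched as picking the first healthy provider in priority order; the types and names below are illustrative assumptions:

```rust
// A provider entry as the health checker might see it.
struct Provider {
    name: &'static str,
    healthy: bool,
}

// Providers are assumed sorted by priority; the first healthy one wins,
// so an outage automatically fails over to the next in line.
fn pick_provider(providers: &[Provider]) -> Option<&'static str> {
    providers.iter().find(|p| p.healthy).map(|p| p.name)
}
```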

Operations-Friendly

  • Hot reloading of configurations (no restart required)
  • Thread-safe cache (supports TTL)
  • Priority queue for request management
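
A TTL cache can be sketched with the standard library as below; note that this single-threaded version only approximates the thread-safe, dashmap-backed cache described above:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Minimal TTL cache sketch: each entry stores its insertion time, and
// reads past the TTL evict the entry and return nothing.
struct TtlCache {
    ttl: Duration,
    map: HashMap<String, (String, Instant)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, map: HashMap::new() }
    }

    fn insert(&mut self, k: String, v: String) {
        self.map.insert(k, (v, Instant::now()));
    }

    /// Return the value only if it has not outlived the TTL.
    fn get(&mut self, k: &str) -> Option<String> {
        match self.map.get(k) {
            Some((v, t)) if t.elapsed() < self.ttl => Some(v.clone()),
            Some(_) => {
                self.map.remove(k);
                None
            }
            None => None,
        }
    }
}
```

A concurrent version would swap the `HashMap` for a `dashmap::DashMap`, which is the crate the article credits for thread safety.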

Section 06

Usage Methods and Ecosystem Integration

Hoosh supports two usage methods:

  1. Command-line tool: Quickly start the gateway, perform single inference, list available models
  2. HTTP API: Compatible with OpenAI format, enabling seamless migration of existing clients
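
Because the API is OpenAI-compatible, an existing client's request body works unchanged. The helper below is a hypothetical sketch of that body's shape (a real client would build JSON with a library such as serde_json rather than string formatting, and the model name is just an example):

```rust
// Build an OpenAI-style chat completions request body. Sketch only:
// no JSON escaping of the inputs is performed here.
fn chat_request_body(model: &str, prompt: &str) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}]}}"#,
        model, prompt
    )
}
```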

Modular customization: Tailor features via Cargo features (e.g., enable only Ollama+llama.cpp or add voice capabilities) to adapt to scenarios from edge devices to enterprise-level.

Ecosystem integration: Collaborates with projects like AGNOS (system-level gateway), tarang (transcription/content description), AgnosAI (proxy team routing), Synapse (inference backend), forming a modular ecosystem.


Section 07

Technical Stack Highlights and Industry Insights

Hoosh is built on the Rust ecosystem. Highlights of its stack include axum (HTTP server), reqwest (outbound HTTP requests), prometheus (metrics), dashmap (thread-safe concurrent map used for caching), and tokio (asynchronous runtime), which together underpin its performance and safety.

The insight for AI infrastructure: focus on doing one thing well, routing and scheduling LLM requests. Through modular design and flexible composition with other tools, Hoosh fits privacy-sensitive (local-first) and high-availability (multi-provider backup) scenarios, giving AI application teams a reliable option.