Zing Forum


Hoosh: An AI Inference Gateway Built with Rust, Unifying 14 LLM Providers

A feature-rich Rust AI inference gateway that supports unified routing for 14 LLM providers, local model services, speech-to-text, and token budget management, offering an OpenAI-compatible API and designed for production environments.

Tags: Rust · AI Gateway · LLM Routing · Multi-Provider · Ollama · OpenAI · Token Budget · Production · Load Balancing
Published 2026-03-29 15:05 · Recent activity 2026-03-29 15:25 · Estimated read: 8 min

Section 01

Hoosh: Introduction to the High-Performance AI Inference Gateway Built with Rust

Hoosh is a high-performance AI inference gateway written in Rust, designed to solve problems in AI application development such as switching between multiple LLM providers and balancing cost and performance between local inference and cloud APIs. It supports unified routing and scheduling for 14 LLM providers, covering both local (e.g., Ollama, llama.cpp) and cloud (e.g., OpenAI, Anthropic) resources, provides an OpenAI-compatible API, and has enterprise-grade features like security and observability required for production environments.


Section 02

Project Background and Positioning

In AI application development practice, developers often face challenges such as flexible switching between multiple LLM providers and balancing cost and performance between local and cloud resources. Hoosh is positioned as an infrastructure layer for AI applications; it does not handle model training or file management, but focuses on efficiently and reliably routing and scheduling LLM inference requests. Its design philosophy includes: local-first (prioritizing on-device inference with cloud as backup), hardware-aware (automatically detecting GPU/TPU/NPU to optimize model placement), and production-ready (built-in enterprise features like authentication and rate limiting).


Section 03

Core Capabilities and Supported LLM Providers

Hoosh supports 14 LLM providers, covering the full spectrum of local and cloud resources:

Local Backends

  • Ollama: A popular local LLM runtime solution
  • llama.cpp: High-performance C++ inference engine
  • Synapse: Self-developed inference backend by the author
  • LM Studio: User-friendly local model management tool
  • LocalAI: OpenAI-compatible local API server

Cloud APIs

  • OpenAI, Anthropic, DeepSeek, Mistral, Google
  • Groq, Grok, OpenRouter

Voice Capabilities

  • Whisper: Speech-to-text based on whisper.cpp
  • Piper: Text-to-speech (optional)

Developers can freely combine local and cloud resources via a unified interface, and flexibly schedule based on cost, latency, and privacy requirements.


Section 04

Architecture Design and Key Features

Hoosh adopts a layered and decoupled architecture:

  1. Authentication Layer: Bearer Token authentication (constant-time comparison to prevent timing attacks)
  2. Rate Limiter: Limits traffic by RPM
  3. Router: Selects providers based on priority, round-robin, or lowest latency strategies; supports model pattern matching (e.g., models starting with llama/mistral are routed to Ollama)
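
The model-pattern matching described above can be sketched in plain Rust; the function and provider names here are illustrative assumptions, not Hoosh's actual code:

```rust
// Illustrative prefix-based model routing: model names starting with a
// known local-runtime prefix go to Ollama, everything else falls through
// to a cloud default. A sketch only, not Hoosh's router.
fn route_model(model: &str) -> &'static str {
    const LOCAL_PREFIXES: [&str; 2] = ["llama", "mistral"];
    if LOCAL_PREFIXES.iter().any(|p| model.starts_with(p)) {
        "ollama"
    } else {
        "openai"
    }
}
```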

Token budget management is a core feature: token pools are allocated per proxy, with a reserve/commit/release lifecycle that ensures fair allocation among multiple tenants and prevents any single proxy from exhausting the shared quota.
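
The reserve/commit/release lifecycle can be sketched as follows; `TokenPool` and its method names are assumptions for illustration, not Hoosh's API:

```rust
// Minimal sketch of a per-proxy token pool with a
// reserve -> commit -> release lifecycle.
struct TokenPool {
    capacity: u64,
    reserved: u64,
    used: u64,
}

impl TokenPool {
    fn new(capacity: u64) -> Self {
        Self { capacity, reserved: 0, used: 0 }
    }

    /// Reserve tokens before sending a request; fails if the pool is exhausted.
    fn reserve(&mut self, n: u64) -> bool {
        if self.used + self.reserved + n <= self.capacity {
            self.reserved += n;
            true
        } else {
            false
        }
    }

    /// Commit the tokens actually consumed and drop the rest of the reservation.
    fn commit(&mut self, reserved: u64, actual: u64) {
        self.reserved -= reserved;
        self.used += actual.min(reserved);
    }

    /// Release a reservation without consuming tokens (e.g., the request failed).
    fn release(&mut self, n: u64) {
        self.reserved -= n;
    }
}
```

Reserving before the request and committing only the tokens actually consumed is what keeps one proxy from starving the others mid-flight.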


Section 05

Detailed Enterprise-Grade Features

Hoosh has rich enterprise-grade features:

Security and Authentication

  • Bearer Token authentication (prevents timing attacks)
  • TLS certificate pinning (prevents man-in-the-middle attacks)
  • Mutual TLS authentication for local backends
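
The constant-time comparison mentioned above can be sketched in plain Rust; production code would typically reach for a vetted crate such as `subtle`, and this is not Hoosh's implementation:

```rust
// Illustrative constant-time byte comparison for bearer tokens: XOR every
// byte pair and OR the results, so the loop always runs over the full
// length regardless of where a mismatch occurs.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```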

Observability

  • Prometheus metrics endpoint (exposes latency, throughput, etc.)
  • Optional OpenTelemetry distributed tracing
  • Encrypted audit logs (HMAC-SHA2 integrity protection)
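
The text exposition format behind such a metrics endpoint can be illustrated with a hand-rolled renderer; Hoosh uses the `prometheus` crate, so this is only a sketch of the wire format, not its code:

```rust
// Render a single counter in the Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample itself.
fn render_metric(name: &str, help: &str, value: f64) -> String {
    format!(
        "# HELP {n} {h}\n# TYPE {n} counter\n{n} {v}\n",
        n = name,
        h = help,
        v = value
    )
}
```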

High Availability

  • Periodic health checks to automatically detect provider status
  • Automatic failover to backup providers in case of failures
  • Heartbeat tracking to ensure service continuity
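
Priority-based failover can be sketched as picking the first healthy provider in priority order; the types and names below are illustrative assumptions:

```rust
// A provider entry as the health checker might see it.
struct Provider {
    name: &'static str,
    healthy: bool,
}

// Providers are assumed sorted by priority; the first healthy one wins,
// so an outage automatically fails over to the next in line.
fn pick_provider(providers: &[Provider]) -> Option<&'static str> {
    providers.iter().find(|p| p.healthy).map(|p| p.name)
}
```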

Operations-Friendly

  • Hot reloading of configurations (no restart required)
  • Thread-safe cache (supports TTL)
  • Priority queue for request management
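
A TTL cache can be sketched with the standard library as below; note that this single-threaded version only approximates the thread-safe, dashmap-backed cache described above:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Minimal TTL cache sketch: each entry stores its insertion time, and
// reads past the TTL evict the entry and return nothing.
struct TtlCache {
    ttl: Duration,
    map: HashMap<String, (String, Instant)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, map: HashMap::new() }
    }

    fn insert(&mut self, k: String, v: String) {
        self.map.insert(k, (v, Instant::now()));
    }

    /// Return the value only if it has not outlived the TTL.
    fn get(&mut self, k: &str) -> Option<String> {
        match self.map.get(k) {
            Some((v, t)) if t.elapsed() < self.ttl => Some(v.clone()),
            Some(_) => {
                self.map.remove(k);
                None
            }
            None => None,
        }
    }
}
```

A concurrent version would swap the `HashMap` for a `dashmap::DashMap`, which is the crate the article credits for thread safety.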

Section 06

Usage Methods and Ecosystem Integration

Hoosh supports two usage methods:

  1. Command-line tool: Quickly start the gateway, perform single inference, list available models
  2. HTTP API: Compatible with OpenAI format, enabling seamless migration of existing clients
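
Because the API is OpenAI-compatible, an existing client's request body works unchanged. The helper below is a hypothetical sketch of that body's shape (a real client would build JSON with a library such as serde_json rather than string formatting, and the model name is just an example):

```rust
// Build an OpenAI-style chat completions request body. Sketch only:
// no JSON escaping of the inputs is performed here.
fn chat_request_body(model: &str, prompt: &str) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}]}}"#,
        model, prompt
    )
}
```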

Modular customization: Tailor features via Cargo features (e.g., enable only Ollama+llama.cpp or add voice capabilities) to adapt to scenarios from edge devices to enterprise-level.

Ecosystem integration: Collaborates with projects like AGNOS (system-level gateway), tarang (transcription/content description), AgnosAI (proxy team routing), Synapse (inference backend), forming a modular ecosystem.


Section 07

Technical Stack Highlights and Industry Insights

Hoosh is built on the Rust ecosystem. Highlights of its stack include axum (HTTP server), reqwest (outbound HTTP requests), prometheus (metrics), dashmap (thread-safe concurrent map used for caching), and tokio (asynchronous runtime), which together underpin its performance and safety.

The insight for AI infrastructure: focus on doing one thing well, routing and scheduling LLM requests. Through modular design and flexible composition with other tools, Hoosh fits privacy-sensitive (local-first) and high-availability (multi-provider backup) scenarios, giving AI application teams a reliable option.