Zing Forum


ModelRelay: A Reverse Connection Proxy Solution for Private LLM Deployment

ModelRelay uses a reverse WebSocket connection mode, allowing GPU worker nodes to actively connect to the central proxy. It solves issues like port exposure, insufficient load balancing, and complex configuration in traditional LLM deployments, supporting streaming transmission, request queuing, and end-to-end cancellation.

Tags: LLM proxy · GPU · WebSocket · private deployment · load balancing · streaming
Published 2026-04-04 17:12 · Recent activity 2026-04-04 17:24 · Estimated read: 6 min

Section 01

ModelRelay: Introduction to the Reverse Connection Proxy Solution for Private LLM Deployment

ModelRelay uses a reverse WebSocket connection mode to solve issues like port exposure, insufficient load balancing, and complex configuration in traditional private LLM deployments. It supports streaming transmission, request queuing, and end-to-end cancellation, enabling efficient management of GPU resources distributed across different network environments.


Section 02

Dilemmas of Traditional LLM Deployment Modes

Traditional private LLM deployment has three major pain points:

  1. Direct port exposure: requires port forwarding, DNS, and firewall configuration; offers no high availability, request queuing, or cancellation;
  2. Traditional load balancers (e.g., nginx/HAProxy): unaware of LLM streaming semantics; no support for request queuing, worker node authentication, or cancellation propagation;
  3. Cloud routing services (e.g., LiteLLM/OpenRouter): cloud-first architectures, a poor fit for the 'home connection' scenario of privately hosted hardware.

Section 03

Reverse Connection Architecture Design of ModelRelay

The core innovation of ModelRelay lies in the reverse connection mode: The central proxy server (modelrelay-server) receives client requests, and GPU worker nodes (modelrelay-worker) actively connect to the proxy via WebSocket. Under this architecture, GPU servers do not need to open inbound ports—they can join the cluster as long as they can access the proxy. Flow: Client request → Central proxy ← WebSocket ← GPU worker node.
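The essence of the reverse-connection pattern is the direction of connection establishment, independent of ModelRelay's actual WebSocket protocol. A minimal sketch using plain TCP sockets from the Python standard library (port number and message contents are illustrative, not part of ModelRelay):

```python
import socket
import threading

# Reverse connection demo: the proxy listens, the worker dials OUT.
# The worker machine needs no open inbound ports. Port is arbitrary.
PROXY_WORKER_PORT = 9901
received = []
ready = threading.Event()

def proxy():
    # Accept one worker registration, then push a client request
    # down the worker's own outbound connection.
    with socket.create_server(("127.0.0.1", PROXY_WORKER_PORT)) as srv:
        ready.set()                                # listening; worker may dial
        worker_conn, _ = srv.accept()              # worker connected to us
        with worker_conn:
            worker_conn.sendall(b"POST /v1/completions")
            received.append(worker_conn.recv(1024))  # reply flows back

def worker():
    # Outbound-only connection: works from behind NAT and firewalls.
    ready.wait()
    with socket.create_connection(("127.0.0.1", PROXY_WORKER_PORT)) as conn:
        request = conn.recv(1024)                  # proxy pushes work to us
        conn.sendall(b"completion for: " + request)

t = threading.Thread(target=proxy)
t.start()
worker()
t.join()
print(received[0].decode())  # prints "completion for: POST /v1/completions"
```

Note that the request still flows proxy-to-worker; only the TCP handshake is reversed, which is why the GPU host can sit behind NAT.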


Section 04

Core Functional Features of ModelRelay

ModelRelay provides professional features for LLM inference scenarios:

  • Request queue management: when all workers are busy, requests wait in a queue with configurable timeouts;
  • Streaming pass-through: SSE chunks are forwarded in order, preserving the real-time interaction experience;
  • End-to-end cancellation propagation: a client disconnect is relayed to the backend, so no GPU time is wasted on abandoned requests;
  • Automatic re-queueing: in-flight requests re-enter the queue if a worker node crashes;
  • Heartbeat and load tracking: monitors node health and load and routes requests accordingly.
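The cancellation-propagation idea above can be sketched with a shared flag that the client-facing side sets on disconnect and the generation loop checks between chunks. This is a conceptual illustration, not ModelRelay's real API; the names, timings, and chunk logic are made up:

```python
import threading
import time

def generate_stream(cancel: threading.Event, chunks):
    """Forward chunks until done or until the client's cancel flag is set."""
    produced = []
    for chunk in chunks:
        if cancel.is_set():          # client went away: stop wasting GPU time
            break
        produced.append(chunk)       # stand-in for forwarding one SSE block
    return produced

cancel = threading.Event()

# Simulate a client that disconnects shortly after the request starts.
threading.Timer(0.05, cancel.set).start()

def slow_chunks():
    for i in range(100):
        time.sleep(0.01)             # stand-in for per-token latency
        yield f"chunk-{i}"

out = generate_stream(cancel, slow_chunks())
print(len(out) < 100)                # prints True: generation stopped early
```

Without the flag, the loop would run all 100 iterations for a client that is no longer listening, which is exactly the GPU waste the feature avoids.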

Section 05

Deployment and Usage Methods of ModelRelay

ModelRelay supports multiple deployment methods:

  1. Docker deployment: pull the image and start the proxy server and worker nodes (see the original article for example commands);
  2. Native binaries: download from Releases or install via Cargo (cargo install modelrelay-server modelrelay-worker).

Section 06

Configuration and Tuning of ModelRelay

Proxy server configuration covers:

  • Listen address;
  • Worker node authentication key;
  • Queue depth (--max-queue-len);
  • Queue timeout (--queue-timeout);
  • Request timeout (--request-timeout).

Worker node configuration covers:

  • Proxy URL and authentication key;
  • Backend service address;
  • List of supported models;
  • Concurrent request limit (--max-concurrency), which controls GPU memory use.
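How the queue knobs interact can be sketched with Python's stdlib queue (this is a model of the admission behavior, not ModelRelay's implementation; the values and the rejection response are assumptions):

```python
import queue

MAX_QUEUE_LEN = 2      # stand-in for --max-queue-len
QUEUE_TIMEOUT = 0.05   # stand-in for --queue-timeout, in seconds

pending = queue.Queue(maxsize=MAX_QUEUE_LEN)

def admit(request):
    """Enqueue a request, or reject immediately when the queue is full."""
    try:
        pending.put_nowait(request)
        return "queued"
    except queue.Full:
        return "rejected"        # a real proxy might answer 429/503 here

print(admit("req-1"))  # queued
print(admit("req-2"))  # queued
print(admit("req-3"))  # rejected: queue depth is 2

# A queued request waits at most QUEUE_TIMEOUT for a free worker;
# an empty queue times out instead of blocking forever.
try:
    job = pending.get(timeout=QUEUE_TIMEOUT)
    print("dispatched", job)     # dispatched req-1
except queue.Empty:
    print("queue-timeout")
```

Tuning follows from this picture: a deeper queue absorbs bursts at the cost of latency, while shorter queue/request timeouts fail fast instead of letting clients wait on a saturated cluster.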


Section 07

Applicable Scenarios of ModelRelay

ModelRelay is suitable for three types of users:

  1. Home GPU users: run local models on several home machines and reach them through one unified API, with no complex network configuration;
  2. Teams: pool local GPU servers and simplify operations and maintenance;
  3. Researchers: flexibly schedule models across heterogeneous hardware without touching client configurations.

Section 08

Value Summary of ModelRelay

ModelRelay's reverse connection architecture resolves the pain points of private LLM deployment, simplifies network configuration, and adds inference-grade features (queuing, streaming, cancellation, and more). Being open source and actively maintained, it is well positioned for the long term, making it a practical tool for getting the most out of GPU resources in private environments.