LLM Inference Gateway: A Multi-Workload Routing Scheme Based on Consistent Hashing

An LLM inference gateway built with Python and FastAPI. It includes an LLM-driven PR review agent and uses consistent hashing to route requests across multiple Ollama worker nodes, so that each session keeps hitting the node whose KV cache is already warm.

Tags: LLM, inference gateway, consistent hashing, FastAPI, Ollama, PR review, KV cache optimization
Published 2026-06-12 07:44 | Recent activity 2026-06-12 07:50 | Estimated read 8 min

Section 01

LLM Inference Gateway Project Guide: PR Review and Multi-Node Routing Optimization Scheme

Core Overview of the LLM Inference Gateway Project

This project, llm-inference-gateway, is maintained by Suraj-1207 on GitHub and is a complete LLM application stack built with Python and FastAPI. It has two goals: to bring large language model capabilities into software development workflows such as PR review, and to solve request routing across multiple Ollama worker nodes. Consistent hashing keeps each session on the same node, so that node's KV cache stays warm and inference performance improves.

The project includes a PR review agent built from modular components (GitHub data fetching, PR summary generation, ReAct agent review, LLM-as-Judge evaluation, etc.) and supports a hybrid architecture that uses both local and cloud models.


Section 02

Project Background: Challenges in PR Review and Multi-Node Inference

Project Background and Challenges

PR review is a key step in software development, but manual review is time-consuming and prone to omissions. Integrating LLMs into PR workflows can automate summary generation and code review, but it raises two major challenges:

  1. Multi-node routing: A traditional round-robin strategy scatters requests from the same session across nodes, so the KV cache built on one node cannot be reused by the next request and the prompt prefix has to be recomputed;
  2. LLM capability integration: The system needs to balance the cost advantage of local models (e.g., Ollama) against the stronger complex-task handling of cloud APIs (e.g., Groq).

This project aims to build an efficient LLM inference gateway and PR review system to solve the above problems.


Section 03

Core Component Analysis: From PR Data to Intelligent Review

Detailed Explanation of Core Components

The project adopts a modular design, with key components including:

  1. GitHub Data Fetching Layer: github_fetcher.py interacts with the GitHub API to obtain raw materials such as PR metadata and diffs;
  2. PR Summary Generation: summarise_pr.py uses local Ollama models to generate PR summaries without external APIs;
  3. ReAct Agent Review: agent.py implements the ReAct loop, analyzes code changes, and publishes review comments via the GitHub API (depends on the Groq API);
  4. Evaluation Framework: eval.py uses the LLM-as-Judge mode to score the quality of generated content to support iterative optimization.
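
To make the summary component more concrete, here is a minimal sketch of how a PR summary could be requested from a local Ollama server. It is not the project's summarise_pr.py; the function names, the prompt wording, and the `max_diff_chars` truncation are assumptions made for illustration. The only external interface it relies on is Ollama's `/api/generate` endpoint with `stream` set to false.

```python
import json
import urllib.request

# Ollama's default local address; adjust if a worker listens elsewhere.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_summary_request(title, diff, model="llama3.2", max_diff_chars=8000):
    """Build (but do not send) a summary request. Hypothetical helper."""
    # Truncate the diff so the prompt stays within a small model's context.
    prompt = (
        "Summarise this pull request in three bullet points.\n\n"
        f"Title: {title}\n\nDiff:\n{diff[:max_diff_chars]}"
    )
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarise(title, diff):
    """Send the request; needs a running Ollama server with the model pulled."""
    req = build_summary_request(title, diff)
    with urllib.request.urlopen(req, timeout=120) as resp:
        # With stream=False, Ollama returns one JSON object with a "response" field.
        return json.loads(resp.read())["response"]
```

Separating request construction from sending keeps the prompt logic testable without a running model.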

Section 04

Core of Inference Gateway: Consistent Hashing Routing Mechanism

Implementation of Consistent Hashing Routing

The inference gateway (gateway.py) solves multi-node routing issues via consistent hashing:

  • Routing Rules: The X-Session-ID header identifies a session; requests from the same session are routed to the same Ollama node, and the gateway falls back to round-robin when no session ID is present;
  • Algorithm Details: The hash ring is built with MD5 hashing, and each node is placed on the ring as 150 virtual nodes. The virtual nodes even out the load between workers, while consistent hashing itself ensures that only the sessions owned by an added or removed node are reassigned;
  • Monitoring Capability: Exposes the Prometheus /metrics endpoint, which can monitor request latency, error rate, node load, and other metrics.
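
The routing rules above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code in gateway.py: the class names `HashRing` and `Router`, the `node#i` virtual-node key format, and the method names are hypothetical. It does follow the description in using MD5, 150 virtual nodes per worker, and a round-robin fallback when no session ID is given.

```python
import bisect
import hashlib
import itertools


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=150):
        self.vnodes = vnodes
        self._ring = {}  # ring position -> physical node
        self._keys = []  # sorted ring positions
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        # MD5 is used only to spread keys around the ring, not for security.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node appears at `vnodes` positions on the ring.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self._ring[pos] = node
            bisect.insort(self._keys, pos)

    def remove_node(self, node):
        self._ring = {p: n for p, n in self._ring.items() if n != node}
        self._keys = sorted(self._ring)

    def get_node(self, key):
        if not self._keys:
            raise RuntimeError("no worker nodes registered")
        # First virtual node clockwise from the key, wrapping past the end.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[self._keys[idx]]


class Router:
    """Session-affine routing with a round-robin fallback."""

    def __init__(self, nodes):
        self.ring = HashRing(nodes)
        self._round_robin = itertools.cycle(list(nodes))

    def pick(self, session_id=None):
        if session_id:
            return self.ring.get_node(session_id)
        return next(self._round_robin)
```

In a FastAPI handler, `session_id` would typically come from `request.headers.get("X-Session-ID")`. Removing a worker with `remove_node` reassigns only the sessions that worker owned; every other session keeps its node and its warm cache.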

Section 05

Deployment and Usage Scenario Demonstration

Deployment and Usage

Environment Requirements

  • Groq API Key (used by agent/eval components);
  • GitHub Token (repo read permission);
  • Ollama (running locally, with the llama3.2 model pulled).

Startup Steps

  1. Start multiple Ollama nodes: OLLAMA_HOST=127.0.0.1:11434 ollama serve & (repeat with a different port for each additional node);
  2. Start the gateway: python gateway.py, check status via curl http://localhost:8000/health.
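
Starting several workers by hand gets repetitive, so the first step can also be scripted. The sketch below is an assumption about how one might do that, not part of the project: `worker_env` and `start_workers` are hypothetical helpers. It relies only on the `OLLAMA_HOST` environment variable shown in the startup command and on the `ollama` binary being on the PATH.

```python
import os
import subprocess


def worker_env(port, host="127.0.0.1"):
    """Environment for one Ollama worker bound to host:port."""
    env = dict(os.environ)
    env["OLLAMA_HOST"] = f"{host}:{port}"
    return env


def start_workers(ports):
    """Launch one `ollama serve` process per port and return the handles."""
    return [
        subprocess.Popen(["ollama", "serve"], env=worker_env(port))
        for port in ports
    ]
```

For example, `start_workers([11434, 11435, 11436])` would launch three workers; call `terminate()` on each returned process to stop them.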

Usage Examples

  • PR Summary: python summarise_pr.py psf/requests 6710 $GITHUB_TOKEN;
  • Auto Review: python agent.py psf/requests 6710 $GITHUB_TOKEN;
  • Performance Benchmark: python benchmark.py (compare latency between round-robin and consistent hashing).
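
Session affinity only helps if the client sends the same X-Session-ID on every turn. The following sketch shows one way a client might do that. The gateway route `/api/generate` on port 8000 is an assumption (the source only documents `/health`), and `build_gateway_request` is a hypothetical helper.

```python
import json
import urllib.request
import uuid

# Hypothetical route: assumes the gateway proxies Ollama's generate endpoint.
GATEWAY_URL = "http://localhost:8000/api/generate"


def build_gateway_request(prompt, session_id, model="llama3.2"):
    """Build a request that carries the session ID used for routing."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Session-ID": session_id,
        },
        method="POST",
    )


# Reuse one ID for every turn of a conversation so all turns reach the same node.
session_id = str(uuid.uuid4())
turns = [
    build_gateway_request(prompt, session_id)
    for prompt in ("Summarise the diff.", "Now list the risky changes.")
]
```

Each request in `turns` can be sent with `urllib.request.urlopen`; omitting the header would make the gateway fall back to round-robin.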

Section 06

Technical Highlights and Value Insights

Technical Highlights

  1. Hybrid Architecture: Local Ollama handles lightweight tasks (e.g., summaries) to reduce costs, while cloud Groq handles complex tasks (e.g., reviews) to ensure quality;
  2. Session Affinity: Consistent hashing keeps requests from the same session on the same node, so the KV cache can be reused and the shared prompt prefix does not have to be recomputed; benchmark.py compares the resulting latency against round-robin;
  3. Observability: Built-in Prometheus metrics follow cloud-native conventions and simplify operations and performance tuning.

Section 07

Limitations and Improvement Directions

Limitations and Optimization Suggestions

The project is currently a functional prototype and needs further work before production use:

  • Health Check: Add automatic node failure removal and recovery mechanisms;
  • Dynamic Scaling: Adjust the number of worker nodes automatically based on load;
  • Cache Strategy: Support partial KV cache sharing and cache preheating;
  • Request Priority: Distinguish between real-time and batch requests to optimize resource allocation.
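
As an illustration of the health-check suggestion, the sketch below tracks consecutive failures per node and reports which nodes are still eligible for routing. It is one possible design, not something the project implements: the `HealthTracker` class, its method names, and the threshold of three failures are all assumptions.

```python
class HealthTracker:
    """Mark a node unhealthy after repeated failures; recover on success."""

    def __init__(self, nodes, max_failures=3):
        self.max_failures = max_failures
        self._failures = {node: 0 for node in nodes}

    def record(self, node, ok):
        # A single success resets the counter, so a recovered node rejoins.
        self._failures[node] = 0 if ok else self._failures[node] + 1

    def healthy_nodes(self):
        return [
            node
            for node, failures in self._failures.items()
            if failures < self.max_failures
        ]
```

A gateway could call `record` after each proxied request or periodic probe, and rebuild its hash ring from `healthy_nodes()` whenever the list changes. With consistent hashing, only the sessions of the failed node would be reassigned.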

Although it is a prototype, the project is a useful reference for building production-grade LLM inference gateways, and its consistent hashing routing scheme is worth studying.