# LLM Inference Gateway: A Multi-Workload Routing Scheme Based on Consistent Hashing

> An LLM inference gateway built with Python and FastAPI. It includes a PR review agent and uses consistent hashing to route requests across multiple Ollama worker nodes, so that each session's KV cache stays warm.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-11T23:44:43.000Z
- Last activity: 2026-06-11T23:50:32.411Z
- Popularity: 146.9
- Keywords: LLM inference gateway, consistent hashing, FastAPI, Ollama, PR review, KV cache optimization
- Page link: https://www.zingnex.cn/en/forum/thread/llm-900a7990
- Canonical: https://www.zingnex.cn/forum/thread/llm-900a7990
- Markdown source: floors_fallback

---

## LLM Inference Gateway Project Guide: PR Review and Multi-Node Routing Optimization Scheme

### Core Overview of the LLM Inference Gateway Project
This project is maintained by Suraj-1207 on GitHub (project name: llm-inference-gateway) and is a complete LLM application stack built with Python and FastAPI. It has two goals: to bring large language model capabilities into the software development process (PR review in particular), and to solve request routing in environments with multiple Ollama worker nodes. It uses consistent hashing to keep each session on the same node, which keeps that node's KV cache warm and improves inference performance.

The project includes a PR review agent, is split into modular components (GitHub data fetching, PR summary generation, ReAct agent review, LLM-as-Judge evaluation, and others), and supports a hybrid architecture that uses both local and cloud models.

## Project Background: Challenges in PR Review and Multi-Node Inference

### Project Background and Challenges
PR review is a key step in software development, but manual review is time-consuming and prone to omissions. Integrating LLMs into PR workflows can automate summary generation and code review, but it raises two major challenges:
1. **Multi-node routing**: A plain round-robin strategy scatters requests from the same session across nodes, so the KV cache a node built for that session cannot be reused and the shared prompt prefix has to be processed again;
2. **LLM capability integration**: The system must balance the cost advantage of local models (e.g., via Ollama) against the stronger complex-task capability of cloud APIs (e.g., Groq).

This project aims to build an efficient LLM inference gateway and PR review system to solve the above problems.

## Core Component Analysis: From PR Data to Intelligent Review

### Detailed Explanation of Core Components
The project adopts a modular design, with key components including:
1. **GitHub Data Fetching Layer**: `github_fetcher.py` calls the GitHub API to fetch the raw inputs, such as PR metadata and diffs;
2. **PR Summary Generation**: `summarise_pr.py` uses a local Ollama model to generate PR summaries without any external LLM API;
3. **ReAct Agent Review**: `agent.py` implements a ReAct loop that analyzes code changes and posts review comments via the GitHub API (it depends on the Groq API);
4. **Evaluation Framework**: `eval.py` uses the LLM-as-Judge pattern to score the quality of generated content, which supports iterative improvement.
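To make the ReAct idea in item 3 concrete, here is a minimal sketch of a reason/act/observe loop. It is not the code in `agent.py`: the `Action: tool[input]` and `Final Answer:` text format, the `get_diff` tool, and the scripted stand-in for the model are all assumptions made for illustration. A real agent would call an LLM API where `llm` is invoked.

```python
import re
from typing import Callable, Dict, List

# Hypothetical text protocol between the loop and the model.
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
) -> str:
    """Alternate model replies and tool observations until a final answer."""
    transcript: List[str] = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript))
        transcript.append(reply)

        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()

        action = ACTION_RE.search(reply)
        if action is None:
            transcript.append("Observation: no valid action found.")
            continue

        name, arg = action.group(1), action.group(2)
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        transcript.append(f"Observation: {observation}")
    return "No final answer within the step budget."


def make_scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model: returns canned replies in order."""
    it = iter(replies)
    return lambda _prompt: next(it)


# Hypothetical tool that would normally return a PR diff.
tools = {"get_diff": lambda path: f"- timeout = 5\n+ timeout = 30  ({path})"}

llm = make_scripted_llm([
    "Thought: I need the diff.\nAction: get_diff[settings.py]",
    "Thought: The timeout changed.\nFinal Answer: Explain why the timeout was raised to 30.",
])

review = react_loop(llm, tools, "Review the pull request")
print(review)
```

The step budget (`max_steps`) is a design choice worth keeping in any real loop, since a model that never emits a final answer would otherwise run indefinitely.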

## Core of Inference Gateway: Consistent Hashing Routing Mechanism

### Implementation of Consistent Hashing Routing
The inference gateway (`gateway.py`) solves multi-node routing issues via consistent hashing:
- **Routing Rules**: The `X-Session-ID` header identifies a session, and requests from the same session are routed to the same Ollama node. When no session ID is present, the gateway falls back to round-robin;
- **Algorithm Details**: The hash ring is built with MD5 hashing, and each node is given 150 virtual nodes so that load is spread evenly and only a small fraction of sessions are remapped when nodes are added or removed;
- **Monitoring Capability**: A Prometheus `/metrics` endpoint exposes request latency, error rate, node load, and other metrics.
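The routing rules and ring construction described above can be sketched as follows. This is a minimal illustration, not the contents of `gateway.py`: the class names `HashRing` and `Router`, the `node#i` virtual-node key format, and the worker URLs are assumptions. Only the MD5 hash, the 150 virtual nodes per worker, and the round-robin fallback come from the description.

```python
import bisect
import hashlib
import itertools
from typing import Dict, List, Optional


def md5_point(key: str) -> int:
    """Map a string to an integer position on the ring using MD5."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with a fixed number of virtual nodes per worker."""

    def __init__(self, nodes: List[str], vnodes: int = 150) -> None:
        self.vnodes = vnodes
        self._points: List[int] = []      # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> worker
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = md5_point(f"{node}#{i}")  # assumed virtual-node key format
            if point in self._owner:
                continue  # skip duplicates (re-added node or hash collision)
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        """Return the owner of the first ring position clockwise from the key."""
        if not self._points:
            raise RuntimeError("hash ring is empty")
        idx = bisect.bisect_right(self._points, md5_point(key)) % len(self._points)
        return self._owner[self._points[idx]]


class Router:
    """Session ID -> consistent hashing; no session ID -> round-robin."""

    def __init__(self, nodes: List[str], vnodes: int = 150) -> None:
        self.ring = HashRing(nodes, vnodes)
        self._round_robin = itertools.cycle(nodes)

    def pick(self, session_id: Optional[str]) -> str:
        if session_id:
            return self.ring.get_node(session_id)
        return next(self._round_robin)


# Hypothetical worker addresses.
nodes = ["http://127.0.0.1:11434", "http://127.0.0.1:11435", "http://127.0.0.1:11436"]
router = Router(nodes)
print(router.pick("session-42"))  # same node on every call for this session
print(router.pick(None))          # rotates through the nodes
```

In a FastAPI handler, the value passed to `pick` would be read from the `X-Session-ID` request header. When a worker is removed from the ring, only the sessions that were mapped to that worker move; all other sessions keep their node.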

## Deployment and Usage Scenario Demonstration

### Deployment and Usage
#### Environment Requirements
- Groq API key (used by the agent and eval components);
- GitHub token (with repo read permission);
- Ollama (running locally, with the `llama3.2` model pulled).

#### Startup Steps
1. Start multiple Ollama nodes: `OLLAMA_HOST=127.0.0.1:11434 ollama serve &` (repeat with a different port for each additional node);
2. Start the gateway with `python gateway.py`, then check its status with `curl http://localhost:8000/health`.

#### Usage Examples
- PR Summary: `python summarise_pr.py psf/requests 6710 $GITHUB_TOKEN`;
- Auto Review: `python agent.py psf/requests 6710 $GITHUB_TOKEN`;
- Performance Benchmark: `python benchmark.py` (compares latency under round-robin and consistent hashing routing).
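The benchmark above measures real latency against running nodes. As a rough offline proxy, the sketch below simulates interleaved traffic and counts how often a request lands on the same node as the previous request from its session, which is the condition under which a node's KV cache could be reused. It is not `benchmark.py`: the trace shape, the node names, and the "affinity rate" metric are assumptions chosen for illustration.

```python
import bisect
import hashlib
import itertools
import random
from typing import Callable, Dict, List, Tuple


def md5_point(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes: List[str], vnodes: int = 150) -> Tuple[List[int], List[str]]:
    """Return sorted ring positions and the owner of each position."""
    pairs = sorted((md5_point(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    return [p for p, _ in pairs], [n for _, n in pairs]


def hash_router(nodes: List[str]) -> Callable[[str], str]:
    points, owners = build_ring(nodes)

    def route(session_id: str) -> str:
        idx = bisect.bisect_right(points, md5_point(session_id)) % len(points)
        return owners[idx]

    return route


def round_robin_router(nodes: List[str]) -> Callable[[str], str]:
    cycle = itertools.cycle(nodes)
    return lambda _session_id: next(cycle)


def affinity_rate(route: Callable[[str], str], trace: List[str]) -> float:
    """Fraction of follow-up requests that hit the session's previous node."""
    last_node: Dict[str, str] = {}
    follow_ups = hits = 0
    for session_id in trace:
        node = route(session_id)
        if session_id in last_node:
            follow_ups += 1
            if node == last_node[session_id]:
                hits += 1
        last_node[session_id] = node
    return hits / follow_ups if follow_ups else 0.0


def make_trace(sessions: int = 50, per_session: int = 20, seed: int = 0) -> List[str]:
    """Interleave requests from many sessions in a reproducible random order."""
    trace = [f"session-{s}" for s in range(sessions) for _ in range(per_session)]
    random.Random(seed).shuffle(trace)
    return trace


NODES = ["node-a", "node-b", "node-c"]  # hypothetical worker names
trace = make_trace()
rr_rate = affinity_rate(round_robin_router(NODES), trace)
ch_rate = affinity_rate(hash_router(NODES), trace)
print(f"round-robin affinity:     {rr_rate:.2f}")
print(f"consistent-hash affinity: {ch_rate:.2f}")
```

With a fixed set of nodes, consistent hashing sends every follow-up request to the session's previous node, while round-robin does so only by chance. How much latency that saves depends on prompt length and model, which is what a real benchmark has to measure.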

## Technical Highlights and Value Insights

### Technical Highlights
1. **Hybrid Architecture**: Local Ollama models handle lightweight tasks (e.g., summaries) to reduce cost, while the cloud Groq API handles complex tasks (e.g., reviews) where quality matters more;
2. **Session Affinity**: Consistent hashing keeps requests from the same session on the same node, so the KV cache can be reused instead of being rebuilt, which speeds up inference;
3. **Observability**: Built-in Prometheus metrics follow cloud-native conventions and make operation, maintenance, and performance tuning easier.
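For readers unfamiliar with what a `/metrics` endpoint returns, the sketch below renders a per-node request counter in the Prometheus text exposition format using only the standard library. It is an illustration, not the gateway's code: the metric name `gateway_requests_total` and its `node` and `status` labels are hypothetical, and a real service would more likely rely on a Prometheus client library than format the text by hand.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class RequestCounter:
    """A labelled counter rendered in the Prometheus text format."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: DefaultDict[Tuple[str, str], int] = defaultdict(int)

    def inc(self, node: str, status: str) -> None:
        self._values[(node, status)] += 1

    def render(self) -> str:
        # Assumes label values contain no quotes, backslashes, or newlines,
        # which would need escaping in the exposition format.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for (node, status), value in sorted(self._values.items()):
            lines.append(f'{self.name}{{node="{node}",status="{status}"}} {value}')
        return "\n".join(lines) + "\n"


# Hypothetical metric name and labels.
requests_total = RequestCounter(
    "gateway_requests_total", "Requests routed by the gateway."
)
requests_total.inc("ollama-1", "ok")
requests_total.inc("ollama-1", "ok")
requests_total.inc("ollama-2", "error")
print(requests_total.render(), end="")
```

Error rate and node load can then be derived on the Prometheus side from counters like this one, rather than being computed inside the gateway.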

## Limitations and Improvement Directions

### Limitations and Optimization Suggestions
The project is currently a functional prototype and needs further work before production use:
- **Health Check**: Add automatic removal of failed nodes and automatic recovery when they come back;
- **Dynamic Scaling**: Adjust the number of worker nodes automatically based on load;
- **Cache Strategy**: Support partial KV cache sharing and cache warm-up;
- **Request Priority**: Distinguish between real-time and batch requests to optimize resource allocation.
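As one possible starting point for the health-check suggestion, the sketch below tracks consecutive failures per node, treats a node as unhealthy once it crosses a threshold, and restores it on the next successful probe. Everything here is an assumption for illustration (the `HealthTracker` name, the threshold of three failures, the node names); the project does not currently include such a mechanism.

```python
from typing import Dict, List


class HealthTracker:
    """Marks a node unhealthy after N consecutive failed probes."""

    def __init__(self, nodes: List[str], max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self._failures: Dict[str, int] = {node: 0 for node in nodes}

    def record(self, node: str, ok: bool) -> None:
        """Record one probe result; any success resets the failure count."""
        self._failures[node] = 0 if ok else self._failures[node] + 1

    def is_healthy(self, node: str) -> bool:
        return self._failures[node] < self.max_failures

    def healthy_nodes(self) -> List[str]:
        return [node for node in self._failures if self.is_healthy(node)]


# Hypothetical node names.
tracker = HealthTracker(["ollama-1", "ollama-2"])

for _ in range(3):
    tracker.record("ollama-2", ok=False)   # three failed probes in a row
print(tracker.healthy_nodes())             # ollama-2 is excluded

tracker.record("ollama-2", ok=True)        # node answers again
print(tracker.healthy_nodes())             # ollama-2 is back
```

A gateway could rebuild or update its hash ring from `healthy_nodes()` after each probe round. Because the ring uses consistent hashing, removing a failed node would only remap the sessions that were assigned to it.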

This project is a useful reference for anyone building a production-grade LLM inference gateway, and its consistent hashing routing scheme in particular is worth studying.
