
LLM Inference Gateway: A Multi-Workload Routing Scheme Based on Consistent Hashing

An LLM inference gateway built with Python and FastAPI. It includes an intelligent PR review agent and uses consistent hashing to route requests across multiple Ollama worker nodes, so that each session keeps reaching a node whose KV cache is already warm.

Tags: LLM inference gateway, consistent hashing, FastAPI, Ollama, PR review, KV cache optimization

Published 2026-06-12 07:44 | Recent activity 2026-06-12 07:50 | Estimated read 8 min


Section 01

LLM Inference Gateway Project Guide: PR Review and Multi-Node Routing Optimization Scheme

Core Overview of the LLM Inference Gateway Project

This project is maintained by Suraj-1207 on GitHub (repository name: llm-inference-gateway) and is a complete LLM application stack built with Python and FastAPI. It has two goals: to bring large language model capabilities into software development workflows such as PR review, and to solve request routing across multiple Ollama worker nodes. Consistent hashing sends requests from the same session to the same node, so that node can reuse its KV cache instead of reprocessing the session's earlier context.

The PR review agent is built from modular components (GitHub data fetching, PR summary generation, ReAct agent review, LLM-as-Judge evaluation, etc.), and the project supports a hybrid architecture that uses both local and cloud models.

Section 02

Project Background: Challenges in PR Review and Multi-Node Inference

Project Background and Challenges

PR review is a key step in software development, but manual review is time-consuming and prone to oversights. Integrating LLMs into PR workflows can automate summary generation and code review, but doing so raises two major challenges:

Multi-node routing: Traditional round-robin strategies scatter requests from the same session across nodes, so the KV cache cannot be reused and inference efficiency suffers;
LLM capability integration: The system needs to balance the cost advantage of local models (e.g., Ollama) against the ability of cloud APIs (e.g., Groq) to handle complex tasks.

This project aims to build an efficient LLM inference gateway and PR review system to solve the above problems.

Section 03

Core Component Analysis: From PR Data to Intelligent Review

Detailed Explanation of Core Components

The project adopts a modular design, with key components including:

GitHub Data Fetching Layer: github_fetcher.py calls the GitHub API to obtain raw inputs such as PR metadata and diffs;
PR Summary Generation: summarise_pr.py uses local Ollama models to generate PR summaries without external APIs;
ReAct Agent Review: agent.py implements the ReAct loop, analyzes code changes, and posts review comments via the GitHub API (depends on the Groq API);
Evaluation Framework: eval.py uses the LLM-as-Judge pattern to score the quality of generated content, which supports iterative optimization.
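To make the ReAct loop mentioned above more concrete, here is a minimal sketch of the general pattern (thought, action, observation, repeated until a final answer). It is not taken from agent.py: the function react_loop, the "Action: tool[arg]" text format, the get_diff tool, and the scripted stand-in for the model are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict

# Assumed text protocol: the model emits "Action: tool[arg]" or "Final Answer: ...".
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.S)


def react_loop(llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               task: str,
               max_steps: int = 5) -> str:
    """Alternate model output and tool observations until a final answer appears."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action is None:
            transcript += "Observation: no valid action found\n"
            continue
        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        # Feed the tool result back so the next model call can reason over it.
        transcript += f"Observation: {observation}\n"
    return "No final answer within step limit"


def scripted_llm(transcript: str) -> str:
    """Hypothetical stand-in for a real model call (e.g. a request to a cloud API)."""
    if "Observation:" not in transcript:
        return "Thought: I should read the diff first.\nAction: get_diff[utils.py]"
    return ("Thought: The change lacks a None check.\n"
            "Final Answer: Consider guarding against None in utils.py.")


# Hypothetical tool; a real agent would fetch the diff from the GitHub API.
TOOLS = {"get_diff": lambda path: f"+ return value.strip()  # in {path}"}

if __name__ == "__main__":
    print(react_loop(scripted_llm, TOOLS, "Review the PR"))
```

In a real agent the scripted function would be replaced by a model call and the final answer would be posted as a review comment; the step limit guards against a model that never produces a final answer.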

Section 04

Core of Inference Gateway: Consistent Hashing Routing Mechanism

Implementation of Consistent Hashing Routing

The inference gateway (gateway.py) solves multi-node routing issues via consistent hashing:

Routing Rules: The X-Session-ID header identifies a session; requests from the same session are routed to the same Ollama node, and requests without a session ID fall back to round-robin;
Algorithm Details: The hash ring is built with MD5 hashing, and each node is mapped to 150 virtual nodes, which spreads sessions more evenly and ensures that only a small share of them is reassigned when a node is added or removed;
Monitoring Capability: The gateway exposes a Prometheus /metrics endpoint for monitoring request latency, error rate, node load, and other metrics.
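The routing rules above can be sketched as follows. This is an illustrative reconstruction, not the code in gateway.py: the class names HashRing and Router, the "node#i" naming of virtual nodes, and the extra ports 11435 and 11436 are assumptions; only MD5, the 150 virtual nodes per node, and the round-robin fallback come from the description.

```python
import bisect
import hashlib
from itertools import cycle
from typing import Optional, Sequence


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes: Sequence[str], vnodes: int = 150):
        self.vnodes = vnodes
        self._ring = {}    # hash point -> physical node
        self._points = []  # sorted hash points on the ring
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Each physical node owns `vnodes` points scattered around the ring.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._ring[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._ring[p] != node]
        self._ring = {p: n for p, n in self._ring.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._points:
            raise RuntimeError("hash ring has no nodes")
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[self._points[idx]]


class Router:
    """Session ID present: consistent hashing. Absent: round-robin."""

    def __init__(self, nodes: Sequence[str]):
        self.ring = HashRing(nodes)
        self._round_robin = cycle(list(nodes))

    def pick(self, session_id: Optional[str]) -> str:
        if session_id:
            return self.ring.get_node(session_id)
        return next(self._round_robin)


# Assumed node addresses; only port 11434 appears in the original text.
NODES = ["http://127.0.0.1:11434", "http://127.0.0.1:11435", "http://127.0.0.1:11436"]

if __name__ == "__main__":
    router = Router(NODES)
    print(router.pick("session-42"), router.pick("session-42"))  # same node twice
    print(router.pick(None), router.pick(None))                  # rotates through nodes
```

In a FastAPI handler, the session ID would typically be read from the request headers and passed to pick. When a node is removed from the ring, only the sessions that were mapped to that node move elsewhere; every other session keeps its node.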

Section 05

Deployment and Usage Scenario Demonstration

Deployment and Usage

Environment Requirements

Groq API Key (used by agent/eval components);
GitHub Token (repo read permission);
Ollama (running locally, with the llama3.2 model pulled).

Startup Steps

Start multiple Ollama nodes: OLLAMA_HOST=127.0.0.1:11434 ollama serve & (repeat with a different port for each additional node);
Start the gateway: python gateway.py, then check its status via curl http://localhost:8000/health.

Usage Examples

PR Summary: python summarise_pr.py psf/requests 6710 $GITHUB_TOKEN;
Auto Review: python agent.py psf/requests 6710 $GITHUB_TOKEN;
Performance Benchmark: python benchmark.py (compare latency between round-robin and consistent hashing).

Section 06

Technical Highlights and Value Insights

Technical Highlights

Hybrid Architecture: Local Ollama handles lightweight tasks (e.g., summaries) to reduce costs, while cloud Groq handles complex tasks (e.g., reviews) to ensure quality;
Session Affinity: Consistent hashing keeps requests from the same session on one node so that its KV cache can be reused, which cuts repeated prompt processing and therefore latency;
Observability: Built-in Prometheus monitoring follows cloud-native conventions and supports both operations and performance tuning.
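The session affinity point can be illustrated with a small simulation. It does not reproduce benchmark.py and does not measure latency; it only counts how often a follow-up request lands on the node that served the same session's previous request, which is a rough proxy for KV cache reuse. The node names, session counts, and helper functions are all assumptions.

```python
import bisect
import hashlib
import itertools
import random


def md5_int(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes, vnodes=150):
    """Sorted list of (hash point, node) pairs, `vnodes` points per node."""
    return sorted((md5_int(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))


def ring_lookup(ring, key: str) -> str:
    # First point at or after the key's hash, wrapping around the ring.
    idx = bisect.bisect(ring, (md5_int(key), "")) % len(ring)
    return ring[idx][1]


def affinity_rate(route, requests) -> float:
    """Share of follow-up requests served by the session's previous node."""
    last_node = {}
    hits = follow_ups = 0
    for session in requests:
        node = route(session)
        if session in last_node:
            follow_ups += 1
            if last_node[session] == node:
                hits += 1
        last_node[session] = node
    return hits / follow_ups


def simulate(num_sessions=50, num_requests=2000, seed=0):
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical worker names
    rng = random.Random(seed)
    requests = [f"session-{rng.randrange(num_sessions)}" for _ in range(num_requests)]
    ring = build_ring(nodes)
    rotation = itertools.cycle(nodes)

    def round_robin(_session):
        return next(rotation)

    def consistent(session):
        return ring_lookup(ring, session)

    return affinity_rate(round_robin, requests), affinity_rate(consistent, requests)


if __name__ == "__main__":
    rr, ch = simulate()
    print(f"round-robin affinity: {rr:.2f}, consistent hashing affinity: {ch:.2f}")
```

With a fixed set of nodes, consistent hashing gives an affinity of 1.0 by construction, while round-robin with three nodes should land near one in three. How much latency that saves depends on prompt length and model, which is what a real benchmark against live nodes would have to measure.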

Section 07

Limitations and Improvement Directions

Limitations and Optimization Suggestions

The project is currently a functional prototype and needs further work before production use:

Health Check: Automatically remove failed nodes from routing and re-add them once they recover;
Dynamic Scaling: Adjust the number of worker nodes automatically based on load;
Cache Strategy: Support partial KV cache sharing and cache warm-up;
Request Priority: Distinguish between real-time and batch requests to optimize resource allocation.

The project offers a useful reference for building production-grade LLM inference gateways, and its consistent hashing routing scheme is worth studying.
