vllm-gateway: An Open-Source Gateway for Team-Level LLM Inference Cost and Latency Attribution

A Go-based reverse proxy gateway for vLLM that supports team-level inference cost and latency attribution, integrates ClickHouse storage, Prometheus monitoring, and Grafana visualization, and is suitable for enterprise-level LLM service governance.

Tags: vLLM, LLM inference, cost attribution, latency monitoring, multi-tenant, gateway, Prometheus, Grafana, ClickHouse, Go

Published 2026-06-02 06:45 | Recent activity 2026-06-02 06:49 | Estimated read 8 min

Section 01

[Open Source Project] vllm-gateway: A Team-Level Solution for LLM Inference Cost and Latency Attribution

vllm-gateway is a Go-based reverse proxy gateway for vLLM, designed to provide teams with precise attribution capabilities for LLM inference costs and latency. It integrates ClickHouse storage, Prometheus monitoring, and Grafana visualization, making it suitable for enterprise-level LLM service governance scenarios. It addresses core pain points such as resource consumption tracking and latency monitoring when multiple teams share an inference cluster, and supports multi-tenant isolation and billing.

Section 02

Project Background and Pain Points

With the widespread application of LLMs in enterprises, inference cost control and performance monitoring have become core challenges. Traditional vLLM deployments provide high-performance inference but lack fine-grained cost attribution capabilities:

Ambiguous Costs: Resource consumption cannot be broken down by team or project;
Missing Latency Metrics: Key indicators such as Time to First Token (TTFT) are not tracked;
Insufficient Observability: There are no out-of-the-box monitoring dashboards;
Isolation Difficulties: Team-level resource isolation and billing are hard to achieve on shared infrastructure.

As a lightweight proxy layer, vllm-gateway is specifically designed to solve these problems.

Section 03

Core Architecture and Functional Features

Architecture Design: Client request → Gateway (8080) → vLLM/Simulation Service (8000/8001) → ClickHouse (event storage + 15-second aggregation) → Prometheus (5-second collection) → Grafana dashboard; meanwhile, the gateway scrapes vLLM's /metrics endpoint every 15 seconds.

Key Features:

Team-Level Attribution: Multi-tenant identification via HTTP headers X-Team-ID (required), X-Project (optional), and X-User-ID (optional);
Streaming Response: Supports OpenAI-compatible SSE streaming responses and records TTFT metrics;
API Compatibility: Supports OpenAI API endpoints like /v1/completions and /v1/chat/completions;
Developer-Friendly: Provides a simulation environment (no GPU required) that can simulate traffic with a 33% share of streaming requests.
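The header-based attribution and TTFT recording described above can be sketched in Go roughly as follows. This is a minimal illustration, not the project's actual code: the names Attribution, ParseAttribution, ttftWriter, and WithAttribution are hypothetical, rejecting a request that lacks X-Team-ID with HTTP 400 is an assumption, and TTFT is approximated as the time until the first non-empty body chunk is written. In a real gateway the wrapped handler would presumably be a reverse proxy to vLLM; here a stand-in handler emits one SSE chunk.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Attribution holds the tenant identifiers read from the request headers.
type Attribution struct {
	TeamID, Project, UserID string
}

// ParseAttribution reads the three headers named in the article.
// X-Team-ID is required; X-Project and X-User-ID may be empty.
func ParseAttribution(h http.Header) (Attribution, error) {
	a := Attribution{
		TeamID:  h.Get("X-Team-ID"),
		Project: h.Get("X-Project"),
		UserID:  h.Get("X-User-ID"),
	}
	if a.TeamID == "" {
		return a, errors.New("missing required header X-Team-ID")
	}
	return a, nil
}

// ttftWriter notes when the first non-empty body chunk is written.
type ttftWriter struct {
	http.ResponseWriter
	start time.Time
	seen  bool
	ttft  time.Duration
}

func (w *ttftWriter) Write(p []byte) (int, error) {
	if !w.seen && len(p) > 0 {
		w.seen, w.ttft = true, time.Since(w.start)
	}
	return w.ResponseWriter.Write(p)
}

// Flush forwards flushes so streamed chunks are not held back by the wrapper.
func (w *ttftWriter) Flush() {
	if f, ok := w.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}

// WithAttribution (hypothetical) wraps next and reports attribution and TTFT
// to record once the response has been written.
func WithAttribution(next http.Handler, record func(Attribution, time.Duration)) http.Handler {
	return http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
		attr, err := ParseAttribution(r.Header)
		if err != nil {
			// Assumption: unattributed requests are rejected with 400.
			http.Error(rw, err.Error(), http.StatusBadRequest)
			return
		}
		tw := &ttftWriter{ResponseWriter: rw, start: time.Now()}
		next.ServeHTTP(tw, r)
		record(attr, tw.ttft)
	})
}

func main() {
	// Stand-in for the upstream: emits a single SSE chunk.
	upstream := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		fmt.Fprint(w, "data: hello\n\n")
	})
	h := WithAttribution(upstream, func(a Attribution, ttft time.Duration) {
		fmt.Printf("team=%s project=%q ttft=%s\n", a.TeamID, a.Project, ttft)
	})

	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
	req.Header.Set("X-Team-ID", "ml-platform")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println("status:", rec.Code)
}
```

For chat completions the first streamed chunk may carry only role metadata rather than a token, so a stricter measurement would inspect the SSE payload instead of timing the first write.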

Section 04

Technical Implementation and Deployment Guide

Storage Layer: ClickHouse stores the time-series data in three tables:

request_events: Raw request events (token count, latency, TTFT, etc.);
request_metrics: Per-team latency and TTFT percentiles, aggregated over 15-second intervals;
vllm_system_metrics: vLLM system metrics (queue depth, number of running requests).
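To make the relationship between request_events and request_metrics concrete, here is a small Go sketch of a 15-second, per-team percentile roll-up. It only illustrates the arithmetic: the project may well perform this aggregation inside ClickHouse, and the struct fields, the Summarize and Percentile helpers, the choice of p50/p95, and the nearest-rank method are all assumptions rather than the actual schema or implementation.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// RequestEvent mirrors the kind of row described for request_events.
// Field names are illustrative, not the project's actual schema.
type RequestEvent struct {
	Time      time.Time
	TeamID    string
	LatencyMS float64
	TTFTMS    float64
}

// BucketKey identifies one team within one aggregation window.
type BucketKey struct {
	WindowStart int64 // Unix seconds at the start of the window
	TeamID      string
}

// Summary holds per-window percentiles for one team.
type Summary struct {
	Count                  int
	LatencyP50, LatencyP95 float64
	TTFTP50, TTFTP95       float64
}

// Percentile returns the nearest-rank percentile p (0 < p <= 100).
// It returns 0 for an empty input.
func Percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// Summarize groups events into fixed windows per team and computes percentiles.
func Summarize(events []RequestEvent, window time.Duration) map[BucketKey]Summary {
	lat := make(map[BucketKey][]float64)
	ttft := make(map[BucketKey][]float64)
	for _, e := range events {
		k := BucketKey{e.Time.Truncate(window).Unix(), e.TeamID}
		lat[k] = append(lat[k], e.LatencyMS)
		ttft[k] = append(ttft[k], e.TTFTMS)
	}
	out := make(map[BucketKey]Summary, len(lat))
	for k, v := range lat {
		out[k] = Summary{
			Count:      len(v),
			LatencyP50: Percentile(v, 50),
			LatencyP95: Percentile(v, 95),
			TTFTP50:    Percentile(ttft[k], 50),
			TTFTP95:    Percentile(ttft[k], 95),
		}
	}
	return out
}

func main() {
	t0 := time.Date(2026, 6, 2, 6, 45, 0, 0, time.UTC)
	events := []RequestEvent{
		{t0, "search", 120, 30},
		{t0.Add(5 * time.Second), "search", 480, 90},
		{t0.Add(20 * time.Second), "search", 200, 40}, // falls into the next window
		{t0.Add(7 * time.Second), "ads", 75, 20},
	}
	for k, s := range Summarize(events, 15*time.Second) {
		fmt.Printf("%s @%d: n=%d p50=%.0fms p95=%.0fms\n",
			k.TeamID, k.WindowStart, s.Count, s.LatencyP50, s.LatencyP95)
	}
}
```

Keying the window by Unix seconds rather than by time.Time avoids surprises when time values carrying different locations are used as map keys.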

Metric Collection: The gateway actively scrapes vLLM's /metrics endpoint every 15 seconds to integrate system-level metrics.
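As a rough illustration of what such a scrape involves, the Go sketch below pulls two gauges out of a Prometheus text-format payload. ParseGauges is a hypothetical helper, and the metric names vllm:num_requests_running and vllm:num_requests_waiting are assumed to match what your vLLM version exports, so check them against the actual /metrics output. Samples that share a metric name (for example, one per model label) are summed, which is a design choice made for this sketch. In a gateway, the text would come from an HTTP GET against /metrics on a 15-second ticker.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseGauges extracts the wanted metrics from Prometheus text-format output.
// Samples with the same metric name but different labels are summed.
// This is a simplified parser for illustration; it ignores HELP/TYPE lines
// and any optional timestamp after the value.
func ParseGauges(text string, wanted ...string) (map[string]float64, error) {
	want := make(map[string]bool, len(wanted))
	for _, name := range wanted {
		want[name] = true
	}
	out := make(map[string]float64)
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the label block or at the first whitespace.
		name, rest := line, ""
		if i := strings.IndexAny(line, "{ \t"); i >= 0 {
			name, rest = line[:i], line[i:]
		}
		if !want[name] {
			continue
		}
		// Skip past the label block, if any, to reach the sample value.
		if j := strings.LastIndex(rest, "}"); j >= 0 {
			rest = rest[j+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return nil, fmt.Errorf("no value in line %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in line %q: %w", line, err)
		}
		out[name] += v
	}
	return out, nil
}

func main() {
	// Sample payload; the model_name label value is made up.
	sample := `# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 3.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 7.0
`
	m, err := ParseGauges(sample, "vllm:num_requests_running", "vllm:num_requests_waiting")
	if err != nil {
		panic(err)
	}
	fmt.Println("running:", m["vllm:num_requests_running"], "waiting:", m["vllm:num_requests_waiting"])
}
```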

Apple Silicon Support: Provides a Metal backend, which can be installed and started via scripts.

Deployment:

Single config.yaml configuration file;
Docker environment automatically overrides the hostname;
Example request: Send a POST request using curl with the X-Team-ID header;
Grafana dashboards: Two sets, Live (real-time metrics) and History (historical attribution data).
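The example request mentioned above uses curl; for consistency with the gateway's own language, a Go equivalent might look like the sketch below. NewChatRequest is a hypothetical helper, the team and model names are placeholders, the base URL assumes the gateway listens on port 8080 as in the architecture description, and the JSON body follows the OpenAI chat completions format.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Stream   bool          `json:"stream"`
}

// NewChatRequest (hypothetical helper) builds a POST to the gateway's
// /v1/chat/completions endpoint and tags it with the required X-Team-ID header.
func NewChatRequest(gatewayURL, teamID, model, prompt string, stream bool) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
		Stream:   stream,
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, gatewayURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Team-ID", teamID)
	return req, nil
}

func main() {
	// Placeholder values; replace with your own team ID and served model name.
	req, err := NewChatRequest("http://localhost:8080", "ml-platform", "example-model", "Hello", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL, "X-Team-ID:", req.Header.Get("X-Team-ID"))
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```

The optional X-Project and X-User-ID headers can be set the same way when finer-grained attribution is needed.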

Section 05

Applicable Scenarios and Value

Enterprise Internal LLM Platform:

FinOps: Precisely track team inference expenses, support cost allocation and budget control;
Performance SLA: Define team-level service agreements based on TTFT and end-to-end latency;
Capacity Planning: Predict resource requirements based on historical data.

Multi-Tenant SaaS:

Usage Metering: Generate customer usage reports;
Rate Limiting: Extensible team-level rate limits;
Fault Isolation: Quickly identify the source of abnormal traffic.

R&D Efficiency:

Identify high-latency prompt patterns;
Optimize token usage efficiency;
Compare cost-effectiveness of different models.

Section 06

Summary and Recommendations

Summary: vllm-gateway fills the gap in enterprise-level governance capabilities within the vLLM ecosystem, making it suitable for teams that already use vLLM but lack multi-tenant attribution.

Recommended Adoption Path:

Evaluation: Run ./scripts/dev.sh mock to try the gateway locally;
Pilot: Select 1-2 teams to connect to production traffic for validation;
Promotion: Establish cost allocation and performance optimization processes based on gateway data;
Customization: Develop enhanced features like rate limiting and caching.

Limitations and Extensions: The current version does not yet include rate limiting, a caching layer, A/B testing, or cost estimation; these are natural directions for future extension.

Open Source License: MIT license, allowing free modification and commercial use. The code is clearly structured, which makes it straightforward to extend.
