vllm-gateway: An Open-Source Gateway for Team-Level LLM Inference Cost and Latency Attribution

A Go-based reverse proxy gateway for vLLM that supports team-level inference cost and latency attribution, integrates ClickHouse storage, Prometheus monitoring, and Grafana visualization, and is suitable for enterprise-level LLM service governance.

Tags: vLLM, LLM inference, cost attribution, latency monitoring, multi-tenant, gateway, Prometheus, Grafana, ClickHouse, Go
Published 2026-06-02 06:45 | Recent activity 2026-06-02 06:49 | Estimated read 8 min

Section 01

[Open Source Project] vllm-gateway: A Team-Level Solution for LLM Inference Cost and Latency Attribution

vllm-gateway is a Go-based reverse proxy that sits in front of vLLM and attributes inference cost and latency to individual teams. It integrates ClickHouse storage, Prometheus monitoring, and Grafana visualization, which makes it suitable for enterprise LLM service governance. It targets the core pain points of multiple teams sharing one inference cluster, namely tracking resource consumption and monitoring latency, and provides the per-team data that multi-tenant isolation and billing depend on.

Section 02

Project Background and Pain Points

As LLMs spread through the enterprise, controlling inference cost and monitoring performance have become core challenges. A standard vLLM deployment delivers high-performance inference but offers no fine-grained cost attribution:

  • Ambiguous Costs: Resource consumption cannot be broken down by team or project;
  • Missing Latency Metrics: Key indicators such as Time to First Token (TTFT) are not tracked;
  • Insufficient Observability: There are no out-of-the-box monitoring dashboards;
  • Isolation Difficulties: Team-level resource isolation and billing are hard to achieve on shared infrastructure.

As a lightweight proxy layer, vllm-gateway is specifically designed to solve these problems.

Section 03

Core Architecture and Functional Features

Architecture Design: A client request enters the gateway (port 8080), which forwards it to vLLM or the simulation service (ports 8000/8001). Request events go to ClickHouse (raw event storage plus 15-second aggregation), Prometheus collects metrics every 5 seconds, and Grafana dashboards visualize the results. In parallel, the gateway scrapes vLLM's /metrics endpoint every 15 seconds.

Key Features:

  1. Team-Level Attribution: Multi-tenant identification via HTTP headers X-Team-ID (required), X-Project (optional), and X-User-ID (optional);
  2. Streaming Response: Supports OpenAI-compatible SSE streaming responses and records TTFT metrics;
  3. API Compatibility: Supports OpenAI API endpoints like /v1/completions and /v1/chat/completions;
  4. Developer-Friendly: Provides a simulation environment (no GPU required) that can simulate traffic in which 33% of requests are streaming.

Section 04

Technical Implementation and Deployment Guide

Storage Layer: ClickHouse stores the time-series data in three tables:

  • request_events: Raw request events (token count, latency, TTFT, etc.);
  • request_metrics: Per-team latency and TTFT percentiles, summarized at 15-second intervals;
  • vllm_system_metrics: vLLM system metrics (queue depth, number of running requests).

Metric Collection: The gateway actively scrapes vLLM's /metrics endpoint every 15 seconds to integrate system-level metrics.
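
A scrape of /metrics returns Prometheus text exposition format, so extracting queue depth and running requests comes down to parsing sample lines. The sketch below is a deliberately simple parser written for illustration, not the gateway's code. The metric names vllm:num_requests_running and vllm:num_requests_waiting are names vLLM has exposed, but treat them as assumptions and check them against your vLLM version.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseGauges extracts the named metrics from Prometheus text format.
// Labels are ignored; values with different label sets are summed.
func parseGauges(body string, names ...string) map[string]float64 {
	want := make(map[string]bool, len(names))
	for _, n := range names {
		want[n] = true
	}
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// A sample line looks like: name{labels} value [timestamp]
		name, rest := line, ""
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name, rest = line[:i], line[i:]
		}
		if !want[name] {
			continue
		}
		if j := strings.LastIndex(rest, "}"); j >= 0 {
			rest = rest[j+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		out[name] += v
	}
	return out
}

func main() {
	// Hypothetical scrape body; a real gateway would fetch it over HTTP.
	body := `# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 2.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 5.0
`
	m := parseGauges(body, "vllm:num_requests_running", "vllm:num_requests_waiting")
	fmt.Println("running:", m["vllm:num_requests_running"], "waiting:", m["vllm:num_requests_waiting"])
}
```

For production use, a maintained parser for the exposition format is a safer choice than hand-rolled string handling.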

Apple Silicon Support: Provides a Metal backend, which can be installed and started via scripts.

Deployment:

  • Configuration lives in a single config.yaml file;
  • When running under Docker, the hostname settings are overridden automatically;
  • Example request: Send a POST request with curl, including the X-Team-ID header;
  • Grafana dashboards: Two sets, Live (real-time metrics) and History (historical attribution data).

Section 05

Applicable Scenarios and Value

Enterprise Internal LLM Platform:

  • FinOps: Precisely track team inference expenses, support cost allocation and budget control;
  • Performance SLA: Define team-level service agreements based on TTFT and end-to-end latency;
  • Capacity Planning: Predict resource requirements based on historical data.

Multi-Tenant SaaS:

  • Usage Metering: Generate customer usage reports;
  • Rate Limiting: Team-level rate limits can be added as an extension (not built into the current version);
  • Fault Isolation: Quickly identify the source of abnormal traffic.

R&D Efficiency:

  • Identify high-latency prompt patterns;
  • Optimize token usage efficiency;
  • Compare cost-effectiveness of different models.

Section 06

Summary and Recommendations

Summary: vllm-gateway fills the gap in enterprise-level governance capabilities within the vLLM ecosystem, making it suitable for teams that already use vLLM but lack multi-tenant attribution.

Recommended Adoption Path:

  1. Evaluation: Run ./scripts/dev.sh mock to try the gateway locally;
  2. Pilot: Connect 1-2 teams to production traffic for validation;
  3. Rollout: Establish cost allocation and performance optimization processes based on gateway data;
  4. Customization: Develop enhanced features such as rate limiting and caching.

Limitations and Extensions: The current version lacks rate limiting, a caching layer, A/B testing, and cost estimation. These are natural directions for future extension.

Open Source License: MIT license, which allows free modification and commercial use. The code structure is clear, which makes the project easy to extend.