Zing Forum

vLLM Inference Observability Console: Three-Tier Architecture for Real-Time Telemetry and Visual Analysis

An open-source project built on a three-tier React + Node + FastAPI architecture that provides a real-time monitoring dashboard for vLLM inference services, with concurrent SSE streaming, scheduler status monitoring, KV cache metric tracking, and batch analysis.

Tags: vLLM, LLM inference, observability, monitoring dashboard, React, FastAPI, SSE streaming, performance analysis, KV cache, continuous batching
Published 2026-06-04 03:44 | Recent activity 2026-06-04 03:48 | Estimated read 5 min

Section 01

Introduction: Core Overview of the vLLM Inference Observability Console Project

This open-source project uses a three-tier React + Node + FastAPI architecture to provide a real-time monitoring dashboard for vLLM inference services. It supports concurrent SSE streaming, scheduler status monitoring, KV cache metric tracking, and batch analysis. Together these address the limitations of command-line monitoring and make the system easier to observe and maintain.

Section 02

Project Background and Motivation

In production LLM inference environments, observability is key to system stability and performance optimization. As vLLM gained wide adoption, developers needed to monitor metrics such as token latency and scheduler status in real time, but command-line tools could not provide an intuitive, interactive view. This project is a modern rewrite of an earlier Streamlit-based vLLM monitoring dashboard, using a three-tier architecture to improve both user experience and scalability.

Section 03

Detailed Explanation of Three-Tier Architecture Design

The project separates its concerns into three tiers:

1. React frontend (built with Vite): provides status panels, model switching, a real-time SSE token stream display, charts, and CSV export.
2. Node/Express BFF (backend-for-frontend) layer: handles CORS, hides the GPU server address from the browser, proxies streams, manages connections, and leaves room for extensions.
3. FastAPI + vLLM inference layer: supports a real mode (running on a GPU) and a mock mode (GPU-free development). Both modes expose the same API, so switching between them is easy.
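To make the mock mode and the SSE token stream more concrete, here is a minimal sketch of what a GPU-free token source could look like. It is not the project's code: the function names (`format_sse`, `parse_sse`, `mock_token_stream`), the payload fields (`index`, `token`, `done`), and the word-by-word echo behaviour are all assumptions made for illustration. Only the `data: ...` frame followed by a blank line comes from the SSE format itself.

```python
import asyncio
import json
from typing import AsyncIterator


def format_sse(payload: dict) -> str:
    """Encode one payload as an SSE frame: a data line followed by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(frame: str) -> dict:
    """Decode a single-line SSE data frame back into its JSON payload."""
    if not (frame.startswith("data: ") and frame.endswith("\n\n")):
        raise ValueError("not a single-line SSE data frame")
    return json.loads(frame[len("data: "):].strip())


async def mock_token_stream(prompt: str, delay_s: float = 0.0) -> AsyncIterator[str]:
    """GPU-free stand-in (hypothetical): echo the prompt word by word as SSE frames."""
    for index, word in enumerate(prompt.split()):
        await asyncio.sleep(delay_s)  # simulated decode latency, an assumed value
        yield format_sse({"index": index, "token": word})
    yield format_sse({"done": True})  # assumed end-of-stream marker


async def collect(prompt: str) -> list[dict]:
    """Consume the whole stream and return the decoded events."""
    return [parse_sse(frame) async for frame in mock_token_stream(prompt)]


if __name__ == "__main__":
    for event in asyncio.run(collect("hello from mock mode")):
        print(event)
```

Because the mock generator and a real vLLM-backed generator would emit frames of the same shape, the BFF and frontend would not need to know which one is running, which is the point of keeping the two modes API-compatible.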

Section 04

Core Functions and Test Scenarios

Core functions include concurrent SSE streaming (three simultaneous requests simulate a multi-user scenario), scheduler status monitoring (number of active requests, KV cache status, and so on), and batch analysis (visualization of metrics such as TTFT (time to first token), ITL (inter-token latency), and throughput). Three built-in test cases (short, medium, and long prompts) can be run independently or together, and support model A/B comparison.
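The metrics named above can be derived from token arrival timestamps alone. The sketch below assumes common definitions: TTFT is the delay from request start to the first token, ITL is the mean gap between consecutive tokens, and throughput is tokens divided by total request time. The project may define or aggregate them differently. `fake_request` and `run_three` are hypothetical stand-ins for real streaming requests, with token counts and delays chosen arbitrarily to mimic short, medium, and long runs.

```python
import asyncio
import time
from dataclasses import dataclass
from statistics import mean


@dataclass
class StreamMetrics:
    ttft_s: float        # time to first token, in seconds
    mean_itl_s: float    # mean inter-token latency, in seconds
    tokens_per_s: float  # output tokens divided by total request time


def summarize(start_s: float, token_times_s: list[float]) -> StreamMetrics:
    """Derive TTFT, mean ITL, and throughput from token arrival timestamps."""
    if not token_times_s:
        raise ValueError("no tokens received")
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    total = token_times_s[-1] - start_s
    return StreamMetrics(
        ttft_s=token_times_s[0] - start_s,
        mean_itl_s=mean(gaps) if gaps else 0.0,
        tokens_per_s=len(token_times_s) / total if total > 0 else 0.0,
    )


async def fake_request(n_tokens: int, delay_s: float) -> StreamMetrics:
    """Hypothetical stand-in for one streaming request."""
    start = time.perf_counter()
    arrivals: list[float] = []
    for _ in range(n_tokens):
        await asyncio.sleep(delay_s)  # simulated per-token latency
        arrivals.append(time.perf_counter())
    return summarize(start, arrivals)


async def run_three() -> list[StreamMetrics]:
    """Issue short, medium, and long requests at the same time."""
    return await asyncio.gather(
        fake_request(4, 0.001),
        fake_request(8, 0.001),
        fake_request(16, 0.001),
    )


if __name__ == "__main__":
    for metrics in asyncio.run(run_three()):
        print(metrics)
```

Keeping `summarize` a pure function of timestamps means the same calculation can be applied to live streams and to recorded runs loaded for batch analysis.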

Section 05

Quick Start and Technical Highlights

The project provides a one-click startup script for macOS, Linux, and Windows. Manual startup requires launching the inference server, the BFF service, and the frontend, in that order. Technical highlights include timestamp handling with microsecond precision, a cancelable request mechanism, and a lab-style dark theme UI.
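The source does not describe how the timestamps or the cancellation are implemented, so the following is only one plausible sketch in Python. It stamps each token with integer microseconds taken from `time.time_ns()` and stops an in-flight stream with `asyncio.Task.cancel()`. The names `now_us`, `token_source`, `consume`, and `run_and_cancel` are hypothetical, and the token source is a stand-in for a real SSE response.

```python
import asyncio
import time


def now_us() -> int:
    """Wall-clock time as integer microseconds, avoiding float rounding."""
    return time.time_ns() // 1_000


async def token_source(total: int, delay_s: float = 0.001):
    """Hypothetical token stream standing in for a real SSE response."""
    for i in range(total):
        await asyncio.sleep(delay_s)
        yield f"tok{i}"


async def consume(total: int, received: list, reached: asyncio.Event, cancel_at: int) -> None:
    """Record (timestamp_us, token) pairs and signal once cancel_at tokens arrived."""
    async for token in token_source(total):
        received.append((now_us(), token))
        if len(received) == cancel_at:
            reached.set()  # tell the caller it may cancel now


async def run_and_cancel(total: int = 50, cancel_at: int = 3):
    """Start a stream, then cancel it after a few tokens, like a stop button."""
    received: list = []
    reached = asyncio.Event()
    task = asyncio.create_task(consume(total, received, reached, cancel_at))
    await reached.wait()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the stream was stopped on purpose
    return received, task.cancelled()


if __name__ == "__main__":
    tokens, cancelled = asyncio.run(run_and_cancel())
    print(f"cancelled={cancelled}, tokens received={len(tokens)}")
```

In a three-tier setup the cancellation would also need to propagate through the BFF to the inference server so that generation actually stops. This sketch only covers the consumer side.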

Section 06

Extension Directions and Future Plans

The model comparison function is not yet implemented, although its supporting infrastructure is in place. Potential extension directions include user authentication and access control, multi-GPU cluster monitoring, Prometheus/Grafana integration, and custom test case import.

Section 07

Project Summary and Insights

This project demonstrates the evolution from a prototype tool toward a production-ready system. The three-tier architecture addresses technical limitations of the original implementation, such as CORS handling and exposure of the GPU server address, and lays a foundation for long-term evolution. For teams running LLM inference services, it offers a reference for a complete observability solution, and its architectural design and engineering practices are worth studying.