llm-inference-bench: An LLM Inference Performance Benchmark Tool with Real-Time Dashboard

An LLM inference decoding throughput benchmark tool supporting SGLang and vLLM, equipped with a Rich TUI real-time dashboard that measures token generation speed under different concurrency levels and context lengths.

LLM · benchmark · inference · vLLM · SGLang · throughput · performance-testing · TUI · GPU-monitoring
Published 2026-04-28 04:10 · Recent activity 2026-04-28 04:19 · Estimated read 6 min

Section 01

[Introduction] llm-inference-bench: An LLM Inference Performance Benchmark Tool with Real-Time Dashboard

As LLM deployment becomes widespread, traditional benchmarking tools are showing their limits: metrics are often single-dimensional and results rarely reflect actual production performance. llm-inference-bench was developed to address this. It is a benchmark tool focused on LLM inference decoding throughput, supporting the mainstream engines SGLang and vLLM and shipping with a Rich TUI real-time dashboard. It measures token generation speed across different concurrency levels and context lengths, covering the key performance dimensions through matrix testing.

Section 02

Project Background and Design Philosophy

LLM inference performance evaluation involves multiple dimensions, including concurrent request handling, context length, and prefill efficiency. Existing tools tend to cover only a narrow slice of these scenarios and present results that are hard to read at a glance. The design goal of llm-inference-bench is to cover the key dimensions via matrix testing (combining concurrency levels and context lengths) and to present the results visually. Its core philosophy is "matrix testing + real-time monitoring".
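To make the matrix idea concrete, here is a minimal sketch (not the project's actual code; the concurrency and context-length values are hypothetical) of how such a test plan can be enumerated as a Cartesian product of the two axes:

```python
from itertools import product

# Hypothetical axes for the benchmark matrix; the tool's real defaults may differ.
concurrency_levels = [1, 4, 16, 64]
context_lengths = [1024, 4096, 16384]

# Each cell of the matrix is one benchmark run at a fixed concurrency and prompt length.
test_matrix = [
    {"concurrency": c, "context_len": n}
    for c, n in product(concurrency_levels, context_lengths)
]

for case in test_matrix:
    print(f"run: concurrency={case['concurrency']}, context_len={case['context_len']}")
```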

Section 03

Core Features: Three-Layer Testing Mechanism

The tool offers three layers of testing (a per-request measurement sketch follows the list):

  1. Prefill Test: Measures input processing speed, records prompt_tokens and Time to First Token (TTFT), suitable for long-context scenarios like RAG;
  2. Continuous Decoding Test: Default mode, maintains concurrency saturation for a fixed duration (30 seconds) to reflect real throughput under stable load;
  3. Burst/End-to-End Decoding Test: Optional mode, simulates burst traffic with a fixed number of requests to evaluate peak response characteristics.
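As a rough illustration of what the prefill and decoding measurements involve, the sketch below streams one request from an OpenAI-compatible endpoint and records TTFT plus decode throughput. The base_url, api_key, and model name are placeholders, and counting stream chunks only approximates counting tokens; the tool's actual logic is more involved.

```python
import time
from openai import OpenAI  # any OpenAI-compatible endpoint (vLLM and SGLang both serve one)

# Hypothetical local endpoint; substitute your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_one_request(prompt: str, max_tokens: int = 256) -> dict:
    """Stream one completion and record TTFT plus decode throughput.

    Chunk counting is only an approximation of token counting; a real benchmark
    would read usage statistics or tokenize the output.
    """
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model="my-model",  # hypothetical model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # prefill finished -> TTFT
            chunks += 1
    end = time.perf_counter()

    ttft = (first_token_at or end) - start
    decode_time = max(end - (first_token_at or end), 1e-9)
    return {
        "ttft_s": ttft,
        "decode_tok_s": (chunks - 1) / decode_time if chunks > 1 else 0.0,
    }
```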

Section 04

Highlights of Real-Time Dashboard and Hardware Monitoring

The real-time TUI dashboard, built on the Rich library, uses an adaptive layout: on wide screens the metrics panel displays tok/s, TTFT, and ITL; the hardware monitoring panel shows real-time GPU temperature, VRAM utilization, power draw, and CPU status; and the event log panel records events such as warm-up and readiness so the test run can be followed as it progresses.
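Below is a minimal sketch of what a Rich live view with GPU polling can look like. This is not the project's dashboard layout; it assumes an NVIDIA GPU and the pynvml package, and the refresh cadence and panel contents are arbitrary.

```python
import time
import pynvml
from rich.live import Live
from rich.table import Table

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def render() -> Table:
    """Build one frame of a throughput/GPU-status table."""
    temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
    mem = pynvml.nvmlDeviceGetMemoryInfo(gpu)
    power_w = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000  # milliwatts -> watts

    table = Table(title="llm-inference-bench (sketch)")
    table.add_column("metric")
    table.add_column("value")
    table.add_row("GPU temp", f"{temp} °C")
    table.add_row("VRAM", f"{mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
    table.add_row("power", f"{power_w:.0f} W")
    return table

with Live(render(), refresh_per_second=2) as live:
    for _ in range(10):  # the real tool refreshes for the whole benchmark run
        time.sleep(0.5)
        live.update(render())
```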

Section 05

Engine Support and Remote API Compatibility

The tool supports the SGLang and vLLM engines with automatic engine-type detection, and can optionally pull internal metrics from the engine's Prometheus-format /metrics endpoint. It is also compatible with OpenAI API-formatted remote services (e.g., OpenRouter, Together AI), so a cloud service can be tested with just an API key and a model name.
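For the metrics-endpoint side, here is a hedged sketch of scraping Prometheus-format text from a local server. The URL and the "vllm:" metric prefix are assumptions about one possible deployment, not guaranteed names, and the parser ignores Prometheus details like timestamps.

```python
import requests

# Hypothetical local vLLM/SGLang server exposing Prometheus-format metrics.
METRICS_URL = "http://localhost:8000/metrics"

def scrape_metrics(prefix: str = "vllm:") -> dict[str, float]:
    """Fetch the /metrics text and keep simple gauge/counter lines matching a prefix."""
    text = requests.get(METRICS_URL, timeout=5).text
    values: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.startswith(prefix):
            continue
        name, _, value = line.rpartition(" ")
        try:
            values[name] = float(value)
        except ValueError:
            pass  # skip lines that are not a plain "name value" sample
    return values
```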

Section 06

Intelligent Features and Usability Design

The tool ships with several intelligent features: dynamic warm-up (uses scheduler metrics when available, falling back to the OpenAI-compatible interface), automatic KV cache budget detection (skips test cases that would exceed capacity), effective-concurrency detection (flags cases where actual concurrency falls below the requested level), JSON output (saves structured results), and automatic update checks (detects new versions on startup).
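As a toy example of the KV cache budget idea (the budget figure and the way it is obtained are assumptions; real engines page and evict cache blocks, which this check ignores):

```python
def fits_kv_cache(concurrency: int, context_len: int, kv_budget_tokens: int) -> bool:
    """Rough capacity check: skip a matrix cell if all concurrent requests at full
    context length would exceed the server's KV cache token budget.

    kv_budget_tokens is assumed to come from the engine (e.g. a metrics endpoint
    or a startup log line); this sketch treats it as a flat token count.
    """
    return concurrency * context_len <= kv_budget_tokens

# Example: with a 200k-token KV budget, 64 concurrent 4k-token contexts do not fit.
print(fits_kv_cache(concurrency=64, context_len=4096, kv_budget_tokens=200_000))  # False
```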

Section 07

Usage Scenarios and Practical Value

The tool suits scenarios such as inference-service selection, deployment parameter tuning, capacity planning, and performance regression testing. It helps teams find a suitable batch size and KV cache strategy when deploying open-source models, and provides an objective way to compare third-party API services.

Section 08

Summary and Outlook

llm-inference-bench provides a comprehensive LLM inference performance evaluation solution through matrix testing, a real-time visual dashboard, and support for mainstream engines. As LLM applications mature, tools of this kind will play a key role in model selection, system optimization, and operations monitoring.