# llm-inference-bench: An LLM Inference Performance Benchmark Tool with Real-Time Dashboard

> A benchmark tool for LLM inference decoding throughput that supports SGLang and vLLM, with a Rich TUI real-time dashboard; it measures token generation speed across different concurrency levels and context lengths.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-27T20:10:49.000Z
- Last activity: 2026-04-27T20:19:16.098Z
- Heat: 161.9
- Keywords: LLM, benchmark, inference, vLLM, SGLang, throughput, performance-testing, TUI, GPU-monitoring
- Page URL: https://www.zingnex.cn/en/forum/thread/llm-inference-bench-llm
- Canonical: https://www.zingnex.cn/forum/thread/llm-inference-bench-llm
- Markdown source: floors_fallback

---

## [Introduction] llm-inference-bench: An LLM Inference Performance Benchmark Tool with Real-Time Dashboard

As LLM deployment becomes widespread, traditional benchmarking tools fall short: they report single-dimensional metrics and struggle to reflect real production performance. llm-inference-bench was developed as a solution. It is a benchmark tool built specifically for LLM inference decoding throughput, supports the mainstream engines SGLang and vLLM, and ships a Rich TUI real-time dashboard. It measures token generation speed across different concurrency levels and context lengths, covering the key performance dimensions through matrix testing.

## Project Background and Design Philosophy

LLM inference performance evaluation spans multiple dimensions, including concurrent processing, context length, and prefill efficiency, yet existing tools suffer from limited scenario coverage and results that are hard to interpret. llm-inference-bench is designed to cover the key dimensions via matrix testing, combining concurrency levels with context lengths, and to present the results visually. Its core philosophy is "matrix testing + real-time monitoring".
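
To make the matrix idea concrete, here is a minimal Python sketch of how such a test grid can be enumerated. The concurrency levels and context lengths shown are illustrative placeholders, not the tool's actual defaults:

```python
# A minimal sketch of the matrix-testing idea: every combination of
# concurrency level and context length becomes one benchmark cell.
# The parameter values here are illustrative, not the tool's defaults.
from itertools import product

concurrency_levels = [1, 4, 16, 64]   # simultaneous request streams
context_lengths = [512, 2048, 8192]   # prompt tokens per request

def build_test_matrix():
    """Return one (concurrency, context_length) cell per combination."""
    return list(product(concurrency_levels, context_lengths))

for concurrency, ctx_len in build_test_matrix():
    print(f"cell: concurrency={concurrency}, context={ctx_len} tokens")
```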

## Core Features: Three-Layer Testing Mechanism

The tool offers three layers of testing:
1. **Prefill Test**: measures input-processing speed, recording prompt_tokens and Time to First Token (TTFT); suited to long-context scenarios such as RAG (see the TTFT sketch after this list).
2. **Continuous Decoding Test**: the default mode; keeps concurrency saturated for a fixed duration (30 seconds) to reflect real throughput under sustained load.
3. **Burst/End-to-End Decoding Test**: an optional mode; simulates burst traffic with a fixed number of requests to evaluate peak-response characteristics.
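
As an illustration of the prefill layer, the following hedged sketch measures TTFT against an OpenAI-compatible endpoint by streaming a single completion and timing the first content chunk. The base URL, model name, and prompt are placeholders, and the tool's own request logic may differ:

```python
# A hedged sketch of a prefill/TTFT measurement against an
# OpenAI-compatible endpoint; base_url, model, and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_ttft(model: str, prompt: str) -> float:
    """Stream one completion and return seconds until the first token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=16,
    )
    for chunk in stream:
        # The first chunk carrying content marks the end of prefill.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

print(f"TTFT: {measure_ttft('my-model', 'x' * 4000):.3f}s")
```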

## Highlights of Real-Time Dashboard and Hardware Monitoring

The real-time TUI dashboard, built on the Rich library, uses an adaptive layout: on wide screens it displays metrics such as tok/s and TTFT/ITL; a hardware monitoring panel shows real-time GPU temperature, VRAM utilization, power draw, and CPU status; and an event log panel records events such as warm-up and readiness, so the test run can be followed as it happens.
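
Below is a minimal sketch of the hardware-panel idea, assuming an NVIDIA GPU, the pynvml bindings, and Rich's Live API; the actual dashboard layers throughput metrics, the adaptive layout, and the event log on top of something like this:

```python
# A minimal sketch of a Rich-based live GPU panel, assuming an NVIDIA
# GPU and the pynvml bindings. The real dashboard adds throughput
# metrics, adaptive layout, and an event log on top of this idea.
import time
import pynvml
from rich.live import Live
from rich.table import Table

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def gpu_panel() -> Table:
    """Build one refresh of the GPU status table."""
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW -> W
    table = Table(title="GPU monitor")
    table.add_column("temp")
    table.add_column("VRAM")
    table.add_column("power")
    table.add_row(f"{temp}°C",
                  f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB",
                  f"{power_w:.0f} W")
    return table

with Live(gpu_panel(), refresh_per_second=2) as live:
    for _ in range(20):          # run a few refresh cycles as a demo
        time.sleep(0.5)
        live.update(gpu_panel())
```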

## Engine Support and Remote API Compatibility

The tool supports the SGLang and vLLM engines with automatic engine-type detection, and can optionally pull internal metrics from their Prometheus /metrics endpoints. It is also compatible with remote services that expose the OpenAI API format (e.g., OpenRouter, Together AI), so a cloud service can be benchmarked with just an API key and a model name.
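
One plausible way to auto-detect the engine type is to probe the server's Prometheus /metrics endpoint and inspect the metric-name prefixes. The sketch below assumes the `vllm:`/`sglang:` prefixes those engines commonly use; the tool's own heuristic may differ:

```python
# A hedged sketch of engine auto-detection by probing the /metrics
# endpoint; the prefix strings reflect how vLLM and SGLang commonly
# name their Prometheus metrics, but the tool's heuristic may differ.
import requests

def detect_engine(base_url: str) -> str:
    """Guess the serving engine from its exposed metric names."""
    try:
        body = requests.get(f"{base_url}/metrics", timeout=5).text
    except requests.RequestException:
        return "unknown (no /metrics; treat as a generic OpenAI API)"
    if "vllm:" in body:
        return "vllm"
    if "sglang:" in body:
        return "sglang"
    return "unknown"

print(detect_engine("http://localhost:8000"))
```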

## Intelligent Features and Usability Design

Several intelligent features are built in:

- **Dynamic warm-up**: prefers scheduler metrics, falling back to the OpenAI interface.
- **Automatic KV cache budget detection**: skips test cells that would exceed capacity.
- **Effective concurrency detection**: flags cases where the achieved concurrency is lower than requested (see the sketch below).
- **JSON output**: saves structured results for later analysis.
- **Automatic update checks**: looks for new versions on startup.
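
As one example, effective concurrency can be estimated as the average number of in-flight requests (total per-request busy time divided by wall-clock time) and compared with what was requested. The sketch below uses hypothetical field names and a 90% threshold chosen purely for illustration:

```python
# A sketch of one way to flag under-delivered concurrency: compute the
# average number of in-flight requests (sum of per-request durations
# divided by wall-clock time) and compare it to what was asked for.
# The threshold and JSON field names are illustrative assumptions.
import json

def effective_concurrency(request_durations: list[float],
                          wall_time: float) -> float:
    """Average in-flight requests over the whole run."""
    return sum(request_durations) / wall_time

requested = 16
durations = [2.0] * 40          # e.g. 40 requests, 2 s each
wall = 10.0                     # over a 10 s window
eff = effective_concurrency(durations, wall)

result = {
    "requested_concurrency": requested,
    "effective_concurrency": round(eff, 2),
    "under_delivered": eff < 0.9 * requested,   # flag if <90% achieved
}
print(json.dumps(result, indent=2))             # structured JSON output
```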

## Usage Scenarios and Practical Value

The tool fits scenarios such as comparing inference services during selection, tuning deployment parameters, capacity planning, and performance regression testing. It helps teams determine the optimal batch size and KV cache strategy when deploying open-source models, and it provides an objective way to compare third-party API services.

## Summary and Outlook

llm-inference-bench delivers a comprehensive LLM inference performance evaluation workflow through matrix testing, a real-time visual dashboard, and support for mainstream engines. As LLM applications mature, tools like this will play a key role in model selection, system optimization, and operational monitoring.
