Zing Forum

GPUBench: A Single-GPU Inference Benchmark Tool for vLLM and Latency-Throughput Knee Point Analysis

GPUBench is a single-GPU large language model (LLM) inference benchmark framework designed specifically for vLLM. It uses a load generator that handles coordinated omission correctly, correlates service latency with GPU telemetry, locates the latency-throughput knee point, and cross-validates its results against vLLM's official bench serve tool.

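Tags: LLM inference, vLLM, GPU benchmarking, performance analysis, latency optimization, throughput, coordinated omission, knee point detection, large model deployment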
Published 2026-06-14 07:16 | Recent activity 2026-06-14 07:20 | Estimated read 8 min

Section 01

GPUBench Introduction: Core Overview of the Single-GPU Inference Benchmark Tool for vLLM

GPUBench is a single-GPU large language model (LLM) inference benchmark framework designed specifically for vLLM. Its core features are a load generator that handles coordinated omission correctly, correlation of service latency with GPU telemetry, location of the latency-throughput knee point, and cross-validation against vLLM's official bench serve tool. Original author/maintainer: Saibernard. Source platform: GitHub. Project link: https://github.com/Saibernard/llm_inference_benchmarking. Release date: 2026-06-13. The following sections cover its background, methods, validation mechanisms, and other details.

Section 02

Background: Pain Points of Traditional Benchmark Tools and the Birth of GPUBench

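Traditional benchmark tools often suffer from the 'coordinated omission' problem: when the server slows down, a client that waits for each response before sending the next request (or that measures latency from the actual send time rather than the intended one) silently skips or delays the requests that should have been sent during the stall. The slowest periods are therefore under-sampled, measured latency comes out artificially low, and the numbers fail to reflect real user experience. GPUBench was built to address this pain point and report realistic service latency.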

Section 03

Core Methods and Metrics

Core Methods

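GPUBench schedules requests by absolute arrival time drawn from a Poisson process: it precomputes the intended arrival time of each request and records the difference between the intended and actual send times. Because the schedule does not depend on how quickly earlier responses return, a slow server cannot throttle the offered load, which avoids the coordinated omission problem.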
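The scheduling idea above can be sketched in a few lines of Python. This is an illustrative sketch, not GPUBench's actual code: the function names `poisson_schedule` and `corrected_latency` are hypothetical, and the seed and rate are arbitrary.

```python
import random


def poisson_schedule(rate_per_s, n_requests, seed=0):
    """Precompute intended send offsets (seconds from run start).

    Inter-arrival gaps of a Poisson process are exponentially
    distributed with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)
    t = 0.0
    offsets = []
    for _ in range(n_requests):
        t += rng.expovariate(rate_per_s)  # exponential gap
        offsets.append(t)
    return offsets


def corrected_latency(intended_send, actual_send, completed):
    """Return (latency, lag), both in seconds.

    Latency is measured from the intended send time, so time a
    request spent waiting to be sent is charged to the service
    rather than dropped. Lag is how late the request left.
    """
    lag = actual_send - intended_send
    return completed - intended_send, lag
```

A load generator built this way sleeps until each precomputed offset and sends regardless of outstanding responses; a growing lag value signals that the client itself, not the server, is the bottleneck.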

Key Metrics

  • Latency categories: TTFT (Time to First Token, including prefill and queue waiting), TPOT/ITL (Time per Output Token/Inter-Token Latency), E2E Latency (end-to-end latency, providing P50/P95/P99 percentiles)
  • Throughput categories: Throughput (output tokens/sec, total tokens/sec, requests/sec), Goodput (throughput of requests meeting SLO)
  • GPU telemetry: Utilization, memory usage, power consumption, KV Cache occupancy
  • Reliability: Timeouts, HTTP errors, truncated streams, and other failures are counted by category
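As a rough illustration of the goodput metric listed above, the following Python sketch counts only successful requests that meet both a TTFT and an E2E latency SLO. The `RequestResult` record, the field names, and the choice of SLO dimensions are assumptions made for this example; the source does not specify how GPUBench defines its SLO.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ok: bool            # False for timeouts, HTTP errors, truncated streams
    ttft_s: float       # time to first token, seconds
    e2e_s: float        # end-to-end latency, seconds
    output_tokens: int


def goodput_rps(results, window_s, ttft_slo_s, e2e_slo_s):
    """Requests per second that succeeded AND met both SLOs."""
    good = sum(
        1
        for r in results
        if r.ok and r.ttft_s <= ttft_slo_s and r.e2e_s <= e2e_slo_s
    )
    return good / window_s
```

Goodput can never exceed plain request throughput; the gap between the two shows how much of the served traffic was too slow to be useful.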

Section 04

Cross-Validation Mechanism: Ensuring Result Credibility

GPUBench ensures result credibility through triple cross-validation:

  1. vLLM official bench serve: Under the same parameters, GPUBench values must be consistent with the official tool
  2. Server /metrics endpoint: Validate internal histogram data
  3. Self-computed statistics: Window-based throughput calculation, with quantiles computed by numpy.percentile (guarded by a minimum sample size so that a P99 is not reported from too few samples)

If the three sources disagree, something is wrong, either in the tool or in the system under test.
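The minimum-sample-size guard on quantiles mentioned in item 3 might look like the sketch below. The function name `guarded_percentile` and the default threshold of 100 samples are assumptions for illustration; the source does not state the threshold GPUBench uses.

```python
import numpy as np


def guarded_percentile(samples, q, min_samples=100):
    """Return the q-th percentile, or None if there are too few samples.

    With fewer than ~100 samples, a "P99" is effectively just the
    maximum, so reporting it as a percentile would be misleading.
    """
    if len(samples) < min_samples:
        return None
    return float(np.percentile(samples, q))
```

Returning `None` rather than a number forces downstream reports to show the gap explicitly instead of plotting an unreliable tail value.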

Section 05

Knee Point Detection: Finding the Performance Critical Point

Knee Point Definition

The knee point is the critical point at which the performance curve shifts from linear throughput growth with stable latency to sharply rising latency with flat or declining throughput.

Detection Method

GPUBench sweeps request rates, concurrency levels, input lengths, and output lengths to plot a complete performance curve and locate the knee point on it.
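The source does not say which rule GPUBench applies to pick the knee from the swept curve. One simple heuristic, shown here purely as an assumption, is to take the last sweep point before P99 latency exceeds a multiple of its low-load baseline or achieved throughput falls noticeably behind the offered rate. The function name `find_knee` and both thresholds are hypothetical.

```python
def find_knee(points, latency_factor=2.0, min_efficiency=0.9):
    """Locate a knee in a rate sweep (assumed heuristic).

    points: list of (offered_rps, achieved_rps, p99_latency_s),
            sorted by increasing offered_rps.
    Returns the last point that is still "healthy", or None if
    the list is empty or even the first point is unhealthy.
    """
    if not points:
        return None
    baseline_p99 = points[0][2]  # latency at the lowest load
    knee = None
    for offered, achieved, p99 in points:
        latency_ok = p99 <= latency_factor * baseline_p99
        throughput_ok = achieved >= min_efficiency * offered
        if not (latency_ok and throughput_ok):
            break  # first unhealthy point: stop scanning
        knee = (offered, achieved, p99)
    return knee
```

Curvature-based detectors are an alternative; a threshold rule like this one is easier to explain in a capacity review, at the cost of having two tunable parameters.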

Importance

  • Before the knee point: Healthy resource utilization, good user experience
  • After the knee point: Queue buildup, latency spikes, deteriorated user experience

Locating the knee point helps operations staff determine the safe operating boundary of the service.

Section 06

Engineering Details: Statistical Integrity and Reproducibility

Statistical Integrity

  • Window-based throughput calculation (not simple average of request rates)
  • TPOT calculation formula: (E2E - TTFT) / (output_tokens - 1)
  • Quantile calculation uses numpy.percentile with minimum sample size protection
  • Failed requests are tracked separately and not mixed into latency statistics
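The TPOT formula and the window-based throughput idea listed above can be sketched as follows. The function names are hypothetical, and the interpretation of "window-based" (counting only work that completes inside a fixed measurement window) is an assumption, as is returning `None` for single-token outputs where the formula would divide by zero.

```python
def tpot_s(e2e_s, ttft_s, output_tokens):
    """Time per output token: (E2E - TTFT) / (output_tokens - 1).

    The first token is covered by TTFT, so only the remaining
    output_tokens - 1 tokens are averaged. Undefined for < 2 tokens.
    """
    if output_tokens < 2:
        return None
    return (e2e_s - ttft_s) / (output_tokens - 1)


def window_throughput(completions, window_start_s, window_end_s):
    """Return (output tokens/sec, requests/sec) inside a time window.

    completions: iterable of (finish_time_s, output_tokens) for
    successful requests only; failures are tracked elsewhere.
    """
    duration = window_end_s - window_start_s
    in_window = [
        tokens
        for finish, tokens in completions
        if window_start_s <= finish < window_end_s
    ]
    return sum(in_window) / duration, len(in_window) / duration
```

Measuring over a fixed window avoids inflating throughput with warm-up or drain periods, where the server is not under the intended steady load.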

Reproducibility

  • Provides Dockerfile and docker-compose configurations
  • Environment variable template (.env.example)
  • Detailed configuration file directory (configs/)
  • Jupyter notebooks for result analysis

Section 07

Application Scenarios: Practical Value of GPUBench

GPUBench is suitable for the following scenarios:

  1. Model selection comparison: Compare inference performance of different models on the same hardware
  2. Hardware selection evaluation: Test the acceleration effect of new GPUs on specific models
  3. Service capacity planning: Determine the maximum concurrency under a given latency SLO
  4. Configuration tuning: Validate the impact of vLLM scheduling strategies, KV Cache management, and other parameters
  5. Regression testing: Monitor performance degradation in CI/CD pipelines

Section 08

Conclusion: Evolution of LLM Inference Testing from 'Usable' to 'Trustworthy'

GPUBench represents the evolution of LLM inference performance testing from 'usable' to 'trustworthy'. It is not just a benchmark script but a complete measurement methodology:

  • Correct handling of coordinated omission yields realistic latency data
  • Triple cross-validation makes the results credible
  • Knee point analysis provides an intuitive basis for capacity planning
  • GPU telemetry correlation helps locate performance bottlenecks

For teams deploying or optimizing LLM inference services, GPUBench provides a more reliable basis for decisions than simple QPS/TPS tests, and reliable measurements are a prerequisite for sound architectural decisions in complex AI infrastructure.