vLM-LLM-Benchmark: A Six-Dimensional Benchmark Framework for Production-Grade Model Evaluation

Introducing a reproducible benchmark tool for vLLM that comprehensively evaluates VLM and LLM models across six dimensions: accuracy, latency, throughput, concurrency, stability, and token budget.

Tags: benchmarking · vLLM · LLM evaluation · VLM · performance testing · model selection · Qwen · GPU optimization · production deployment · throughput testing
Published 2026-04-25 11:14 · Recent activity 2026-04-25 11:23 · Estimated read 7 min

Section 01

[Introduction] vLM-LLM-Benchmark: A Six-Dimensional Benchmark Framework for Production-Grade Model Evaluation

Introducing vLM-LLM-Benchmark, a reproducible benchmark tool for vLLM. It comprehensively evaluates LLMs and VLMs across six dimensions (accuracy, latency, throughput, concurrency, stability, and token budget), addressing the complex trade-offs behind model replacement decisions in production environments.

Section 02

Real-World Dilemmas in Model Evaluation

Traditional model evaluation stops at a single accuracy metric or theoretical performance figures, which fails to capture the complex needs of production environments: numeric misreads (e.g., recognizing "120 yuan" as "1200 yuan"), first-token latency above 2 seconds that impairs user experience, single-user tests that say nothing about stability under concurrency, performance degradation caused by memory leaks, silent truncation that loses information, and so on. The six-dimensional system of vLM-LLM-Benchmark provides reliable support for production decisions.

Section 03

Detailed Explanation of the Six-Dimensional Evaluation System

1. Accuracy: based on a golden-standard dataset, tests classification precision, entity recall, fact recall, and forbidden-output detection.
2. First Token Latency (TTFT): records P50/P95 percentiles; latency above 2 seconds is considered to impair user experience (see the measurement sketch after this list).
3. Throughput: tokens processed per second under sustained load, which determines users per node, capacity planning, and cost-effectiveness.
4. Concurrency: tests success rate and latency distribution at concurrency levels of 1/5/10/30/50 to reveal stability and bottlenecks under pressure.
5. Stability: runs continuously for 30 minutes and compares latency drift between the first and last 5 minutes to detect issues such as memory leaks.
6. Token Budget: analyzes input/output token distribution and truncation rate for cost monitoring, silent-truncation detection, and configuration optimization.
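
To make the latency dimension concrete, here is a minimal sketch of measuring TTFT percentiles against an OpenAI-compatible streaming endpoint. The endpoint URL, model name, prompt, and sample count are illustrative assumptions, not vLM-LLM-Benchmark's actual code:

```python
import statistics
import time

import requests

# Illustrative assumptions: the endpoint URL, model name, prompt, and
# sample count are placeholders, not vLM-LLM-Benchmark's actual code.
ENDPOINT = "http://localhost:8001/v1/chat/completions"
MODEL = "Qwen3-VL-8B-Instruct"

def measure_ttft(prompt: str) -> float:
    """Seconds from sending the request to the first streamed chunk
    (a practical proxy for time-to-first-token)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 64,
    }
    start = time.monotonic()
    with requests.post(ENDPOINT, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty server-sent-events line
                return time.monotonic() - start
    raise RuntimeError("stream ended before any token arrived")

samples = [measure_ttft("Summarize this invoice.") for _ in range(20)]
cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
print(f"TTFT P50={cuts[49]:.3f}s  P95={cuts[94]:.3f}s  (flag if P95 > 2s)")
```

Streaming matters here: without stream=True the server returns only a complete response, so first-token time cannot be observed separately from total latency.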
Section 04

Reference Model Matrix

The framework provides default configurations for 4 reference models:

| Role | Model | Quantization | Port | VRAM | Minimum Hardware |
|------|-------|--------------|------|------|------------------|
| VLM Main Selection | Qwen3-VL-8B-Instruct | BF16 | 8001 | 20GB | A100-40G |
| VLM Baseline | Qwen2.5-VL-7B-Instruct | BF16 | 8002 | 18GB | A100-40G |
| LLM Main Selection | Qwen3-30B-A3B-Instruct-2507-FP8 | FP8 | 9001 | 35GB | H100-80G |
| LLM Flagship | Qwen3-235B-A22B-Instruct-2507-FP8 | FP8 | 9002 | 240GB | 8×H100-80G |
Although the flagship MoE model has 235B total parameters, only about 22B are activated per forward pass, so its latency can compete with small dense models while delivering higher quality.
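
The matrix can also feed hardware planning programmatically. Below is a tiny sketch; the data is transcribed from the table above, while the dictionary layout and helper function are hypothetical conveniences, not part of the tool:

```python
# Data transcribed from the reference matrix above; the dict layout and
# helper are hypothetical, not part of vLM-LLM-Benchmark.
REFERENCE_MODELS = {
    # name: (quantization, port, vram_gb, minimum_hardware)
    "Qwen3-VL-8B-Instruct":              ("BF16", 8001, 20,  "A100-40G"),
    "Qwen2.5-VL-7B-Instruct":            ("BF16", 8002, 18,  "A100-40G"),
    "Qwen3-30B-A3B-Instruct-2507-FP8":   ("FP8",  9001, 35,  "H100-80G"),
    "Qwen3-235B-A22B-Instruct-2507-FP8": ("FP8",  9002, 240, "8xH100-80G"),
}

def models_fitting(vram_budget_gb: int) -> list[str]:
    """Which reference models fit within a given VRAM budget?"""
    return [name for name, (_, _, vram_gb, _) in REFERENCE_MODELS.items()
            if vram_gb <= vram_budget_gb]

print(models_fitting(40))  # both VLMs and the 30B FP8 LLM fit in 40GB
```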
Section 05

Usage and Result Interpretation

Offline deployment:

1. Download resources on a networked machine:
   git clone https://github.com/qiurui144/vlm-llm-benchmark.git
   MODEL_SET=standard bash scripts/prepare_offline.sh
2. Package the result and transfer it to the offline GPU host.
3. Deploy:
   bash scripts/bootstrap.sh
   bash run.sh

Test execution: compare models (run_benchmark.py --model ...), test only LLM concurrency, run a smoke test for the flagship model, and so on. Result interpretation: the tool generates Markdown reports; pass/warning/failure verdicts are driven by thresholds in golden/expectations.json, and return codes 0/1/2 allow integration into CI/CD (a sketch of such a hook follows this paragraph).
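
As an illustration of that CI/CD hook, here is a minimal sketch that runs the benchmark as a subprocess and maps the documented exit codes to a pipeline decision. The exact script path and flags are assumptions based on the commands above:

```python
import subprocess
import sys

# Hypothetical invocation; adjust the script path and flags to your checkout.
result = subprocess.run(
    [sys.executable, "run_benchmark.py", "--model", "qwen3-vl-8b"],
)

# Documented return codes: 0 = pass, 1 = warning, 2 = failure.
if result.returncode == 0:
    print("benchmark passed; safe to promote")
elif result.returncode == 1:
    print("benchmark warnings; review the Markdown report before promoting")
else:
    print("benchmark failed; blocking the pipeline")
    sys.exit(1)
```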

Section 06

Technical Implementation Highlights

1. Provider-agnostic design: interacts over HTTP with OpenAI-compatible endpoints, supporting vLLM, SGLang, LMDeploy, and others.
2. Customizable model configuration: edit models.yaml to add models (name, HuggingFace repository, port, role, etc.).
3. Golden-standard dataset: users are encouraged to build their own from real business scenarios; the default 9 cases are for demonstration only (see the scoring sketch after this list).
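
To illustrate what scoring a golden-standard case might look like, here is a minimal sketch of an entity-recall and forbidden-output check. The case structure, field names, and scoring function are assumptions for demonstration, not the tool's actual schema:

```python
# Hypothetical golden case; the real schema lives in the repo's golden/ files.
golden_case = {
    "prompt": "Extract the merchant and amount from this receipt.",
    "expected_entities": ["Starbucks", "120 yuan"],
    "forbidden_outputs": ["1200 yuan"],  # the numeric misread from Section 02
}

def score_case(model_output: str, case: dict) -> dict:
    """Entity recall plus a forbidden-output check via substring matching."""
    hits = [e for e in case["expected_entities"] if e in model_output]
    violations = [f for f in case["forbidden_outputs"] if f in model_output]
    return {
        "entity_recall": len(hits) / len(case["expected_entities"]),
        "forbidden_hits": violations,
    }

print(score_case("The receipt is from Starbucks for 120 yuan.", golden_case))
# -> {'entity_recall': 1.0, 'forbidden_hits': []}
```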
Section 07

Production Decision Support and Applicable Scenarios

Core Value: Answer "Can model X replace model Y?" to help teams quantify upgrade risks, optimize resource allocation, ensure user experience, and control costs. Applicable Scenarios: Model selection decisions, version upgrade verification, hardware planning, continuous integration.

Section 08

Conclusion

Against the backdrop of rapid AI model iteration, vLM-LLM-Benchmark provides technical teams with a scientific, comprehensive, and reproducible decision-making tool through its six-dimensional evaluation system. It bridges theoretical performance and actual production needs, making it an indispensable support tool for model upgrades or selection.