vLM-LLM-Benchmark: A Six-Dimensional Benchmark Framework for Production-Grade Model Evaluation

Introducing a reproducible benchmark tool for vLLM that comprehensively evaluates VLM and LLM models across six dimensions: accuracy, latency, throughput, concurrency, stability, and token budget.

Tags: benchmarking · vLLM · LLM evaluation · VLM · performance testing · model selection · Qwen · GPU optimization · production deployment · throughput testing
Published 2026-04-25 11:14 · Recent activity 2026-04-25 11:23 · Estimated read 7 min

Section 01

[Introduction] vLM-LLM-Benchmark: A Six-Dimensional Benchmark Framework for Production-Grade Model Evaluation

Introducing vLM-LLM-Benchmark, a reproducible benchmark tool for vLLM. It comprehensively evaluates LLMs and VLMs across six dimensions (accuracy, latency, throughput, concurrency, stability, and token budget), addressing the complex trade-offs behind model replacement decisions in production environments.

Section 02

Real-World Dilemmas in Model Evaluation

Traditional model evaluation stops at a single accuracy metric or theoretical performance figures, which fails to capture the complex needs of production environments: numeric misreads (e.g., recognizing "120 yuan" as "1200 yuan"), first-token latency above 2 seconds that impairs user experience, single-user tests that say nothing about stability under concurrency, performance degradation caused by memory leaks, silent truncation that loses information, and so on. The six-dimensional system of vLM-LLM-Benchmark provides reliable support for production decisions.

Section 03

Detailed Explanation of the Six-Dimensional Evaluation System

1. Accuracy: based on a golden-standard dataset, tests classification precision, entity recall, fact recall, and forbidden-output detection.
2. First Token Latency (TTFT): records P50/P95 percentiles; latency above 2 seconds is considered to impair user experience (see the measurement sketch after this list).
3. Throughput: tokens processed per second under sustained load, which determines users per node, capacity planning, and cost-effectiveness.
4. Concurrency: tests success rate and latency distribution at concurrency levels of 1/5/10/30/50 to reveal stability and bottlenecks under pressure.
5. Stability: runs continuously for 30 minutes and compares latency drift between the first and last 5 minutes to detect issues such as memory leaks.
6. Token Budget: analyzes input/output token distribution and truncation rate for cost monitoring, silent-truncation detection, and configuration optimization.
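
To make the latency dimension concrete, here is a minimal sketch of measuring TTFT percentiles against an OpenAI-compatible streaming endpoint. The endpoint URL, model name, prompt, and sample count are illustrative assumptions, not vLM-LLM-Benchmark's actual code:

```python
import statistics
import time

import requests

# Illustrative assumptions: the endpoint URL, model name, prompt, and
# sample count are placeholders, not vLM-LLM-Benchmark's actual code.
ENDPOINT = "http://localhost:8001/v1/chat/completions"
MODEL = "Qwen3-VL-8B-Instruct"

def measure_ttft(prompt: str) -> float:
    """Seconds from sending the request to the first streamed chunk
    (a practical proxy for time-to-first-token)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 64,
    }
    start = time.monotonic()
    with requests.post(ENDPOINT, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty server-sent-events line
                return time.monotonic() - start
    raise RuntimeError("stream ended before any token arrived")

samples = [measure_ttft("Summarize this invoice.") for _ in range(20)]
cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
print(f"TTFT P50={cuts[49]:.3f}s  P95={cuts[94]:.3f}s  (flag if P95 > 2s)")
```

Streaming matters here: without stream=True the server returns only a complete response, so first-token time cannot be observed separately from total latency.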
Section 04

Reference Model Matrix

The framework provides default configurations for 4 reference models:

| Role | Model | Quantization | Port | VRAM | Minimum Hardware |
|------|-------|--------------|------|------|------------------|
| VLM Main Selection | Qwen3-VL-8B-Instruct | BF16 | 8001 | 20GB | A100-40G |
| VLM Baseline | Qwen2.5-VL-7B-Instruct | BF16 | 8002 | 18GB | A100-40G |
| LLM Main Selection | Qwen3-30B-A3B-Instruct-2507-FP8 | FP8 | 9001 | 35GB | H100-80G |
| LLM Flagship | Qwen3-235B-A22B-Instruct-2507-FP8 | FP8 | 9002 | 240GB | 8×H100-80G |
Although the flagship MoE model has 235B total parameters, only about 22B are activated per forward pass, so its latency can compete with small dense models while delivering higher quality.
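
The matrix can also feed hardware planning programmatically. Below is a tiny sketch; the data is transcribed from the table above, while the dictionary layout and helper function are hypothetical conveniences, not part of the tool:

```python
# Data transcribed from the reference matrix above; the dict layout and
# helper are hypothetical, not part of vLM-LLM-Benchmark.
REFERENCE_MODELS = {
    # name: (quantization, port, vram_gb, minimum_hardware)
    "Qwen3-VL-8B-Instruct":              ("BF16", 8001, 20,  "A100-40G"),
    "Qwen2.5-VL-7B-Instruct":            ("BF16", 8002, 18,  "A100-40G"),
    "Qwen3-30B-A3B-Instruct-2507-FP8":   ("FP8",  9001, 35,  "H100-80G"),
    "Qwen3-235B-A22B-Instruct-2507-FP8": ("FP8",  9002, 240, "8xH100-80G"),
}

def models_fitting(vram_budget_gb: int) -> list[str]:
    """Which reference models fit within a given VRAM budget?"""
    return [name for name, (_, _, vram_gb, _) in REFERENCE_MODELS.items()
            if vram_gb <= vram_budget_gb]

print(models_fitting(40))  # both VLMs and the 30B FP8 LLM fit in 40GB
```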
Section 05

Usage and Result Interpretation

Offline deployment:

1. Download resources on a networked machine:
   git clone https://github.com/qiurui144/vlm-llm-benchmark.git
   MODEL_SET=standard bash scripts/prepare_offline.sh
2. Package the result and transfer it to the offline GPU host.
3. Deploy:
   bash scripts/bootstrap.sh
   bash run.sh

Test execution: compare models (run_benchmark.py --model ...), test only LLM concurrency, run a smoke test for the flagship model, and so on. Result interpretation: the tool generates Markdown reports; pass/warning/failure verdicts are driven by thresholds in golden/expectations.json, and return codes 0/1/2 allow integration into CI/CD (a sketch of such a hook follows this paragraph).
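
As an illustration of that CI/CD hook, here is a minimal sketch that runs the benchmark as a subprocess and maps the documented exit codes to a pipeline decision. The exact script path and flags are assumptions based on the commands above:

```python
import subprocess
import sys

# Hypothetical invocation; adjust the script path and flags to your checkout.
result = subprocess.run(
    [sys.executable, "run_benchmark.py", "--model", "qwen3-vl-8b"],
)

# Documented return codes: 0 = pass, 1 = warning, 2 = failure.
if result.returncode == 0:
    print("benchmark passed; safe to promote")
elif result.returncode == 1:
    print("benchmark warnings; review the Markdown report before promoting")
else:
    print("benchmark failed; blocking the pipeline")
    sys.exit(1)
```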

Section 06

Technical Implementation Highlights

1. Provider-agnostic design: interacts over HTTP with OpenAI-compatible endpoints, supporting vLLM, SGLang, LMDeploy, and others.
2. Customizable model configuration: edit models.yaml to add models (name, HuggingFace repository, port, role, etc.).
3. Golden-standard dataset: users are encouraged to build their own from real business scenarios; the default 9 cases are for demonstration only (see the scoring sketch after this list).
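
To illustrate what scoring a golden-standard case might look like, here is a minimal sketch of an entity-recall and forbidden-output check. The case structure, field names, and scoring function are assumptions for demonstration, not the tool's actual schema:

```python
# Hypothetical golden case; the real schema lives in the repo's golden/ files.
golden_case = {
    "prompt": "Extract the merchant and amount from this receipt.",
    "expected_entities": ["Starbucks", "120 yuan"],
    "forbidden_outputs": ["1200 yuan"],  # the numeric misread from Section 02
}

def score_case(model_output: str, case: dict) -> dict:
    """Entity recall plus a forbidden-output check via substring matching."""
    hits = [e for e in case["expected_entities"] if e in model_output]
    violations = [f for f in case["forbidden_outputs"] if f in model_output]
    return {
        "entity_recall": len(hits) / len(case["expected_entities"]),
        "forbidden_hits": violations,
    }

print(score_case("The receipt is from Starbucks for 120 yuan.", golden_case))
# -> {'entity_recall': 1.0, 'forbidden_hits': []}
```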
Section 07

Production Decision Support and Applicable Scenarios

Core Value: Answer "Can model X replace model Y?" to help teams quantify upgrade risks, optimize resource allocation, ensure user experience, and control costs. Applicable Scenarios: Model selection decisions, version upgrade verification, hardware planning, continuous integration.

Section 08

Conclusion

Against the backdrop of rapid AI model iteration, vLM-LLM-Benchmark provides technical teams with a scientific, comprehensive, and reproducible decision-making tool through its six-dimensional evaluation system. It bridges theoretical performance and actual production needs, making it an indispensable support tool for model upgrades or selection.