# vLM-LLM-Benchmark: A Six-Dimensional Benchmark Framework for Production-Grade Model Evaluation

> Introducing a reproducible benchmark tool for vLLM that comprehensively evaluates VLMs and LLMs across six dimensions: accuracy, latency, throughput, concurrency, stability, and token budget.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-25T03:14:32.000Z
- Last activity: 2026-04-25T03:23:49.264Z
- Heat: 163.8
- Keywords: benchmarking, vLLM, LLM evaluation, VLM, performance testing, model selection, Qwen, GPU optimization, production deployment, throughput testing
- Page URL: https://www.zingnex.cn/en/forum/thread/vlm-llm-benchmark
- Canonical: https://www.zingnex.cn/forum/thread/vlm-llm-benchmark

---

## [Introduction] vLM-LLM-Benchmark: A Six-Dimensional Benchmark Framework for Production-Grade Model Evaluation

Introducing vLM-LLM-Benchmark, a reproducible benchmark tool for vLLM. It comprehensively evaluates LLMs and VLMs across six dimensions (accuracy, latency, throughput, concurrency, stability, and token budget), addressing the complex trade-offs behind model-replacement decisions in production environments.

## Real-World Dilemmas in Model Evaluation

Traditional model evaluation stops at a single accuracy metric or theoretical performance figures, which fails to meet the complex needs of production environments. Typical failure modes include:

- numeric offset errors (e.g., recognizing "120 yuan" as "1200 yuan");
- first-token latency above 2 seconds, which degrades user experience;
- single-user tests that reveal nothing about stability under concurrency;
- performance degradation over time caused by memory leaks;
- silent truncation that quietly loses information.

The six-dimensional system of vLM-LLM-Benchmark provides reliable support for production decisions in the face of these risks.

## Detailed Explanation of the Six-Dimensional Evaluation System

1. **Accuracy**: against golden-standard datasets, tests classification precision, entity recall, fact recall, and forbidden-output detection.
2. **First Token Latency (TTFT)**: records P50/P95 percentiles; latency above 2 seconds is treated as impairing user experience (a measurement sketch follows this list).
3. **Throughput**: tokens processed per second under sustained load, which determines users per node, capacity planning, and cost-effectiveness.
4. **Concurrency Capability**: tests success rate and latency distribution at concurrency levels of 1/5/10/30/50 to reveal stability and bottlenecks under pressure.
5. **Stability**: runs continuously for 30 minutes and compares latency drift between the first and last 5 minutes to detect issues such as memory leaks.
6. **Token Budget**: analyzes input/output token distributions and truncation rate for cost monitoring, silent-truncation detection, and configuration optimization.
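
The TTFT dimension can be probed directly against any OpenAI-compatible streaming endpoint. Here is a minimal sketch, not the tool's actual implementation: the URL, model name, payload, and sample count are assumptions, and the real harness may measure TTFT differently.

```python
# Minimal TTFT probe against an OpenAI-compatible streaming endpoint.
# Assumptions: server on localhost:8001 (the VLM primary below), SSE streaming.
import statistics
import time

import requests

URL = "http://localhost:8001/v1/chat/completions"
PAYLOAD = {
    "model": "Qwen3-VL-8B-Instruct",
    "messages": [{"role": "user", "content": "Reply with one word: ready?"}],
    "max_tokens": 16,
    "stream": True,  # server streams SSE chunks; the first chunk marks TTFT
}

def ttft_seconds() -> float:
    """Time from request start to the first streamed token chunk."""
    t0 = time.perf_counter()
    with requests.post(URL, json=PAYLOAD, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith(b"data:"):  # first SSE payload line
                return time.perf_counter() - t0
    raise RuntimeError("stream ended before any token arrived")

samples = [ttft_seconds() for _ in range(20)]
cuts = statistics.quantiles(samples, n=20)  # cut points at 5%, 10%, ..., 95%
print(f"TTFT p50={cuts[9]:.2f}s  p95={cuts[-1]:.2f}s  (>2s flags degraded UX)")
```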

## Reference Model Matrix

The framework provides default configurations for four reference models:

| Role | Model | Quantization | Port | VRAM | Minimum Hardware |
|---|---|---|---|---|---|
| VLM primary | Qwen3-VL-8B-Instruct | BF16 | 8001 | 20 GB | A100-40G |
| VLM baseline | Qwen2.5-VL-7B-Instruct | BF16 | 8002 | 18 GB | A100-40G |
| LLM primary | Qwen3-30B-A3B-Instruct-2507-FP8 | FP8 | 9001 | 35 GB | H100-80G |
| LLM flagship | Qwen3-235B-A22B-Instruct-2507-FP8 | FP8 | 9002 | 240 GB | 8×H100-80G |
Although the flagship MoE model has 235B total parameters, only about 22B are activated per forward pass, so its latency can compete with smaller dense models while delivering higher quality.
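
Before running any benchmarks, it is worth confirming that each reference endpoint is actually serving. A minimal liveness sketch using the ports from the matrix and the OpenAI-compatible `GET /v1/models` route that vLLM exposes; the localhost host is an assumption.

```python
# Minimal liveness check for the four reference endpoints via GET /v1/models,
# which vLLM's OpenAI-compatible server provides. Host assumed to be localhost.
import requests

PORTS = {8001: "VLM primary", 8002: "VLM baseline",
         9001: "LLM primary", 9002: "LLM flagship"}

for port, role in PORTS.items():
    try:
        resp = requests.get(f"http://localhost:{port}/v1/models", timeout=5)
        served = [m["id"] for m in resp.json().get("data", [])]
        print(f"{role} (:{port}) serving: {served}")
    except requests.RequestException as exc:
        print(f"{role} (:{port}) unreachable: {exc}")
```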

## Usage and Result Interpretation

**Offline Deployment**:

1. Download resources on a networked machine: `git clone https://github.com/qiurui144/vlm-llm-benchmark.git`, then `MODEL_SET=standard bash scripts/prepare_offline.sh`.
2. Package the result and transfer it to the offline GPU host.
3. Deploy: `bash scripts/bootstrap.sh`, then `bash run.sh`.

**Test Execution**: compare models head-to-head (`run_benchmark.py --model ...`), run an LLM-only concurrency test, smoke-test the flagship model, and so on; a concurrency-sweep sketch follows.
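
A concurrency sweep in the spirit of the 1/5/10/30/50 levels can be approximated with a thread pool. This is a minimal sketch, not the harness's code: the URL, model name, payload, and requests-per-level are assumptions.

```python
# Minimal concurrency sweep at the post's 1/5/10/30/50 levels.
# Assumptions: LLM primary on localhost:9001; 4 requests per worker slot.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:9001/v1/chat/completions"
PAYLOAD = {
    "model": "Qwen3-30B-A3B-Instruct-2507-FP8",
    "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}],
    "max_tokens": 64,
}

def one_request(_: int) -> tuple[bool, float]:
    """Return (success, wall-clock latency in seconds) for a single call."""
    t0 = time.perf_counter()
    try:
        ok = requests.post(URL, json=PAYLOAD, timeout=120).ok
    except requests.RequestException:
        ok = False
    return ok, time.perf_counter() - t0

for level in (1, 5, 10, 30, 50):
    with ThreadPoolExecutor(max_workers=level) as pool:
        results = list(pool.map(one_request, range(level * 4)))
    latencies = [lat for _, lat in results]          # includes failed calls
    success = sum(ok for ok, _ in results) / len(results)
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile
    print(f"concurrency={level:>2}: success={success:.0%}  p95={p95:.2f}s")
```
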
**Result Interpretation**: the tool generates Markdown reports; pass/warning/failure verdicts are driven by thresholds in `golden/expectations.json`, and return codes 0/1/2 can gate CI/CD pipelines (see the sketch below).
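
A minimal sketch of gating CI on those return codes, assuming the `run_benchmark.py --model` invocation shown above; the model argument value and any semantics beyond 0/1/2 = pass/warning/failure are assumptions.

```python
# Minimal CI gate on the benchmark's return codes (0=pass, 1=warning, 2=failure).
# The "--model" value is a placeholder, not a documented model name.
import subprocess
import sys

proc = subprocess.run([sys.executable, "run_benchmark.py", "--model", "my-candidate"])
if proc.returncode == 0:
    print("PASS: thresholds in golden/expectations.json satisfied")
elif proc.returncode == 1:
    print("WARNING: soft thresholds exceeded; review the Markdown report")
else:
    print("FAILURE: hard thresholds violated; blocking this pipeline stage")
    sys.exit(proc.returncode)  # propagate so CI marks the job failed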

## Technical Implementation Highlights

1. **Provider-agnostic design**: interacts over HTTP with OpenAI-compatible endpoints, supporting vLLM, SGLang, LMDeploy, and others.
2. **Customizable model configuration**: edit `models.yaml` to add models (name, HuggingFace repository, port, role, etc.); a sketch of such an entry follows this list.
3. **Golden-standard dataset**: users are encouraged to build their own from real business scenarios; the nine default cases are for demonstration only.
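
Here is a sketch of what a `models.yaml` entry might look like, built only from the fields the post names (name, HuggingFace repository, port, role); the exact key names and nesting are assumptions about the actual schema.

```python
# Sketch of a hypothetical models.yaml entry using only the fields the post
# names; the key names and structure are assumptions, not the repo's schema.
import yaml  # PyYAML

ENTRY = """
models:
  - name: qwen3-vl-8b
    repo: Qwen/Qwen3-VL-8B-Instruct
    port: 8001
    role: vlm-primary
"""

for model in yaml.safe_load(ENTRY)["models"]:
    print(f'{model["role"]}: {model["repo"]} -> port {model["port"]}')
```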

## Production Decision Support and Applicable Scenarios

**Core Value**: Answer "Can model X replace model Y?" to help teams quantify upgrade risks, optimize resource allocation, ensure user experience, and control costs.
**Applicable Scenarios**: Model selection decisions, version upgrade verification, hardware planning, continuous integration.

## Conclusion

Against the backdrop of rapid AI model iteration, vLM-LLM-Benchmark provides technical teams with a scientific, comprehensive, and reproducible decision-making tool through its six-dimensional evaluation system. It bridges theoretical performance and actual production needs, making it an indispensable support tool for model upgrades or selection.
