# NVIDIA LLM Inference Benchmark: A Comprehensive Comparative Study from Single Requests to Production-Level Workloads

> A systematic LLM inference engine benchmark framework comparing latency, throughput, and system behavior across Hugging Face Transformers, vLLM, and TensorRT-LLM, with multi-stage experiments spanning the RTX 3090 to the A100.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-29T04:41:42.000Z
- Last activity: 2026-04-29T04:56:16.495Z
- Popularity: 150.8
- Keywords: LLM inference, benchmarking, vLLM, TensorRT-LLM, GPU optimization, A100, RTX3090, throughput testing
- Page link: https://www.zingnex.cn/en/forum/thread/nvidia-llm
- Canonical: https://www.zingnex.cn/forum/thread/nvidia-llm
- Markdown source: floors_fallback

---

## [Main Floor/Introduction] Core Overview of NVIDIA LLM Inference Benchmark

This study uses a systematic benchmark framework to compare latency, throughput, and system behavior among three mainstream LLM inference engines: Hugging Face Transformers, vLLM, and TensorRT-LLM. The experiments cover hardware from the consumer-grade RTX 3090 to the data-center-grade A100 and are divided into five progressive stages (local prototype → configuration-driven → dual-engine comparison → three-engine comprehensive comparison → production-level workload testing), with the aim of giving developers and architects evidence-based guidance for engine selection.

## Project Background and Research Motivation

As LLMs move from research to production deployment, inference efficiency has become a cost-critical factor, yet developers often struggle to choose among the many engine options (e.g., HF Transformers, vLLM, TensorRT-LLM). The nvidia-llm-inference-bench project was created to evaluate these engines end to end, from local runs to production-level workloads, through five-stage experiments covering multiple hardware configurations and providing empirical reference points for deployments of different scales.

## Five-Stage Experimental Design Methodology

The experiments follow a phased, iterative methodology:
1. **Local Baseline Establishment**: Use distilgpt2 to verify pipeline correctness (prompt management, latency/throughput calculation, etc.);
2. **Configuration-Driven Framework**: Refactor the harness to be YAML-configuration-driven, enabling reproducible runs and aggregated summaries (an illustrative config sketch follows this list);
3. **Dual-Engine Comparison**: Compare HF Transformers and vLLM on the RTX 3090; vLLM shows lower latency and higher throughput;
4. **Three-Engine Comprehensive Comparison**: Add TensorRT-LLM and evaluate all three engines across different output lengths;
5. **Production-Level Workload Testing**: Simulate QPS-controlled traffic to test engine behavior under high concurrency (e.g., vLLM's saturation point, TensorRT-LLM's advantage under heavy load).
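
To make Stage 2 concrete, the sketch below shows what a YAML run configuration for such a harness might look like. The key names, values, and file paths are illustrative assumptions, not the project's actual schema:

```yaml
# Illustrative config sketch -- keys and paths are assumptions,
# not the nvidia-llm-inference-bench repo's actual schema.
engine: vllm                         # one of: hf, vllm, trtllm
model: Qwen/Qwen2.5-7B-Instruct
hardware_label: rtx3090              # recorded in the aggregated summary
prompts_file: prompts/default.jsonl
generation:
  max_new_tokens: 128                # fixed output length for comparability
  temperature: 0.0                   # greedy decoding for reproducibility
run:
  warmup_requests: 3
  repeats: 5
output_dir: results/stage2/
```

Driving every run from a single file like this is what makes results reproducible: repeating a benchmark only requires pointing the script at the same config.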

## Key Performance Evidence and Findings

### Single-Request Performance (Phase 4)
- **Throughput**: TensorRT-LLM (~50.7 tok/s at the default output length) > vLLM (~50.3 tok/s) > HF Transformers (~42-43 tok/s);
- **Latency**: TensorRT-LLM (~1.26 s at the default output length) edges out vLLM (~1.27 s), while HF Transformers is markedly slower (~1.50 s); a minimal measurement sketch follows.
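
For context on how such numbers are obtained, here is a minimal Python sketch of per-request latency and throughput measurement with Hugging Face Transformers. It mirrors the usual metric definitions (newly generated tokens divided by wall-clock generation time) but is not the project's actual harness:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# distilgpt2 is the small model used for the Stage 1 baseline.
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain continuous batching in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
latency = time.perf_counter() - start  # end-to-end generation latency

# Throughput = newly generated tokens / wall-clock generation time.
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"latency: {latency:.2f} s, throughput: {new_tokens / latency:.1f} tok/s")
```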

### Production Workload Performance (Phase 5)
- **RTX 3090**: vLLM scales linearly up to about 30 QPS; beyond that, latency rises sharply. Under high QPS, TensorRT-LLM cuts latency by 25-30% and raises throughput by 30-35% (see the load-generator sketch after this list);
- **A100**: Thanks to continuous batching, vLLM's maximum sustainable throughput (~49 QPS) far exceeds that of Triton + TensorRT-LLM (~36 QPS);
- **Triton + TRT-LLM**: Well suited to multi-model production pipelines, but its scheduling overhead becomes a bottleneck in single-model, high-concurrency scenarios.
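
As an illustration of the Phase 5 setup, QPS-controlled traffic can be approximated with an open-loop async client. The sketch below targets vLLM's OpenAI-compatible HTTP server; the endpoint, payload, and parameter values are assumptions for illustration, not the study's actual scripts:

```python
import asyncio
import time

import aiohttp  # assumed HTTP client; any async client would do

# Assumed vLLM OpenAI-compatible endpoint and payload.
URL = "http://localhost:8000/v1/completions"
PAYLOAD = {"model": "Qwen/Qwen2.5-7B-Instruct",
           "prompt": "Hello", "max_tokens": 64}

async def one_request(session: aiohttp.ClientSession,
                      latencies: list[float]) -> None:
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.json()
    latencies.append(time.perf_counter() - start)

async def run(qps: float, duration_s: float) -> None:
    latencies: list[float] = []
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(int(qps * duration_s)):
            # Open-loop arrivals: requests fire at a fixed rate,
            # independent of how fast the server finishes them.
            await asyncio.sleep(1.0 / qps)
            tasks.append(asyncio.create_task(one_request(session, latencies)))
        await asyncio.gather(*tasks)
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"{qps} QPS: p50={p50:.2f} s, p99={p99:.2f} s")

asyncio.run(run(qps=10, duration_s=30))
```

Sweeping `qps` upward and watching where tail latency departs from linear growth is one way to locate saturation points like those reported above.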

## Technical Contributions and Methodology Highlights

1. **Strict Variable Control**: Same model (Qwen2.5-7B-Instruct) and hardware within each comparison, an aligned tokenizer, and a fixed output length (see the alignment sketch after this list);
2. **Progressive Complexity**: From local prototype to production-level A100 testing, each stage has clear objectives;
3. **Rich Visualization**: Generate latency comparison charts, throughput curves, QPS scaling trends, etc.;
4. **Reproducible Process**: All configurations and scripts are included in version control, with README documentation to support reproduction.
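
As an example of point 1, decoding parameters must be pinned identically across engines before any comparison is meaningful. The sketch below shows one way to align HF Transformers and vLLM; the fixed output length of 128 is an assumed value, not necessarily the study's:

```python
# Hedged sketch: aligning decoding settings across two engines.
# MODEL matches the study; MAX_NEW_TOKENS = 128 is an assumed value.
MODEL = "Qwen/Qwen2.5-7B-Instruct"
MAX_NEW_TOKENS = 128

# Hugging Face Transformers: greedy decoding, fixed output length.
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
hf_inputs = tok("Hello", return_tensors="pt").to(hf_model.device)
hf_out = hf_model.generate(**hf_inputs,
                           max_new_tokens=MAX_NEW_TOKENS,
                           do_sample=False)  # greedy removes sampling variance

# vLLM: same model (hence the same tokenizer), same decoding settings.
from vllm import LLM, SamplingParams
vllm_model = LLM(model=MODEL)
vllm_out = vllm_model.generate(
    ["Hello"],
    SamplingParams(max_tokens=MAX_NEW_TOKENS, temperature=0.0),
)
```

With greedy decoding and a fixed output length, any remaining latency or throughput gap can be attributed to the engines rather than to sampling variance.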

## Core Conclusions and Selection Recommendations

### Core Conclusions
- Single-request scenarios: TensorRT-LLM delivers the best raw performance; vLLM is close behind and has a more active ecosystem; HF Transformers suits rapid prototyping;
- Production workloads: for high QPS on the RTX 3090, choose TensorRT-LLM; for high concurrency on the A100, choose vLLM; for multi-model serving, choose Triton + TRT-LLM.

### Selection Decision Framework
| Scenario | Recommended Engine | Reason |
|----------|--------------------|--------|
| Rapid prototyping/research | HF Transformers | Easy to use with no additional dependencies |
| High-concurrency single-model service | vLLM | Optimized for continuous batching, active community |
| Pursuit of extreme performance | TensorRT-LLM | Kernel fusion, highest GPU utilization |
| Multi-model production pipeline | Triton+TensorRT-LLM | Mature model management and service orchestration |
| Edge/resource-constrained deployment | vLLM | Flexible memory management and quantization support |

## Limitations and Future Work

### Current Limitations
- Limited workload diversity (long generations of 256-512 tokens were not evaluated);
- Triton dynamic batching not fully optimized;
- Single workload mode (steady-state QPS, no burst traffic);
- Single-GPU focus; multi-GPU distributed inference is not explored.

### Future Plans
- Long output benchmark testing;
- Triton dynamic batching parameter tuning;
- Burst traffic simulation;
- GPU utilization correlation analysis;
- Multi-GPU scaling evaluation.
