NVIDIA LLM Inference Benchmark: A Comprehensive Comparative Study from Single Requests to Production-Level Workloads

A systematic LLM inference engine benchmark framework comparing the differences in latency, throughput, and system behavior between Hugging Face Transformers, vLLM, and TensorRT-LLM, covering multi-stage experiments from RTX 3090 to A100.

Tags: LLM inference benchmarking, vLLM, TensorRT-LLM, GPU optimization, A100, RTX 3090, throughput testing
Published 2026-04-29 12:41 · Recent activity 2026-04-29 12:56 · Estimated read: 8 min

Section 01

Introduction: Core Overview of the NVIDIA LLM Inference Benchmark

This study uses a systematic benchmark framework to compare the differences in latency, throughput, and system behavior among three mainstream LLM inference engines: Hugging Face Transformers, vLLM, and TensorRT-LLM. The experiments cover hardware configurations from consumer-grade RTX 3090 to data center-grade A100, divided into five progressive stages (local prototype → configuration-driven → dual-engine comparison → three-engine comprehensive comparison → production-level workload testing), aiming to provide developers and architects with scientific references for technical selection.


Section 02

Project Background and Research Motivation

As LLMs move from research to production deployment, inference efficiency has become a cost-critical factor, yet developers often struggle to choose among the many engine options (e.g., HF Transformers, vLLM, TensorRT-LLM). The nvidia-llm-inference-bench project was created to evaluate these engine differences comprehensively, from local runs to production-level workloads, through five-stage experiments that cover multiple hardware configurations and provide empirical references for deployments of different scales.


Section 03

Five-Stage Experimental Design Methodology

The experiment adopts a phased iterative methodology:

  1. Local Baseline Establishment: Use distilgpt2 to verify pipeline correctness (prompt management, latency/throughput calculation, etc.);
  2. Configuration-Driven Framework: Refactor into a YAML configuration-driven framework to support reproducible runs and aggregated summaries (see the sketch after this list);
  3. Dual-Engine Comparison: Compare HF Transformers and vLLM on the RTX 3090, finding that vLLM delivers lower latency and higher throughput;
  4. Three-Engine Comprehensive Comparison: Add TensorRT-LLM and evaluate all three engines under different output lengths;
  5. Production-Level Workload Testing: Simulate QPS traffic to test engine behavior under high concurrency (e.g., vLLM's saturation point, TensorRT-LLM's advantage under high load).
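
Stage 2's configuration-driven design can be pictured with a minimal sketch like the one below. The YAML keys (model, engines, prompts_file, max_new_tokens, num_runs) and the helper names are assumptions for illustration, not the project's actual schema.

```python
# Minimal sketch of a YAML-driven benchmark loop. The config schema and helper
# names below are assumptions for illustration, not the project's actual code.
import itertools

import yaml

EXAMPLE_CONFIG = """
model: Qwen/Qwen2.5-7B-Instruct      # assumed Hugging Face model id
engines: [hf_transformers, vllm]     # engines to compare in this run
prompts_file: prompts/basic.txt      # hypothetical prompt list
max_new_tokens: 64                   # fixed output length for a fair comparison
num_runs: 5                          # repetitions per engine
"""

def load_config(text: str) -> dict:
    """Parse the YAML benchmark configuration into a plain dict."""
    return yaml.safe_load(text)

def run_matrix(cfg: dict):
    """Yield one job descriptor per (engine, repetition) combination."""
    for engine, run in itertools.product(cfg["engines"], range(cfg["num_runs"])):
        yield {
            "engine": engine,
            "run": run,
            "model": cfg["model"],
            "max_new_tokens": cfg["max_new_tokens"],
        }

if __name__ == "__main__":
    cfg = load_config(EXAMPLE_CONFIG)
    for job in run_matrix(cfg):
        print(job)  # the real framework would dispatch a benchmark run here
```

Driving every run from one file is what makes results reproducible and easy to aggregate into the summaries mentioned above.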

Section 04

Key Performance Evidence and Findings

Single-Request Performance (Phase 4)

  • Throughput: TensorRT-LLM (50.7 tok/s for default output) > vLLM (50.3 tok/s) > HF Transformers (~42-43 tok/s);
  • Latency: TensorRT-LLM (1.26s for default output) is slightly better than vLLM (1.27s), while HF is significantly higher (~1.50s).
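
A minimal sketch of how such per-request numbers are typically derived, assuming throughput is counted as newly generated tokens divided by end-to-end latency; under that assumption the figures above are consistent with a default output of roughly 64 tokens (e.g., 50.3 tok/s × 1.27 s ≈ 64). The helper names are hypothetical, not the project's code.

```python
# Minimal sketch of per-request latency / throughput bookkeeping behind numbers
# like "1.27 s, 50.3 tok/s". Illustrative only; not the project's exact code.
import time
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    latency_s: float   # end-to-end wall-clock time for one request
    new_tokens: int    # generated tokens only (prompt tokens excluded)

    @property
    def tokens_per_second(self) -> float:
        return self.new_tokens / self.latency_s

def timed_generate(generate_fn, prompt: str, max_new_tokens: int) -> RequestMetrics:
    """Time one call to an engine wrapper that returns the generated token count."""
    start = time.perf_counter()
    new_tokens = generate_fn(prompt, max_new_tokens)
    return RequestMetrics(latency_s=time.perf_counter() - start, new_tokens=new_tokens)

# Example: a 1.27 s request that produced 64 tokens reports ~50.4 tok/s.
print(f"{RequestMetrics(latency_s=1.27, new_tokens=64).tokens_per_second:.1f} tok/s")
```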

Production Workload Performance (Phase 5)

  • RTX 3090: vLLM scales linearly below 30 QPS; beyond that, latency increases sharply. TensorRT-LLM reduces latency by 25-30% and increases throughput by 30-35% under high QPS;
  • A100: With the advantage of continuous batching, vLLM's maximum sustainable throughput (49 QPS) far exceeds Triton+TensorRT-LLM (36 QPS);
  • Triton+TRT-LLM: Suitable for multi-model production pipelines, but scheduling overhead becomes a bottleneck in single-model high-concurrency scenarios.
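
The steady-state QPS traffic used in this phase can be approximated with an open-loop load generator like the sketch below, assuming each engine is fronted by an HTTP endpoint; the URL, payload, and helper names are illustrative, not the project's actual harness.

```python
# Minimal sketch of an open-loop, steady-state QPS load generator of the kind
# Phase 5 describes. The URL, payload, and helper names are assumptions; the
# engines are assumed to be reachable over HTTP (e.g., an OpenAI-style server).
import asyncio
import time

import aiohttp

async def one_request(session: aiohttp.ClientSession, url: str, payload: dict) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    async with session.post(url, json=payload) as resp:
        await resp.read()
    return time.perf_counter() - start

async def run_at_qps(url: str, payload: dict, qps: float, duration_s: float) -> list[float]:
    """Fire requests at a fixed arrival rate, then collect per-request latencies."""
    interval = 1.0 / qps
    tasks: list[asyncio.Task] = []
    async with aiohttp.ClientSession() as session:
        deadline = time.perf_counter() + duration_s
        while time.perf_counter() < deadline:
            tasks.append(asyncio.create_task(one_request(session, url, payload)))
            await asyncio.sleep(interval)  # keep the arrival rate roughly constant
        return list(await asyncio.gather(*tasks))

# Example (illustrative endpoint): 30 QPS for 60 s.
# latencies = asyncio.run(run_at_qps("http://localhost:8000/v1/completions",
#                                    {"prompt": "Hello", "max_tokens": 64},
#                                    qps=30, duration_s=60))
```

Sweeping the target QPS and recording the latency distribution at each level is what exposes the saturation points reported above (e.g., vLLM's knee near 30 QPS on the RTX 3090).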

Section 05

Technical Contributions and Methodology Highlights

  1. Strict Variable Control: same model (Qwen2.5-7B-Instruct), same hardware, aligned tokenizers, fixed output length;
  2. Progressive Complexity: from local prototype to production-level A100 testing, with clear objectives at each stage;
  3. Rich Visualization: latency comparison charts, throughput curves, QPS scaling trends, etc. (see the sketch after this list);
  4. Reproducible Process: all configurations and scripts are under version control, with README documentation to support reproduction.
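
For point 3, a minimal matplotlib sketch of the kind of latency comparison chart the project generates, populated here with the Phase 4 single-request figures quoted above; the repository's actual plotting code and styling may differ.

```python
# Minimal sketch of a latency comparison chart, populated with the Phase 4
# single-request figures quoted earlier. The project's actual plotting code
# and styling may differ.
import matplotlib.pyplot as plt

engines = ["HF Transformers", "vLLM", "TensorRT-LLM"]
latency_s = [1.50, 1.27, 1.26]  # default-output single-request latency (seconds)

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.bar(engines, latency_s)
ax.set_ylabel("Latency (s)")
ax.set_title("Single-request latency, default output (Qwen2.5-7B-Instruct)")
fig.tight_layout()
fig.savefig("latency_comparison.png", dpi=150)
```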

Section 06

Core Conclusions and Selection Recommendations

Core Conclusions

  • Single-request scenario: TensorRT-LLM leads on raw performance by a small margin; vLLM is very close and has an active ecosystem; HF Transformers is suitable for rapid prototyping;
  • Production workload: for high QPS on the RTX 3090, choose TensorRT-LLM; for high concurrency on the A100, choose vLLM; for multi-model serving, choose Triton+TensorRT-LLM.

Selection Decision Framework

  • Rapid prototyping/research: HF Transformers (easy to use, no additional dependencies)
  • High-concurrency single-model service: vLLM (optimized continuous batching, active community)
  • Pursuit of extreme performance: TensorRT-LLM (kernel fusion, highest GPU utilization)
  • Multi-model production pipeline: Triton+TensorRT-LLM (mature model management and service orchestration)
  • Edge/resource-constrained deployment: vLLM (flexible memory management and quantization support)

Section 07

Limitations and Future Work

Current Limitations

  • Limited workload diversity (no evaluation of 256-512 token long generation);
  • Triton dynamic batching not fully optimized;
  • Single workload mode (steady-state QPS, no burst traffic);
  • Focus on single GPU, no exploration of multi-GPU distributed inference.

Future Plans

  • Long output benchmark testing;
  • Triton dynamic batching parameter tuning;
  • Burst traffic simulation;
  • GPU utilization correlation analysis;
  • Multi-GPU scaling evaluation.