NVIDIA LLM Inference Benchmark: A Comprehensive Comparative Study from Single Requests to Production-Level Workloads

A systematic LLM inference engine benchmark framework comparing the differences in latency, throughput, and system behavior between Hugging Face Transformers, vLLM, and TensorRT-LLM, covering multi-stage experiments from RTX 3090 to A100.

Tags: LLM inference benchmarking, vLLM, TensorRT-LLM, GPU optimization, A100, RTX 3090, throughput testing
Published 2026-04-29 12:41 · Recent activity 2026-04-29 12:56 · Estimated read: 8 min

Section 01

Introduction: Core Overview of the NVIDIA LLM Inference Benchmark

This study uses a systematic benchmark framework to compare the differences in latency, throughput, and system behavior among three mainstream LLM inference engines: Hugging Face Transformers, vLLM, and TensorRT-LLM. The experiments cover hardware configurations from consumer-grade RTX 3090 to data center-grade A100, divided into five progressive stages (local prototype → configuration-driven → dual-engine comparison → three-engine comprehensive comparison → production-level workload testing), aiming to provide developers and architects with scientific references for technical selection.


Section 02

Project Background and Research Motivation

As LLMs move from research to production deployment, inference efficiency has become a cost-critical factor, yet developers often struggle to choose among the many engine options (e.g., HF Transformers, vLLM, TensorRT-LLM). The nvidia-llm-inference-bench project was created to evaluate these engine differences comprehensively, from local runs to production-level workloads, through five-stage experiments that cover multiple hardware configurations and provide empirical references for deployments of different scales.


Section 03

Five-Stage Experimental Design Methodology

The experiment adopts a phased iterative methodology:

  1. Local Baseline Establishment: Use distilgpt2 to verify pipeline correctness (prompt management, latency/throughput calculation, etc.);
  2. Configuration-Driven Framework: Refactor into a YAML configuration-driven framework to support reproducible runs and aggregated summaries (see the sketch after this list);
  3. Dual-Engine Comparison: Compare HF Transformers and vLLM on the RTX 3090, finding that vLLM delivers lower latency and higher throughput;
  4. Three-Engine Comprehensive Comparison: Add TensorRT-LLM and evaluate all three engines under different output lengths;
  5. Production-Level Workload Testing: Simulate QPS traffic to test engine behavior under high concurrency (e.g., vLLM's saturation point, TensorRT-LLM's advantage under high load).
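
Stage 2's configuration-driven design can be pictured with a minimal sketch like the one below. The YAML keys (model, engines, prompts_file, max_new_tokens, num_runs) and the helper names are assumptions for illustration, not the project's actual schema.

```python
# Minimal sketch of a YAML-driven benchmark loop. The config schema and helper
# names below are assumptions for illustration, not the project's actual code.
import itertools

import yaml

EXAMPLE_CONFIG = """
model: Qwen/Qwen2.5-7B-Instruct      # assumed Hugging Face model id
engines: [hf_transformers, vllm]     # engines to compare in this run
prompts_file: prompts/basic.txt      # hypothetical prompt list
max_new_tokens: 64                   # fixed output length for a fair comparison
num_runs: 5                          # repetitions per engine
"""

def load_config(text: str) -> dict:
    """Parse the YAML benchmark configuration into a plain dict."""
    return yaml.safe_load(text)

def run_matrix(cfg: dict):
    """Yield one job descriptor per (engine, repetition) combination."""
    for engine, run in itertools.product(cfg["engines"], range(cfg["num_runs"])):
        yield {
            "engine": engine,
            "run": run,
            "model": cfg["model"],
            "max_new_tokens": cfg["max_new_tokens"],
        }

if __name__ == "__main__":
    cfg = load_config(EXAMPLE_CONFIG)
    for job in run_matrix(cfg):
        print(job)  # the real framework would dispatch a benchmark run here
```

Driving every run from one file is what makes results reproducible and easy to aggregate into the summaries mentioned above.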

Section 04

Key Performance Evidence and Findings

Single-Request Performance (Phase 4)

  • Throughput: TensorRT-LLM (50.7 tok/s for default output) > vLLM (50.3 tok/s) > HF Transformers (~42-43 tok/s);
  • Latency: TensorRT-LLM (1.26s for default output) is slightly better than vLLM (1.27s), while HF is significantly higher (~1.50s).
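
A minimal sketch of how such per-request numbers are typically derived, assuming throughput is counted as newly generated tokens divided by end-to-end latency; under that assumption the figures above are consistent with a default output of roughly 64 tokens (e.g., 50.3 tok/s × 1.27 s ≈ 64). The helper names are hypothetical, not the project's code.

```python
# Minimal sketch of per-request latency / throughput bookkeeping behind numbers
# like "1.27 s, 50.3 tok/s". Illustrative only; not the project's exact code.
import time
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    latency_s: float   # end-to-end wall-clock time for one request
    new_tokens: int    # generated tokens only (prompt tokens excluded)

    @property
    def tokens_per_second(self) -> float:
        return self.new_tokens / self.latency_s

def timed_generate(generate_fn, prompt: str, max_new_tokens: int) -> RequestMetrics:
    """Time one call to an engine wrapper that returns the generated token count."""
    start = time.perf_counter()
    new_tokens = generate_fn(prompt, max_new_tokens)
    return RequestMetrics(latency_s=time.perf_counter() - start, new_tokens=new_tokens)

# Example: a 1.27 s request that produced 64 tokens reports ~50.4 tok/s.
print(f"{RequestMetrics(latency_s=1.27, new_tokens=64).tokens_per_second:.1f} tok/s")
```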

Production Workload Performance (Phase 5)

  • RTX 3090: vLLM scales linearly below 30 QPS; beyond that, latency increases sharply. TensorRT-LLM reduces latency by 25-30% and increases throughput by 30-35% under high QPS;
  • A100: With the advantage of continuous batching, vLLM's maximum sustainable throughput (49 QPS) far exceeds Triton+TensorRT-LLM (36 QPS);
  • Triton+TRT-LLM: Suitable for multi-model production pipelines, but scheduling overhead becomes a bottleneck in single-model high-concurrency scenarios.
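
The steady-state QPS traffic used in this phase can be approximated with an open-loop load generator like the sketch below, assuming each engine is fronted by an HTTP endpoint; the URL, payload, and helper names are illustrative, not the project's actual harness.

```python
# Minimal sketch of an open-loop, steady-state QPS load generator of the kind
# Phase 5 describes. The URL, payload, and helper names are assumptions; the
# engines are assumed to be reachable over HTTP (e.g., an OpenAI-style server).
import asyncio
import time

import aiohttp

async def one_request(session: aiohttp.ClientSession, url: str, payload: dict) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    async with session.post(url, json=payload) as resp:
        await resp.read()
    return time.perf_counter() - start

async def run_at_qps(url: str, payload: dict, qps: float, duration_s: float) -> list[float]:
    """Fire requests at a fixed arrival rate, then collect per-request latencies."""
    interval = 1.0 / qps
    tasks: list[asyncio.Task] = []
    async with aiohttp.ClientSession() as session:
        deadline = time.perf_counter() + duration_s
        while time.perf_counter() < deadline:
            tasks.append(asyncio.create_task(one_request(session, url, payload)))
            await asyncio.sleep(interval)  # keep the arrival rate roughly constant
        return list(await asyncio.gather(*tasks))

# Example (illustrative endpoint): 30 QPS for 60 s.
# latencies = asyncio.run(run_at_qps("http://localhost:8000/v1/completions",
#                                    {"prompt": "Hello", "max_tokens": 64},
#                                    qps=30, duration_s=60))
```

Sweeping the target QPS and recording the latency distribution at each level is what exposes the saturation points reported above (e.g., vLLM's knee near 30 QPS on the RTX 3090).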

Section 05

Technical Contributions and Methodology Highlights

  1. Strict Variable Control: same model (Qwen2.5-7B-Instruct), same hardware, aligned tokenizers, fixed output length;
  2. Progressive Complexity: from local prototype to production-level A100 testing, with clear objectives at each stage;
  3. Rich Visualization: latency comparison charts, throughput curves, QPS scaling trends, etc. (see the sketch after this list);
  4. Reproducible Process: all configurations and scripts are under version control, with README documentation to support reproduction.
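
For point 3, a minimal matplotlib sketch of the kind of latency comparison chart the project generates, populated here with the Phase 4 single-request figures quoted above; the repository's actual plotting code and styling may differ.

```python
# Minimal sketch of a latency comparison chart, populated with the Phase 4
# single-request figures quoted earlier. The project's actual plotting code
# and styling may differ.
import matplotlib.pyplot as plt

engines = ["HF Transformers", "vLLM", "TensorRT-LLM"]
latency_s = [1.50, 1.27, 1.26]  # default-output single-request latency (seconds)

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.bar(engines, latency_s)
ax.set_ylabel("Latency (s)")
ax.set_title("Single-request latency, default output (Qwen2.5-7B-Instruct)")
fig.tight_layout()
fig.savefig("latency_comparison.png", dpi=150)
```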

Section 06

Core Conclusions and Selection Recommendations

Core Conclusions

  • Single-request scenario: TensorRT-LLM leads on raw performance by a small margin; vLLM is very close and has an active ecosystem; HF Transformers is suitable for rapid prototyping;
  • Production workload: for high QPS on the RTX 3090, choose TensorRT-LLM; for high concurrency on the A100, choose vLLM; for multi-model serving, choose Triton+TensorRT-LLM.

Selection Decision Framework

  • Rapid prototyping/research: HF Transformers (easy to use, no additional dependencies)
  • High-concurrency single-model service: vLLM (optimized continuous batching, active community)
  • Pursuit of extreme performance: TensorRT-LLM (kernel fusion, highest GPU utilization)
  • Multi-model production pipeline: Triton+TensorRT-LLM (mature model management and service orchestration)
  • Edge/resource-constrained deployment: vLLM (flexible memory management and quantization support)

Section 07

Limitations and Future Work

Current Limitations

  • Limited workload diversity (no evaluation of 256-512 token long generation);
  • Triton dynamic batching not fully optimized;
  • Single workload mode (steady-state QPS, no burst traffic);
  • Focus on single GPU, no exploration of multi-GPU distributed inference.

Future Plans

  • Long output benchmark testing;
  • Triton dynamic batching parameter tuning;
  • Burst traffic simulation;
  • GPU utilization correlation analysis;
  • Multi-GPU scaling evaluation.