Zing Forum


LLM Inference Framework Performance Showdown: In-depth Evaluation of vLLM, SGLang, and Ollama on Ampere and Hopper Architectures

Cross-generational hardware testing on NVIDIA A10G and H100 GPUs, comparing the throughput, latency, and concurrency scaling of three mainstream LLM inference frameworks. SGLang achieves a 3.4x throughput advantage over vLLM on H100, while Ollama hits architectural bottlenecks under high concurrency.

Tags: LLM inference · vLLM · SGLang · Ollama · GPU benchmarking · Ampere · Hopper · H100 · A10G · LLM deployment
Published 2026-04-20 12:12 · Last activity 2026-04-20 12:19 · Estimated read: 9 min

Section 01

Introduction: Core Conclusions from Evaluating Three LLM Inference Frameworks Across GPU Generations

This article conducts a systematic performance evaluation of three mainstream LLM inference frameworks—vLLM, SGLang, and Ollama—on two generations of NVIDIA GPUs: Ampere (A10G) and Hopper (H100). Key findings: SGLang achieves a 3.4x throughput advantage over vLLM on H100 with significantly lower per-request latency; Ollama hits architectural bottlenecks under high concurrency; and SGLang extracts far more of the next-generation hardware's capability. The analysis covers background, testing methodology, core results, and selection recommendations, providing a quantitative basis for framework choice.


Section 02

Background: Core Dilemmas in LLM Inference Framework Selection and Significance of This Evaluation

As production deployment of large language models becomes widespread, performance differences between inference frameworks directly affect serving cost and user experience. The mainstream options are vLLM (built around PagedAttention), SGLang (runtime-optimized), and Ollama (oriented toward local deployment). Developers, however, lack clarity on real-world performance across hardware generations and concurrency levels: existing tests mostly cover a single platform or framework, without systematic cross-architecture, cross-framework comparison. This evaluation applies a unified methodology to two GPU generations (A10G and H100) to provide a quantifiable basis for framework selection.


Section 03

Testing Methodology and Experimental Design: Rigorous Cross-generational GPU Comparison Scheme

This test was led by Shivansh Singh from Northeastern University and follows the MLPerf Inference specification. Core test parameters:

  • Model: Llama 3.1 8B Instruct (AWQ INT4 quantized)
  • Dataset: real ShareGPT conversations
  • Concurrency levels: 1 / 8 / 32 / 64 / 128, with 300 requests per level (excluding 10 warm-up requests)
  • Maximum output: 128 tokens
  • Metrics: TTFT (time to first token), TPOT (time per output token), ITL (inter-token latency), and end-to-end latency, each at P50/P95/P99

Hardware configuration comparison:

| Hardware | A10G | H100 SXM |
|---|---|---|
| Architecture | Ampere (sm_86) | Hopper (sm_90) |
| VRAM | 24 GB GDDR6 | 80 GB HBM3 |
| Memory bandwidth | 600 GB/s | 3,350 GB/s |
| FlashAttention | v2 | v3 |

The model and software environment are identical across both platforms; only the hardware differs.
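To make the setup concrete, here is a minimal closed-loop load-generator sketch in the same spirit as the methodology above: a semaphore caps in-flight requests at the concurrency level, warm-up requests are discarded, and TTFT plus end-to-end latency percentiles are reported. The `fake_request` coroutine and its latencies are hypothetical stand-ins for a real streaming endpoint, and the percentile helper is a naive nearest-rank estimate:

```python
import asyncio
import random
import time

async def fake_request(prompt: str, max_tokens: int = 128):
    """Stand-in for a real streaming LLM endpoint (hypothetical latencies,
    scaled down so the sketch runs fast)."""
    await asyncio.sleep(0.005)               # time to first token
    for _ in range(random.randint(32, max_tokens)):
        await asyncio.sleep(0.0002)          # inter-token gap
        yield None

async def run_level(concurrency: int, num_requests: int = 30, warmup: int = 3):
    """Closed-loop load: a semaphore keeps at most `concurrency` requests
    in flight, mirroring semaphore-controlled load generation."""
    sem = asyncio.Semaphore(concurrency)
    results = []

    async def one(i: int):
        async with sem:
            t0 = time.perf_counter()
            ttft = None
            async for _ in fake_request(f"prompt-{i}"):
                if ttft is None:
                    ttft = time.perf_counter() - t0
            e2e = time.perf_counter() - t0
            if i >= warmup:                  # discard warm-up requests
                results.append((ttft, e2e))

    await asyncio.gather(*(one(i) for i in range(num_requests)))
    ttfts = sorted(r[0] for r in results)
    e2es = sorted(r[1] for r in results)
    pct = lambda xs, p: xs[min(len(xs) - 1, int(p / 100 * len(xs)))]
    return {"n": len(results),
            "p50_ttft_s": pct(ttfts, 50),
            "p99_e2e_s": pct(e2es, 99)}

if __name__ == "__main__":
    for c in (1, 8, 32):
        print(c, asyncio.run(run_level(c)))
```

A real harness would also record TPOT and per-token ITL from the streaming timestamps; the structure is the same.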


Section 04

Key Findings: SGLang's Overwhelming Advantage in Throughput and Latency

Test results show that SGLang significantly outperforms vLLM on both GPU platforms, and the gap widens on the newer hardware:

Throughput Comparison

| GPU Platform | vLLM | SGLang | SGLang Advantage |
|---|---|---|---|
| A10G | 739 tok/s | 1,151 tok/s | 1.6x |
| H100 | 1,814 tok/s | 6,242 tok/s | 3.4x |

From A10G to H100, SGLang's throughput increases 5.4x while vLLM's increases only 2.5x, indicating that SGLang better exploits H100's HBM3 bandwidth and FlashAttention-3 optimizations.

Per-request Latency

On H100, SGLang's per-request latency is only 450 ms, while vLLM's reaches 4,359 ms (nearly a 10x gap). SGLang also maintains sub-second responses on A10G, which matters for latency-sensitive applications such as chatbots.


Section 05

Ollama's Architectural Bottleneck: Performance Collapse in High-concurrency Scenarios

Ollama shows clear architectural limits under high concurrency: the success rate drops sharply once concurrent users exceed 8, falling to just 0.7% at 128 concurrent requests. The root cause is that the underlying llama.cpp engine uses a fixed-slot parallel architecture with no dynamic batching: when concurrency exceeds the preset slot count, requests are rejected or time out. Recommended scenarios: personal local development, low-concurrency edge deployment, and latency-insensitive background tasks; for high-concurrency production environments, use vLLM or SGLang.
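The fixed-slot failure mode can be illustrated with a toy model (an assumption-laden sketch, not Ollama's or llama.cpp's actual code): each request must hold one of a fixed number of slots for its whole lifetime, and with no queueing or batching, excess simultaneous requests fail outright:

```python
import threading
import time

class FixedSlotServer:
    """Toy fixed-slot server: a request occupies a slot for its entire
    lifetime; when all slots are busy, new requests are rejected."""

    def __init__(self, slots: int):
        self._sem = threading.BoundedSemaphore(slots)

    def handle(self, work) -> bool:
        if not self._sem.acquire(blocking=False):   # all slots busy -> reject
            return False
        try:
            work()
            return True
        finally:
            self._sem.release()

def demo(slots: int = 8, clients: int = 64, work_time: float = 0.2):
    """Fire `clients` simultaneous requests at a server with `slots` slots."""
    srv = FixedSlotServer(slots)
    barrier = threading.Barrier(clients)
    results, lock = [], threading.Lock()

    def client():
        barrier.wait()                       # all clients arrive at once
        ok = srv.handle(lambda: time.sleep(work_time))
        with lock:
            results.append(ok)

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), clients

if __name__ == "__main__":
    ok, total = demo()
    print(f"{ok}/{total} requests served")   # roughly `slots` succeed
```

A continuous-batching scheduler (as in vLLM or SGLang) would instead admit all requests and interleave their token generation, which is why their success rates stay flat as concurrency grows.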


Section 06

Cross-generational GPU Scalability Analysis: SGLang's Efficient Utilization of Next-generation Hardware

SGLang achieves a 5.4x performance improvement moving to H100 (vLLM only 2.5x), which stems from:

  1. Memory bandwidth utilization: H100's bandwidth is 5.6x that of A10G, and SGLang's memory-access patterns exploit it better;
  2. Compute scheduling: Hopper's Tensor Core improvements align with SGLang's operator fusion;
  3. Automatic kernel selection: on both GPUs the AWQ weights are automatically dispatched to awq_marlin kernels, with no manual tuning.

ROI implications: upgrading vLLM from A10G to H100 yields a 2.5x improvement, while migrating to SGLang and upgrading to H100 yields a combined 8.4x gain (roughly 3.4x × 2.5x), so pairing framework migration with the hardware upgrade is more cost-effective.
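The ROI arithmetic follows directly from the measured throughputs in the tables above:

```python
# Measured throughputs from the results tables, in tokens/second.
throughput = {
    ("vLLM", "A10G"): 739,
    ("vLLM", "H100"): 1814,
    ("SGLang", "A10G"): 1151,
    ("SGLang", "H100"): 6242,
}

# Gain from the hardware upgrade alone, per framework.
hw_gain_vllm = throughput[("vLLM", "H100")] / throughput[("vLLM", "A10G")]
hw_gain_sglang = throughput[("SGLang", "H100")] / throughput[("SGLang", "A10G")]

# Combined gain: migrate from vLLM-on-A10G to SGLang-on-H100.
combined = throughput[("SGLang", "H100")] / throughput[("vLLM", "A10G")]

print(f"vLLM hardware gain:   {hw_gain_vllm:.1f}x")    # ~2.5x
print(f"SGLang hardware gain: {hw_gain_sglang:.1f}x")  # ~5.4x
print(f"combined gain:        {combined:.1f}x")        # ~8.4x
```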


Section 07

Engineering Practice Recommendations: Framework Selection Guide for Different Scenarios

Based on the evaluation results, recommendations for different scenarios:

  • High-throughput services (API/batch inference/multi-tenant): Recommend SGLang (dynamic batching, KV Cache management, runtime optimization);
  • Latency-sensitive applications (chatbots/real-time assistants): Recommend SGLang (sub-second response);
  • Rapid prototyping (personal/local testing/low-concurrency demos): Ollama is optional (ease of use), but avoid production deployment;
  • Legacy system migration: vLLM remains stable and reliable with a mature ecosystem; if migration costs are prohibitive, continuing with vLLM is reasonable.
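For reference, typical launch commands for the three frameworks look roughly like the following; the model identifiers, ports, and flags are illustrative and vary by release, so check each project's documentation:

```shell
# vLLM: OpenAI-compatible server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --quantization awq --port 8000

# SGLang: launch the runtime server
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --quantization awq --port 30000

# Ollama: pull and run locally (fine for prototyping, not high concurrency)
ollama run llama3.1:8b
```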

Section 08

Limitations and Future Directions: Boundaries of This Evaluation and Expansion Plans

Limitations of this test:

  1. Each configuration was run only once, so no confidence intervals are reported;
  2. GPU clocks were not locked, allowing perhaps 5-15% fluctuation;
  3. Load generation was closed-loop (semaphore-controlled) rather than open-loop Poisson arrivals;
  4. Only Llama 3.1 8B was tested; other models may behave differently.

Future directions: larger models (70B/400B), multi-GPU tensor parallelism, long-context (32K+) inference, and comparison of quantization schemes (FP8/GPTQ).
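On the third limitation: an open-loop generator issues requests at pre-scheduled arrival times regardless of completions, so it does not let a slow server throttle its own load. A minimal Poisson arrival schedule (a sketch, not the benchmark's actual code) can be drawn with exponential inter-arrival gaps:

```python
import random

def poisson_arrivals(rate_per_s: float, horizon_s: float, seed: int = 0):
    """Open-loop arrival schedule: exponentially distributed inter-arrival
    gaps produce a Poisson process averaging `rate_per_s` requests/second
    over a window of `horizon_s` seconds."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t > horizon_s:
            return arrivals
        arrivals.append(t)

if __name__ == "__main__":
    times = poisson_arrivals(rate_per_s=50, horizon_s=10)
    print(len(times), "arrivals over 10 s")   # ~500 expected
```

A load generator would then fire each request at its scheduled offset (e.g. with `asyncio` tasks), exposing queueing behavior that closed-loop semaphore control hides.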