LLM Inference Service Benchmark: Performance Comparison Between vLLM and SGLang on Modal Cloud Platform

A systematic benchmark of two mainstream LLM inference frameworks, vLLM and SGLang, based on Modal GPU containers, covering Llama-3 8B and Mistral-7B models, evaluating key metrics such as throughput, latency, and cost per million tokens.

Tags: vLLM, SGLang, LLM inference, benchmarking, Modal, GPU inference, throughput, latency optimization, PagedAttention, structured generation
Published 2026-06-13 22:46 | Recent activity 2026-06-13 22:59 | Estimated read 6 min

Section 01

Overview: Benchmarking vLLM and SGLang on the Modal Platform

This article summarizes a systematic benchmark of two mainstream LLM inference frameworks, vLLM and SGLang, running in GPU containers on the Modal cloud platform. It covers the Llama-3 8B and Mistral-7B models and evaluates throughput, latency (P50/P99), and cost per million tokens, giving engineering teams empirical data for framework selection. The original project is llm-serving-bench by GitHub user musel25 (published 2026-06-13).

Section 02

Background: Challenges and Necessity of LLM Inference Framework Selection

LLM inference deployment is a core part of AI infrastructure, yet framework selection often lacks systematic data. vLLM rose quickly on the strength of PagedAttention, and SGLang has drawn attention for structured generation and parallel decoding, but real-world performance comparisons are scattered across blogs and forums. Performance also depends heavily on the deployment environment (local vs. cloud), so a reliable choice requires testing on the target platform itself.

Section 03

Testing Methods and Environment Configuration

The tests ran on the Modal cloud platform, a serverless GPU service that represents a typical cloud-native AI deployment scenario. They cover Meta Llama-3 8B and Mistral-7B, two models with similar parameter counts but different architectures. The evaluation dimensions are:

  • Throughput: Number of tokens generated per unit time, reflecting processing capacity;
  • Latency distribution: P50 (median) and P99 (99th percentile) latency, measuring user experience;
  • Cost-effectiveness: Computing cost per million tokens, combining GPU instance running time and unit price.
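
As a rough illustration of how these three metrics relate, the following Python sketch derives them from per-request records. It is not the benchmark project's code: the `RequestRecord` type, the `summarize` function, the nearest-rank percentile method, and the GPU price used in the usage note are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One completed request (hypothetical record layout)."""
    latency_s: float    # end-to-end latency in seconds
    output_tokens: int  # tokens generated for this request


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records, wall_clock_s, gpu_usd_per_hour):
    """Aggregate throughput, latency percentiles, and cost per million tokens."""
    total_tokens = sum(r.output_tokens for r in records)
    latencies = [r.latency_s for r in records]
    # Cost is GPU running time multiplied by the hourly unit price.
    cost_usd = gpu_usd_per_hour * wall_clock_s / 3600
    return {
        "throughput_tok_s": total_tokens / wall_clock_s,
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
        "usd_per_million_tokens": cost_usd / total_tokens * 1_000_000,
    }
```

For instance, 10,000 generated tokens over 50 seconds of GPU time at an assumed price of 3.60 USD per hour gives 200 tokens per second and 5 USD per million tokens. Note that the P99 value is dominated by the slowest few requests, which is why it is reported separately from the median.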

Section 04

Technical Feature Comparison Between vLLM and SGLang

  • vLLM: Its core innovation is PagedAttention, which manages the KV cache in fixed-size blocks, analogous to virtual memory paging. This improves memory utilization and allows more concurrent requests or longer contexts. It provides scheduling strategies such as FCFS and priority scheduling, and has a mature ecosystem.
  • SGLang: Emphasizes structured generation (e.g., output constrained by a JSON Schema). RadixAttention optimizes prefix cache reuse, which suits RAG scenarios, and parallel decoding combined with speculative execution reduces end-to-end latency, with significant benefits for short sequences.
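
To make the paging analogy concrete, here is a toy Python sketch of a block-table allocator. It is a simplified model written for this article, not vLLM's implementation: the `PagedKVCache` class, its method names, and the block size of 16 tokens are illustrative assumptions. The point it shows is that a sequence only claims a new physical block when its current blocks are full, so memory is allocated on demand rather than reserved for the maximum context length.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class PagedKVCache:
    """Toy model of a paged KV cache: maps each sequence to a list of
    physical block ids, allocating one block at a time."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one more token of the given sequence."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when all current blocks are full.
        if length % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

In this model a 17-token sequence occupies two blocks and wastes at most one partially filled block, whereas a contiguous allocation sized for the maximum context would hold that memory for the whole request lifetime.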

Section 05

Test Results and Key Insights

  • Throughput: The two frameworks differ little under low concurrency. vLLM's PagedAttention advantage becomes apparent at high concurrency, while SGLang stands out on prefix-sharing tasks.
  • Latency: P50 latency is close. vLLM's P99 latency is more stable. SGLang's speculative decoding is effective for short sequences, with diminishing returns for long sequences.
  • Cost: vLLM is slightly better overall thanks to its memory efficiency, while SGLang comes out ahead on specific prefix-sharing tasks. The difference is about 10-20%.
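
The prefix-sharing effect can be illustrated with a small Python sketch that counts how many prompt tokens need fresh prefill computation when shared prefixes are cached. This is a deliberately simplified token-level trie, not SGLang's RadixAttention (which also handles eviction and memory limits), and the function name and sample prompts are assumptions made for this example.

```python
def prefill_tokens_with_prefix_cache(prompts):
    """Count prompt tokens that must be computed when every previously
    seen prefix is served from cache (unbounded cache, no eviction)."""
    root = {}      # nested dicts form a token-level trie
    computed = 0
    for tokens in prompts:
        node = root
        for tok in tokens:
            if tok not in node:
                # Cache miss: this token position needs fresh computation.
                node[tok] = {}
                computed += 1
            node = node[tok]
    return computed


# Three prompts that share a common prefix, e.g. a system prompt
# followed by different questions (token ids are made up).
prompts = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6]]
without_cache = sum(len(p) for p in prompts)
with_cache = prefill_tokens_with_prefix_cache(prompts)
print(without_cache, with_cache)  # prints "11 6"
```

The larger the shared prefix relative to the unique suffix, the more prefill work is avoided, which is consistent with prefix-sharing workloads such as RAG benefiting most.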

Section 06

Selection Recommendations and Conclusions

Selection Recommendations:

  • Choose vLLM: Long context (>4K), high concurrency multi-tenancy, high latency stability requirements, priority on ecosystem compatibility;
  • Choose SGLang: Structured generation needs, prefix-sharing batch tasks (e.g., RAG), short sequence latency-sensitive scenarios;
  • Hybrid strategy: Use vLLM for general queries and SGLang for structured tasks, weighing the added operational complexity of running two frameworks.

Conclusions: Both vLLM and SGLang are excellent frameworks, and the choice should reflect business requirements, load characteristics, and team capabilities. The project has limitations (limited model and hardware coverage, synthetic load), and the test scope will be expanded in the future.