Zing Forum


LLM Inference Batching Benchmark: Quantifying Performance Gains of Continuous Batching from First Principles

A reproducible LLM inference batching benchmark project that quantifies the impact of batching strategies on latency, throughput, GPU memory, and KV cache by comparing Hugging Face static batching with a custom continuous batching scheduler.

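Tags: LLM inference, batching, benchmarking, continuous batching, vLLM, performance optimization, TTFT, throughput, KV cache, GPU memory
Published 2026-06-04 03:42 | Recent activity 2026-06-04 03:50 | Estimated read 7 min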

Section 01

[Introduction] Core Overview of LLM Inference Batching Benchmark

This project is a reproducible LLM inference batching benchmark that quantifies, from first principles, the performance gains of continuous batching over static batching. At its core it compares Hugging Face static batching with a custom continuous batching scheduler and analyzes how the batching strategy affects latency, throughput, GPU memory, and KV cache usage.

Project Author/Maintainer: prasannakotyal
Source Platform: GitHub
Original Title: llm-inference-benchmarking
Original Link: https://github.com/prasannakotyal/llm-inference-benchmarking
Update Time: 2026-06-03


Section 02

Research Background and Problem Definition

In LLM inference services, batching is a core technique for improving throughput and resource utilization. Traditional static batching forms a fixed batch up front (in this benchmark, with all requests sharing the same input length) and holds it until every request in the batch has finished. Continuous batching instead admits new requests while others are still decoding, which keeps the GPU busier.

However, batching strategies involve trade-offs: larger batches improve throughput but may increase TTFT (Time To First Token); continuous batching is flexible but incurs scheduling overhead when request lengths vary significantly. This project aims to answer: How do different batching strategies perform on real hardware? What is the performance gain of continuous batching over static batching? Are the gains applicable to all scenarios?
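The difference between the two strategies can be illustrated with a toy step counter. This is a minimal sketch, not the project's scheduler: it counts decode steps only, ignores prefill and scheduling cost, and the function names and workloads are made up for illustration.

```python
from collections import deque


def static_steps(gen_lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(gen_lengths), batch_size):
        steps += max(gen_lengths[start:start + batch_size])
    return steps


def continuous_steps(gen_lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(gen_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into free slots before the next step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical workloads: generation lengths per request, batch size 4.
skewed = [32, 16, 16, 16] * 2
alternating = [16, 32] * 4

print(static_steps(skewed, 4), continuous_steps(skewed, 4))            # 64 48
print(static_steps(alternating, 4), continuous_steps(alternating, 4))  # 64 64
```

In the first workload, refilling slots saves a quarter of the decode steps. In the second, a long request admitted late dominates the tail and the step count is unchanged, which is a small preview of why the gains may not carry over to every scenario.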


Section 03

Testing Methodology and Hardware Environment

Dual Backend Comparison

Two PyTorch paths are tested:

  • hf-static: Hugging Face static batching (same input length)
  • continuous: Custom KV cache scheduler (requests are added dynamically; it does not use the custom CUDA kernels from vLLM/SGLang, so it stays portable across PyTorch setups)

Synthetic Prompt Design

Synthetic token IDs are used instead of natural language to ensure precise length control, reproducibility, and tokenizer independence.
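A minimal sketch of how such prompts could be generated is shown below. The function name, the seed handling, and the vocabulary size passed in the usage lines are assumptions for illustration, not the project's actual code; in practice the vocabulary size would be read from the model config.

```python
import random


def make_synthetic_prompts(num_prompts, prompt_len, vocab_size, seed=0):
    """Return `num_prompts` lists of exactly `prompt_len` random token IDs."""
    rng = random.Random(seed)  # fixed seed so every run sees the same prompts
    return [
        [rng.randrange(vocab_size) for _ in range(prompt_len)]
        for _ in range(num_prompts)
    ]


# Assumed vocabulary size; take the real value from the model's config.
VOCAB_SIZE = 151_936

prompts = make_synthetic_prompts(num_prompts=8, prompt_len=256, vocab_size=VOCAB_SIZE)
print(len(prompts), len(prompts[0]))  # 8 256
```

Because the IDs never pass through a tokenizer, the prompt length is exact by construction, and the same seed yields the same batch on every machine.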

Test Parameter Matrix

Covers models (Qwen2.5-0.5B/1.5B), prompt lengths (64/256/512), batch sizes (1/4/8), concurrent requests (4/8/16), and generation targets (alternating 16/32 tokens).
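The matrix above could be enumerated as follows. This is a sketch under the assumption that the suite runs the full cross product; the real scripts may prune or reorder combinations, and the dictionary keys are illustrative names.

```python
from itertools import product

MODELS = ["Qwen2.5-0.5B", "Qwen2.5-1.5B"]
PROMPT_LENGTHS = [64, 256, 512]
BATCH_SIZES = [1, 4, 8]
CONCURRENCY = [4, 8, 16]


def generation_targets(num_requests, targets=(16, 32)):
    """Alternate generation targets across requests: 16, 32, 16, 32, ..."""
    return [targets[i % len(targets)] for i in range(num_requests)]


def build_matrix():
    """Full cross product of the swept parameters (an assumed layout)."""
    return [
        {
            "model": model,
            "prompt_len": prompt_len,
            "batch_size": batch_size,
            "concurrency": concurrency,
            "gen_targets": generation_targets(concurrency),
        }
        for model, prompt_len, batch_size, concurrency in product(
            MODELS, PROMPT_LENGTHS, BATCH_SIZES, CONCURRENCY
        )
    ]


matrix = build_matrix()
print(len(matrix))  # 54 configurations per backend
```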

Hardware Configuration

RunPod platform: 2x NVIDIA RTX PRO 4000 Blackwell (24467 MiB per card), Driver 580.159.04, CUDA 13.0, PyTorch 2.12.0+cu130, Transformers 5.10.1, FP16 precision.


Section 04

Key Findings and Data Analysis

Finding 1: Decisive Impact of Batching on Throughput

When batch size increases from 1 to 8, Qwen2.5-0.5B throughput rises from ~50 to 280+ tokens/sec (5x+ gain), and Qwen2.5-1.5B from ~42 to 240+ tokens/sec.

Finding 2: Relationship Between TTFT and Queue Depth

TTFT increases significantly when the number of concurrent requests exceeds batch capacity, because the extra requests wait in the queue before their prefill starts. For example, with prompt length 512, batch size 8, and 16 concurrent requests, Qwen2.5-0.5B has an average TTFT of ~368 ms and Qwen2.5-1.5B ~480 ms.
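For reference, the two latency metrics can be computed from per-token timestamps roughly as below. This is a hedged sketch: the helper names and the sample timestamps are hypothetical, and the project may define or record these metrics differently.

```python
from statistics import mean


def ttft_ms(arrival_s, token_times_s):
    """Time from request arrival to its first generated token, in ms."""
    return (token_times_s[0] - arrival_s) * 1000.0


def mean_itl_ms(token_times_s):
    """Mean gap between consecutive generated tokens (ITL), in ms."""
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    return mean(gaps) * 1000.0 if gaps else 0.0


# Hypothetical timestamps in seconds. Both requests arrive at t=0, but
# request B waits for a free batch slot, so its first token comes later.
req_a = {"arrival": 0.0, "tokens": [0.05, 0.07, 0.09]}
req_b = {"arrival": 0.0, "tokens": [0.40, 0.42, 0.44]}

for name, req in (("A", req_a), ("B", req_b)):
    print(name, ttft_ms(req["arrival"], req["tokens"]), mean_itl_ms(req["tokens"]))
```

Queue wait is included in TTFT because the clock starts at arrival, not at admission; the inter-token latency of the two requests is the same once they are running.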

Finding 3: Linear Growth of KV Cache Memory

Qwen2.5-0.5B: 64 tokens → 1.11 MB/request, 512 tokens → 6.36 MB/request; Qwen2.5-1.5B: 64 tokens → 2.60 MB/request, 512 tokens → 14.85 MB/request.
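The linear growth follows from the per-token KV footprint: 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The sketch below is a back-of-the-envelope check, not the project's measurement code. The layer and head counts are assumed from the published Qwen2.5 configs, and the figures above are reproduced only if "MB" is read as MiB and 31 generated tokens are counted on top of the prompt; both are inferences, not statements from the source.

```python
def kv_cache_mib(cached_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one request in MiB (keys + values, all layers)."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return cached_tokens * bytes_per_token / 2**20


# Assumed shapes from the published Qwen2.5 configs (not given in the article).
QWEN_0_5B = dict(num_layers=24, num_kv_heads=2, head_dim=64)
QWEN_1_5B = dict(num_layers=28, num_kv_heads=2, head_dim=128)

# Assumption: 31 generated tokens are cached in addition to the prompt.
EXTRA_TOKENS = 31

for name, cfg in (("0.5B", QWEN_0_5B), ("1.5B", QWEN_1_5B)):
    for prompt_len in (64, 512):
        size = kv_cache_mib(prompt_len + EXTRA_TOKENS, **cfg)
        print(name, prompt_len, round(size, 2))
# 0.5B 64 1.11
# 0.5B 512 6.36
# 1.5B 64 2.6
# 1.5B 512 14.85
```

Under these assumptions each cached token costs 12 KiB on the 0.5B model and 28 KiB on the 1.5B model in FP16, so memory grows by a fixed amount per additional token.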

Finding 4: Scenario Dependence of Continuous Batching

  • Length-aligned scenarios: continuous batching performs on par or slightly better (e.g., Qwen2.5-1.5B with prompt length 64, batch size 8, 8 concurrent requests: 248 vs. 241 tokens/sec for static)
  • Length-heterogeneous scenarios: continuous batching throughput drops significantly (e.g., prompt length 512, batch size 8, 16 concurrent requests: 145 vs. 275 tokens/sec for static)

Section 05

Engineering Practice Value and Production Insights

Engineering Value

  • Reproducibility: uv-managed dependencies (pyproject.toml+uv.lock) ensure consistent results
  • Automation scripts: run_smoke.sh (smoke test), run_runpod_suite.sh (full test), run_runpod_qwen_1_5b_suite.sh (large model test)
  • Visualization: Throughput comparison, TTFT distribution, ITL heatmap, KV growth curve, peak memory chart

Production Insights

  1. Prioritize batching to improve throughput
  2. Continuous batching requires request length alignment or intelligent grouping
  3. Monitor queue depth and configure backpressure mechanisms to avoid excessive TTFT
  4. Capacity planning: for example, Qwen2.5-1.5B with a 512-token prompt needs 14.85 MB of KV cache per request, so a batch of 8 needs roughly 120 MB of KV cache (8 × 14.85 MB). On a 24 GB GPU, total memory usage, not only the KV cache, must be calculated.
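The capacity-planning arithmetic in the last item can be sketched as below. Only the per-card memory and the per-request KV figure come from the article; the weight size and the reserve for activations and allocator overhead are placeholder assumptions, and the helper name is hypothetical.

```python
def max_concurrent_requests(gpu_mib, weights_mib, kv_mib_per_request, reserve_mib=0.0):
    """Upper bound on requests whose KV cache fits next to the model weights."""
    budget = gpu_mib - weights_mib - reserve_mib
    if budget <= 0:
        return 0
    return int(budget // kv_mib_per_request)


# From the article: 24467 MiB per card and 14.85 MB of KV cache per
# request (Qwen2.5-1.5B, 512-token prompts).
kv_for_batch_8 = 8 * 14.85  # the roughly 120 MB quoted above

# Assumptions, not measurements: about 3 GiB of FP16 weights and a
# 2 GiB reserve for activations and allocator overhead.
limit = max_concurrent_requests(
    gpu_mib=24467, weights_mib=3072, kv_mib_per_request=14.85, reserve_mib=2048
)
print(round(kv_for_batch_8, 1), limit)
```

The result is a memory-only ceiling. The TTFT finding suggests that latency, not memory, is likely to be the binding constraint well before this limit is reached on small models.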

Section 06

Technical Limitations and Future Directions

Limitations

  • The custom continuous batching scheduler is a pure Python implementation, so grouping requests adds scheduling overhead
  • No integration of optimized CUDA kernels from vLLM/SGLang

Future Directions

  • Integrate more efficient scheduling implementations
  • Test 7B/13B-level large models
  • Explore dynamic batch size adjustment strategies
  • Add multi-GPU parallel inference tests