# LLM Inference Batching Benchmark: Quantifying Performance Gains of Continuous Batching from First Principles

> A reproducible LLM inference batching benchmark project that quantifies the impact of batching strategies on latency, throughput, GPU memory, and KV cache by comparing Hugging Face static batching with a custom continuous batching scheduler.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-03T19:42:58.000Z
- Last activity: 2026-06-03T19:50:45.840Z
- Popularity: 145.9
- Keywords: LLM inference, batching, benchmarking, continuous batching, vLLM, performance optimization, TTFT, throughput, KV cache, GPU memory
- Page link: https://www.zingnex.cn/en/forum/thread/llm-9f0ca3ee
- Canonical: https://www.zingnex.cn/forum/thread/llm-9f0ca3ee

---

## [Introduction] Core Overview of LLM Inference Batching Benchmark

This project is a reproducible LLM inference batching benchmark that quantifies, from first principles, the performance gains of continuous batching over static batching. Its core comparison is between Hugging Face static batching and a custom continuous batching scheduler, measuring how each batching strategy affects latency, throughput, GPU memory, and KV cache.

- Project Author/Maintainer: prasannakotyal
- Source Platform: GitHub
- Original Title: llm-inference-benchmarking
- Original Link: https://github.com/prasannakotyal/llm-inference-benchmarking
- Update Time: 2026-06-03

## Research Background and Problem Definition

In LLM inference services, batching is a core technique for improving throughput and resource utilization. Traditional static batching forms a fixed batch up front, with inputs aligned or padded to a common length, and holds that batch until every sequence in it has finished. Continuous batching instead admits new requests as soon as earlier ones complete, which keeps more of the GPU busy.

However, batching strategies involve trade-offs: larger batches improve throughput but may increase TTFT (Time To First Token); continuous batching is flexible but incurs scheduling overhead when request lengths vary significantly. This project aims to answer: How do different batching strategies perform on real hardware? What is the performance gain of continuous batching over static batching? Are the gains applicable to all scenarios?
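The difference between the two strategies can be illustrated with a toy step counter. This is a minimal sketch, not the project's scheduler: it counts decode steps only, ignores prefill cost and scheduling overhead, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_steps(gen_lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(gen_lengths), batch_size):
        steps += max(gen_lengths[i:i + batch_size])
    return steps


def continuous_steps(gen_lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue before the next step."""
    waiting = deque(gen_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; requests on their last token leave.
        active = [r - 1 for r in active if r > 1]
    return steps


lengths = [8, 8, 8, 64] * 2  # illustrative: short requests stuck behind long ones
print(static_steps(lengths, 4), continuous_steps(lengths, 4))  # 128 80
```

In this toy model the advantage depends on how uneven the generation lengths are: with alternating 16/32 targets (`[16, 32] * 4`) and the same batch size, both functions return 64 steps, so any scheduling overhead would show up as a net loss.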

## Testing Methodology and Hardware Environment

### Dual Backend Comparison
Two PyTorch paths are tested:
- hf-static: Hugging Face static batching (all prompts in a batch share the same input length)
- continuous: a custom KV cache scheduler that admits requests dynamically; it uses no custom CUDA kernels from vLLM or SGLang, so it stays portable

### Synthetic Prompt Design
Synthetic token IDs are used instead of natural language to ensure precise length control, reproducibility, and tokenizer independence.
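A synthetic prompt generator along these lines could look like the sketch below. The function name, the `reserved_ids` parameter, and the vocabulary size of 32000 are placeholders chosen for illustration, not values taken from the project; in practice the vocabulary size would come from the model being tested.

```python
import random


def make_synthetic_prompts(num_prompts, prompt_len, vocab_size, seed=0, reserved_ids=0):
    """Return num_prompts lists of exactly prompt_len random token IDs.

    IDs are drawn from [reserved_ids, vocab_size) so that low IDs, which some
    tokenizers use for special tokens, can be skipped (an assumption).
    """
    rng = random.Random(seed)  # seeded generator makes every run identical
    return [
        [rng.randrange(reserved_ids, vocab_size) for _ in range(prompt_len)]
        for _ in range(num_prompts)
    ]


# Placeholder vocabulary size; replace with the real model's value.
prompts = make_synthetic_prompts(num_prompts=8, prompt_len=256, vocab_size=32000, seed=42)
print(len(prompts), len(prompts[0]))  # 8 256
```

Because the IDs never pass through a tokenizer, the prompt length is exact by construction, and the fixed seed gives the same inputs on every run.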

### Test Parameter Matrix
The matrix covers two models (Qwen2.5-0.5B and Qwen2.5-1.5B), prompt lengths of 64, 256, and 512 tokens, batch sizes of 1, 4, and 8, concurrency levels of 4, 8, and 16 requests, and generation targets that alternate between 16 and 32 tokens.
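The parameter matrix can be enumerated as a cross product, as in the sketch below. Whether the project runs every combination or skips some is not stated in the source, so the full product is an assumption, and the dictionary field names are hypothetical.

```python
from itertools import product

MODELS = ["Qwen2.5-0.5B", "Qwen2.5-1.5B"]
PROMPT_LENGTHS = [64, 256, 512]
BATCH_SIZES = [1, 4, 8]
CONCURRENT_REQUESTS = [4, 8, 16]
GENERATION_TARGETS = [16, 32]


def generation_targets(num_requests):
    """Alternate 16 and 32 new tokens across the requests of one run."""
    return [GENERATION_TARGETS[i % len(GENERATION_TARGETS)] for i in range(num_requests)]


def build_matrix():
    """Enumerate every combination as a dict (hypothetical field names)."""
    return [
        {"model": m, "prompt_len": p, "batch_size": b, "concurrency": c,
         "gen_targets": generation_targets(c)}
        for m, p, b, c in product(MODELS, PROMPT_LENGTHS, BATCH_SIZES, CONCURRENT_REQUESTS)
    ]


matrix = build_matrix()
print(len(matrix))  # 54 combinations per backend
```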

### Hardware Configuration
RunPod platform: 2x NVIDIA RTX PRO 4000 Blackwell (24467 MiB per card), Driver 580.159.04, CUDA 13.0, PyTorch 2.12.0+cu130, Transformers 5.10.1, FP16 precision.

## Key Findings and Data Analysis

### Finding 1: Decisive Impact of Batching on Throughput
When batch size increases from 1 to 8, Qwen2.5-0.5B throughput rises from ~50 to 280+ tokens/sec (5x+ gain), and Qwen2.5-1.5B from ~42 to 240+ tokens/sec.

### Finding 2: Relationship Between TTFT and Queue Depth
TTFT increases significantly when the number of concurrent requests exceeds batch capacity, because the surplus requests wait in the queue before their prefill starts. For example, with a 512-token prompt, batch size 8, and 16 concurrent requests, the average TTFT is ~368 ms for Qwen2.5-0.5B and ~480 ms for Qwen2.5-1.5B.
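For reference, TTFT and inter-token latency (ITL) can be derived from per-request timestamps as sketched below. The function name and the sample timestamps are illustrative only; the key point is that TTFT is measured from the moment the request is submitted, so time spent queued behind a full batch counts toward it.

```python
def latency_metrics(arrival_time, token_times):
    """Return (ttft, mean_itl) in the same unit as the timestamps.

    arrival_time is when the request was submitted, not when it was admitted
    to a batch, so queueing delay is included in TTFT.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - arrival_time
    # ITL is the gap between consecutive generated tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


# Made-up timestamps in milliseconds: submitted at t=100, tokens at 150, 170, 190.
print(latency_metrics(100, [150, 170, 190]))  # (50, 20.0)
```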

### Finding 3: Linear Growth of KV Cache Memory
- Qwen2.5-0.5B: 64 tokens → 1.11 MB/request, 512 tokens → 6.36 MB/request
- Qwen2.5-1.5B: 64 tokens → 2.60 MB/request, 512 tokens → 14.85 MB/request
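The per-request figures above can be approximated with the standard KV cache formula: two tensors (keys and values) per layer, each holding `num_kv_heads * head_dim` FP16 values per cached token. The sketch below rests on two assumptions that the source does not state: the layer and head values are taken from the published Qwen2.5 model configurations, and 31 generated tokens are assumed to be cached on top of the prompt (a 32-token target with the final token never written back). Results are in MiB.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, cached_tokens, bytes_per_elem=2):
    """KV cache size for one request; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * cached_tokens


# Assumed architecture values (from the published model configs, not from the source).
CONFIGS = {
    "Qwen2.5-0.5B": dict(num_layers=24, num_kv_heads=2, head_dim=64),
    "Qwen2.5-1.5B": dict(num_layers=28, num_kv_heads=2, head_dim=128),
}
GENERATED_CACHED = 31  # assumption: 32-token target, last token not cached

for name, cfg in CONFIGS.items():
    for prompt_len in (64, 512):
        tokens = prompt_len + GENERATED_CACHED
        mib = kv_cache_bytes(cached_tokens=tokens, **cfg) / 2**20
        print(f"{name} prompt={prompt_len}: {mib:.2f} MiB")
# Qwen2.5-0.5B prompt=64: 1.11 MiB
# Qwen2.5-0.5B prompt=512: 6.36 MiB
# Qwen2.5-1.5B prompt=64: 2.60 MiB
# Qwen2.5-1.5B prompt=512: 14.85 MiB
```

Under these assumptions the cost is a fixed number of bytes per cached token (12 KiB for the 0.5B model, 28 KiB for the 1.5B model), which is why the growth is linear in total sequence length rather than proportional to prompt length alone.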

### Finding 4: Scenario Dependence of Continuous Batching
- Length-aligned scenarios: Continuous batching performs equivalently or slightly better (e.g., Qwen2.5-1.5B with 64 prompt length, batch size 8, concurrent requests 8: 248 vs. static 241 tokens/sec)
- Length-heterogeneous scenarios: Throughput drops significantly (e.g., 512 prompt length, batch size 8, concurrent requests 16: 145 vs. static 275 tokens/sec)
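Expressed as relative change against the static baseline, the two examples above work out as follows. The helper name is illustrative; the throughput figures are the ones quoted in the bullets.

```python
def relative_change(new, baseline):
    """Fractional change of new relative to baseline."""
    return (new - baseline) / baseline


# tokens/sec: continuous vs. hf-static, from the two examples above
print(f"{relative_change(248, 241):+.1%}")  # +2.9%
print(f"{relative_change(145, 275):+.1%}")  # -47.3%
```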

## Engineering Practice Value and Production Insights

### Engineering Value
- Reproducibility: uv-managed dependencies (pyproject.toml+uv.lock) ensure consistent results
- Automation scripts: run_smoke.sh (smoke test), run_runpod_suite.sh (full test), run_runpod_qwen_1_5b_suite.sh (large model test)
- Visualization: Throughput comparison, TTFT distribution, ITL heatmap, KV growth curve, peak memory chart

### Production Insights
1. Prioritize batching to improve throughput
2. Continuous batching requires request length alignment or intelligent grouping
3. Monitor queue depth and configure backpressure mechanisms to avoid excessive TTFT
4. Capacity planning: for example, Qwen2.5-1.5B with a 512-token prompt needs 14.85 MB of KV cache per request, so a batch of 8 needs about 120 MB. On a 24 GB GPU this must be added to model weights and activations when calculating total memory usage.
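The capacity-planning arithmetic in the last item can be sketched as below. The function names and the 1024 MB KV budget are hypothetical; a real budget would be whatever remains after model weights, activations, and framework overhead are subtracted from GPU memory.

```python
def batch_kv_mb(per_request_mb, batch_size):
    """Total KV cache needed when batch_size requests are in flight."""
    return per_request_mb * batch_size


def max_concurrent_requests(kv_budget_mb, per_request_mb):
    """Largest number of requests whose KV cache fits in the budget."""
    return int(kv_budget_mb // per_request_mb)


per_request = 14.85  # MB, Qwen2.5-1.5B at 512 prompt length (from Finding 3)
print(f"{batch_kv_mb(per_request, 8):.1f} MB")     # 118.8 MB
print(max_concurrent_requests(1024, per_request))  # 68, for a hypothetical 1024 MB budget
```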

## Technical Limitations and Future Directions

### Limitations
- The custom continuous batching scheduler is implemented in pure Python, so grouping requests adds overhead
- No integration of optimized CUDA kernels from vLLM/SGLang

### Future Directions
- Integrate more efficient scheduling implementations
- Test larger models in the 7B to 13B range
- Explore dynamic batch size adjustment strategies
- Add multi-GPU parallel inference tests