LLM Inference Batching Benchmark: Quantifying Performance Gains of Continuous Batching from First Principles

A reproducible LLM inference batching benchmark project that quantifies the impact of batching strategies on latency, throughput, GPU memory, and KV cache by comparing Hugging Face static batching with a custom continuous batching scheduler.

LLM inference, batching, benchmarking, continuous batching, vLLM, performance optimization, TTFT, throughput, KV cache, GPU memory

Published 2026-06-04 03:42 | Recent activity 2026-06-04 03:50 | Estimated read 7 min

Section 01

[Introduction] Core Overview of LLM Inference Batching Benchmark

This project is a reproducible LLM inference batching benchmark that quantifies, from first principles, the performance gains of continuous batching over static batching. At its core, it compares Hugging Face static batching with a custom continuous batching scheduler and analyzes how each strategy affects latency, throughput, GPU memory, and KV cache usage.

Project Author/Maintainer: prasannakotyal
Source Platform: GitHub
Original Title: llm-inference-benchmarking
Original Link: https://github.com/prasannakotyal/llm-inference-benchmarking
Update Time: 2026-06-03

Section 02

Research Background and Problem Definition

In LLM inference services, batching is a core technique for improving throughput and resource utilization. Traditional static batching forms a fixed batch up front (in this benchmark, from requests of the same input length) and holds it until every request in the batch has finished. Continuous batching instead admits new requests while others are still decoding, which keeps GPU slots occupied and improves utilization.

However, batching strategies involve trade-offs: larger batches improve throughput but may increase TTFT (Time To First Token); continuous batching is flexible but incurs scheduling overhead when request lengths vary significantly. This project aims to answer: How do different batching strategies perform on real hardware? What is the performance gain of continuous batching over static batching? Are the gains applicable to all scenarios?
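The difference between the two strategies can be sketched with a toy step counter. This is a minimal illustration, not the project's scheduler: it assumes every decode step costs the same, ignores prefill and scheduling overhead, and uses a hypothetical workload of short and long generations. The function names are made up for this sketch.

```python
from collections import deque

def static_steps(gen_lens, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(gen_lens), batch_size):
        steps += max(gen_lens[i:i + batch_size])
    return steps

def continuous_steps(gen_lens, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(gen_lens)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each running request emits one token; finished requests leave.
        active = [r - 1 for r in active if r > 1]
    return steps

# Hypothetical workload: mostly short generations with one long one per group.
workload = [4, 4, 4, 16] * 2
print(static_steps(workload, batch_size=4))      # 32
print(continuous_steps(workload, batch_size=4))  # 24
```

In this toy model the saving comes entirely from refilling slots that short requests vacate early. A real scheduler also pays for prefill and bookkeeping, which is why the measured gains depend on the scenario.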

Section 03

Testing Methodology and Hardware Environment

Dual Backend Comparison

Two PyTorch paths are tested:

hf-static: Hugging Face static batching (same input length)
continuous: Custom KV cache scheduler (requests are added dynamically; it uses no custom CUDA kernels from vLLM or SGLang, so it runs on stock PyTorch)

Synthetic Prompt Design

Synthetic token IDs are used instead of natural language to ensure precise length control, reproducibility, and tokenizer independence.
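A minimal sketch of how such prompts could be generated follows. It uses only the standard library; the function name, the seed, and the `vocab_size` of 32,000 are assumptions for illustration, and a real run would take the vocabulary size from the model configuration and may need to exclude special token IDs.

```python
import random

def synthetic_prompt(length, vocab_size, seed):
    """Return `length` pseudo-random token IDs in [0, vocab_size)."""
    rng = random.Random(seed)  # private generator, so the seed fully determines output
    return [rng.randrange(vocab_size) for _ in range(length)]

# One prompt per length in the benchmark matrix; vocab_size is a placeholder.
prompts = {n: synthetic_prompt(n, vocab_size=32_000, seed=0) for n in (64, 256, 512)}
print({n: len(ids) for n, ids in prompts.items()})
```

Because the IDs are drawn directly, the prompt length is exact by construction and does not depend on how a tokenizer splits text.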

Test Parameter Matrix

Covers models (Qwen2.5-0.5B/1.5B), prompt lengths (64/256/512), batch sizes (1/4/8), concurrent requests (4/8/16), and generation targets (alternating 16/32 tokens).
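The matrix can be enumerated as a simple cross product. This sketch assumes every combination is run and that the alternating generation targets start at 16; neither detail is stated in the source, and the dictionary keys are illustrative names.

```python
from itertools import product

MODELS = ("Qwen2.5-0.5B", "Qwen2.5-1.5B")
PROMPT_LENS = (64, 256, 512)
BATCH_SIZES = (1, 4, 8)
CONCURRENCY = (4, 8, 16)

def generation_targets(n_requests):
    """Alternate 16- and 32-token generation targets across requests."""
    return [16 if i % 2 == 0 else 32 for i in range(n_requests)]

# Full cross product of the four axes (assumed; the project may prune some cells).
grid = [
    {"model": m, "prompt_len": p, "batch_size": b, "concurrency": c}
    for m, p, b, c in product(MODELS, PROMPT_LENS, BATCH_SIZES, CONCURRENCY)
]
print(len(grid))              # 54
print(generation_targets(4))  # [16, 32, 16, 32]
```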

Hardware Configuration

RunPod platform: 2x NVIDIA RTX PRO 4000 Blackwell (24467 MiB per card), Driver 580.159.04, CUDA 13.0, PyTorch 2.12.0+cu130, Transformers 5.10.1, FP16 precision.

Section 04

Key Findings and Data Analysis

Finding 1: Decisive Impact of Batching on Throughput

When batch size increases from 1 to 8, Qwen2.5-0.5B throughput rises from ~50 to 280+ tokens/sec (5x+ gain), and Qwen2.5-1.5B from ~42 to 240+ tokens/sec.

Finding 2: Relationship Between TTFT and Queue Depth

TTFT increases significantly when the number of concurrent requests exceeds batch capacity, because the excess requests wait in the queue. For example, with a 512-token prompt, batch size 8, and 16 concurrent requests, Qwen2.5-0.5B has an average TTFT of ~368 ms and Qwen2.5-1.5B ~480 ms.
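For reference, the latency metrics can be computed from per-request timestamps as sketched below. This assumes the common definitions (TTFT measured from request arrival, so queue wait is included; ITL as the mean gap between consecutive tokens); the project's exact measurement code is not shown in the source, and the timestamps here are invented.

```python
from statistics import mean

def ttft_ms(arrival_s, token_times_s):
    """Time to first token in ms, measured from arrival (includes queue wait)."""
    return (token_times_s[0] - arrival_s) * 1000.0

def mean_itl_ms(token_times_s):
    """Mean inter-token latency in ms; 0.0 if fewer than two tokens."""
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    return mean(gaps) * 1000.0 if gaps else 0.0

# Hypothetical request: arrives at t=0 s, waits in the queue, then streams tokens.
token_times = [0.5, 0.625, 0.75]
print(ttft_ms(0.0, token_times))  # 500.0
print(mean_itl_ms(token_times))   # 125.0
```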

Finding 3: Linear Growth of KV Cache Memory

Qwen2.5-0.5B: 64 tokens → 1.11 MB/request, 512 tokens → 6.36 MB/request
Qwen2.5-1.5B: 64 tokens → 2.60 MB/request, 512 tokens → 14.85 MB/request
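The linear growth can be checked against the usual KV cache size formula, sketched below. The layer counts, KV head counts, and head dimensions are assumptions about the two models' architectures, not values given in the source, and the reported MB are read as MiB. Under those assumptions, the slope between the two measured points matches the per-token formula.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache added per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def slope_kib_per_token(tokens_a, mib_a, tokens_b, mib_b):
    """Per-token growth in KiB between two measured (tokens, MiB) points."""
    return (mib_b - mib_a) / (tokens_b - tokens_a) * 1024

# Assumed architecture values (not stated in the article), FP16 = 2 bytes.
formula_05b = kv_bytes_per_token(layers=24, kv_heads=2, head_dim=64) / 1024
formula_15b = kv_bytes_per_token(layers=28, kv_heads=2, head_dim=128) / 1024

# Slopes derived from the figures reported above.
measured_05b = slope_kib_per_token(64, 1.11, 512, 6.36)
measured_15b = slope_kib_per_token(64, 2.60, 512, 14.85)

print(formula_05b, round(measured_05b, 2))  # 12.0 12.0
print(formula_15b, round(measured_15b, 2))  # 28.0 28.0
```

The reported per-request totals sit slightly above prompt length times slope, which would be consistent with the generated tokens also occupying cache, though the source does not say how the figure is measured.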

Finding 4: Scenario Dependence of Continuous Batching

Length-aligned scenarios: Continuous batching performs on par with or slightly better than static batching (e.g., Qwen2.5-1.5B with 64 prompt length, batch size 8, 8 concurrent requests: 248 vs. static 241 tokens/sec)
Length-heterogeneous scenarios: Continuous batching throughput drops significantly (e.g., 512 prompt length, batch size 8, 16 concurrent requests: 145 vs. static 275 tokens/sec)
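Expressed as relative changes, the two examples above work out as follows. The helper name is illustrative; the inputs are the throughput figures quoted in this finding.

```python
def relative_change_pct(continuous_tps, static_tps):
    """Percent change of continuous batching throughput relative to static."""
    return (continuous_tps - static_tps) / static_tps * 100.0

aligned = relative_change_pct(248, 241)
heterogeneous = relative_change_pct(145, 275)
print(round(aligned, 1))        # 2.9
print(round(heterogeneous, 1))  # -47.3
```

The custom scheduler gains about 3% in the aligned case and loses nearly half its throughput in the heterogeneous one.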

Section 05

Engineering Practice Value and Production Insights

Engineering Value

Reproducibility: uv-managed dependencies (pyproject.toml + uv.lock) ensure consistent results
Automation scripts: run_smoke.sh (smoke test), run_runpod_suite.sh (full test), run_runpod_qwen_1_5b_suite.sh (large model test)
Visualization: Throughput comparison, TTFT distribution, ITL heatmap, KV growth curve, peak memory chart

Production Insights

Prioritize batching to improve throughput
Continuous batching with this scheduler pays off only when request lengths are aligned or requests are grouped intelligently by length
Monitor queue depth and configure backpressure mechanisms to avoid excessive TTFT
Capacity planning: For example, Qwen2.5-1.5B with a 512-token prompt needs 14.85 MB of KV cache per request, so batch size 8 needs about 120 MB of KV cache on a 24 GB GPU. Plan against total memory usage (weights, activations, and KV cache), not KV cache alone.
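The capacity arithmetic can be sketched as below. The per-request KV figure and the per-card memory come from this article; the weight footprint and the reserve for activations and framework overhead are rough placeholder assumptions, so the resulting request count is an upper bound for illustration only.

```python
def kv_total_mib(kv_per_request_mib, batch_size):
    """KV cache needed when `batch_size` requests are resident at once."""
    return kv_per_request_mib * batch_size

def max_resident_requests(gpu_mib, weights_mib, reserve_mib, kv_per_request_mib):
    """How many requests' KV caches fit after weights and a safety reserve."""
    free_mib = gpu_mib - weights_mib - reserve_mib
    return max(0, int(free_mib // kv_per_request_mib))

KV_PER_REQUEST_MIB = 14.85  # Qwen2.5-1.5B at 512-token prompts (from the findings)
GPU_MIB = 24467             # per-card memory from the hardware configuration
WEIGHTS_MIB = 3000          # assumption: roughly 1.5B parameters at FP16
RESERVE_MIB = 2000          # assumption: activations, allocator, CUDA context

print(kv_total_mib(KV_PER_REQUEST_MIB, batch_size=8))  # about 118.8
print(max_resident_requests(GPU_MIB, WEIGHTS_MIB, RESERVE_MIB, KV_PER_REQUEST_MIB))
```

At this model size the KV cache is a small share of the card, so memory is unlikely to be the limit at batch size 8. The same calculation becomes the binding constraint as models and context lengths grow.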

Section 06

Technical Limitations and Future Directions

Limitations

The custom continuous batching scheduler is implemented in pure Python, so grouping requests adds per-step overhead
No integration of optimized CUDA kernels from vLLM/SGLang

Future Directions

Integrate more efficient scheduling implementations
Test larger models in the 7B and 13B class
Explore dynamic batch size adjustment strategies
Add multi-GPU parallel inference tests
