Zing Forum

Queueing Theory Performance Modeling for Continuous Batching LLM Inference: A Systematic Study Combining Theory and Practice

This article introduces the EE384S-Project, a research project that combines a SimPy simulator, analytical models, and real vLLM measurements to analyze TTFT, throughput, and blocking behavior in continuous batching LLM inference.

LLM inference · Continuous batching · Queueing theory · Performance modeling · vLLM · TTFT optimization · Systems research
Published 2026-06-16 08:40 · Recent activity 2026-06-16 08:52 · Estimated read 7 min

Section 01

Continuous Batching LLM Inference Performance Modeling: A Systematic Study Combining Theory & Practice

This post introduces the EE384S-Project by Jav331 (source: GitHub, link: https://github.com/Jav331/EE384S-Project, updated 2026-06-16). It's a comprehensive study combining queueing theory, SimPy simulation, analytical models, and real vLLM hardware measurements to analyze key performance metrics of continuous batching in LLM inference—including TTFT (Time to First Token), throughput (goodput), and blocking behavior.

The project bridges theoretical modeling with practical system behavior, offering insights for researchers and LLM inference deployers.

Section 02

Research Background & Problem Motivation

Optimizing LLM inference performance is a core challenge in AI infrastructure. Unlike training, inference faces dynamic request patterns, varying input/output lengths, and limited GPU memory. Continuous batching improves GPU utilization by dynamically combining requests at the iteration level, but introduces resource competition: KV-cache capacity limits, batch size trade-offs, and arrival rate fluctuations—all of which affect end-to-end latency and throughput. Traditional models struggle to capture these dynamics, so queueing theory is used as a rigorous framework to analyze continuous batching behavior.

Section 03

Three-Part Research Methodology

The project uses three integrated approaches:

  1. SimPy Simulator: Fine-grained discrete event simulation that models request arrival, KV-cache allocation, batch scheduling, and blocking/preemption—providing a controlled environment for validating analytical models.
  2. Analytical Models: Multi-level models (closed-form expressions, Markov chains, hybrid models using measured service rates) to characterize TTFT, goodput, and blocking probability.
  3. Real vLLM Measurements: Empirical validation using the Modal cloud platform and vLLM on an A10G GPU with Qwen2.5-1.5B-Instruct, closing the loop between simulation, theory, and measurement.
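
To make the simulation leg concrete, here is a minimal discrete-event sketch. It is not the project's simulator: it uses only the standard library (`heapq`) rather than SimPy, and every name and parameter (`simulate`, `max_batch`, `kv_capacity`, `kv_per_request`) is hypothetical. It also assumes a request's service time is independent of the current batch size, which real continuous batching does not guarantee.

```python
import heapq
import random


def simulate(arrival_rate, service_rate, max_batch, kv_capacity,
             kv_per_request, n_requests, seed=0):
    """Loss-system sketch: reject a request if the batch or KV cache is full.

    Assumptions (not from the project): Poisson arrivals, exponential
    service times, and a fixed KV-cache footprint per request.
    """
    rng = random.Random(seed)
    in_flight = []   # min-heap of (finish_time, kv_blocks_held)
    kv_used = 0
    now = 0.0
    admitted = blocked = 0

    for _ in range(n_requests):
        now += rng.expovariate(arrival_rate)   # next arrival time
        # Release the KV blocks of every request that finished by now.
        while in_flight and in_flight[0][0] <= now:
            _, freed = heapq.heappop(in_flight)
            kv_used -= freed

        batch_full = len(in_flight) >= max_batch
        kv_full = kv_used + kv_per_request > kv_capacity
        if batch_full or kv_full:
            blocked += 1
            continue

        admitted += 1
        kv_used += kv_per_request
        finish = now + rng.expovariate(service_rate)
        heapq.heappush(in_flight, (finish, kv_per_request))

    return {
        "admitted": admitted,
        "blocked": blocked,
        "blocking_probability": blocked / n_requests,
        "goodput": admitted / now,   # admitted requests per simulated second
    }


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the project.
    print(simulate(arrival_rate=5.0, service_rate=1.0, max_batch=4,
                   kv_capacity=16, kv_per_request=4, n_requests=2000))
```

Sweeping `kv_capacity` or `max_batch` in a loop gives the kind of configuration grid that a simulator can then compare against analytical predictions.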

Section 04

Core Research Questions & Key Metrics

Core question: How do arrival rate, batch width, request length, and KV-cache capacity jointly impact system performance?

Key metrics defined:

  1. TTFT: Time from request submission to first token output (critical for user experience, focusing on p95/p99 tail latencies).
  2. Goodput: Rate of successfully processed requests (excludes blocked/failed ones, reflecting effective service capacity).
  3. Blocking Probability: Probability of request rejection due to KV-cache shortage or full batch queue.
  4. Preemption Behavior: Frequency and impact of long requests releasing resources for shorter ones.
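
As a rough illustration of how the first three metrics could be computed from per-request logs, the sketch below uses a hypothetical `RequestRecord` type and a nearest-rank percentile. It treats a request as served once it has produced a first token, which is an assumption; the project may define successful completion differently.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    """Hypothetical per-request log entry (times in seconds)."""
    submit_time: float
    first_token_time: Optional[float]   # None if no token was produced
    blocked: bool = False


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, window_seconds):
    """Compute goodput, blocking probability, and tail TTFT for one run."""
    served = [r for r in records
              if not r.blocked and r.first_token_time is not None]
    ttft = [r.first_token_time - r.submit_time for r in served]
    return {
        "goodput_req_per_s": len(served) / window_seconds,
        "blocking_probability": sum(r.blocked for r in records) / len(records),
        "ttft_p95": percentile(ttft, 95) if ttft else None,
        "ttft_p99": percentile(ttft, 99) if ttft else None,
    }
```

Nearest-rank is only one of several percentile conventions; with few samples, p95 and p99 can coincide, so tail metrics need enough requests per configuration to be meaningful.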

Section 05

Key Experimental Findings

  1. Simulation vs. Analytical Models: Comparisons across 48 configurations show that goodput predictions are the most accurate (average relative error: 0.177), while p95/p99 TTFT predictions are much less accurate (average relative error of about 1.8), indicating that tail latency modeling remains an open problem.

  2. vLLM Hardware Measurements: On an A10G GPU with Qwen2.5-1.5B-Instruct:

    • Max observed goodput: 6.74 req/s
    • Worst p99 TTFT: 0.185 seconds
    • Average TTFT: below 0.061 seconds
    • Average TPOT (time per output token): 8.3 to 10.2 ms

Notably, no blocking or preemption was observed, which suggests that the experimental loads did not reach the system's bottlenecks and that higher-load tests are needed.
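
For readers unfamiliar with the accuracy figures in the first finding, the sketch below shows one common definition of average relative error. The post does not state the exact formula or which side serves as the reference, so both the definition and the function name `mean_relative_error` are assumptions.

```python
def mean_relative_error(predicted, reference):
    """Mean of |predicted - reference| / |reference| over paired values."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - r) / abs(r) for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Made-up numbers for illustration, not project data.
    print(mean_relative_error([5.0, 6.0], [4.0, 8.0]))
```

Under this definition, 0.177 means predictions miss by about 18% of the reference value on average, while 1.8 means they miss by 180%, that is, by more than the reference value itself.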

Section 06

Technical Insights & Practical Implications

  1. Tail Latency Complexity: Increasing the KV-cache budget reduces blocking but may increase tail latency, so the trade-off is non-monotonic and resource allocation requires careful balancing.
  2. Gap Between Simulation and Real System: Blocking and preemption appeared in simulation but not in the vLLM tests. Possible reasons: the experimental load did not reach the relevant thresholds, or the vLLM implementation differs from the simplified simulation models.
  3. Value of Measurement Infrastructure: The project's reusable pipeline (from trace preprocessing to result aggregation) provides a foundation for systematic performance studies.
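
One way to reason about whether a given load should produce blocking at all is the classical Erlang-B formula for an M/M/c/c loss system. Whether the project's analytical models use it is not stated here, so treat this as an illustrative stand-in: `servers` plays the role of the number of sequences the KV cache can hold at once, and `offered_load` is the arrival rate multiplied by the mean time a request occupies the cache.

```python
def erlang_b(offered_load, servers):
    """Blocking probability of an M/M/c/c loss system.

    Uses the stable recurrence B(0) = 1,
    B(k) = a * B(k-1) / (k + a * B(k-1)), with a = offered_load.
    """
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b


if __name__ == "__main__":
    # 6.74 req/s is the peak goodput reported above; the 2 s mean
    # residence time and the slot counts are hypothetical.
    offered = 6.74 * 2.0
    for slots in (8, 16, 32, 64):
        print(slots, erlang_b(offered, slots))
```

In this model blocking falls monotonically as capacity grows and becomes negligible once capacity is well above the offered load, which is consistent with the first explanation for the gap. The formula says nothing about tail latency, so it cannot capture the non-monotonic effect described in the first insight.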

Section 07

Limitations & Future Directions

Limitations: The vLLM experiments did not trigger KV-cache blocking or preemption, so higher-load tests are needed to reproduce real-world bottlenecks.

Future directions:

  • Validate larger models (7B, 70B) and multi-GPU parallel scenarios.
  • Test more complex request length distributions.
  • Improve tail latency prediction accuracy by aligning analytical models with real data.