Queueing-Theoretic Performance Modeling of Continuous-Batching LLM Inference: A Systems Study Combining Theory and Practice

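This post introduces EE384S-Project, a research project that combines a SimPy simulator, analytical models, and real vLLM measurement experiments to analyze TTFT, throughput, and blocking behavior in continuous-batching LLM inference.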

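LLM inference, continuous batching, queueing theory, performance modeling, vLLM, TTFT optimization, systems research
Published 2026/06/16 08:40 | Last activity 2026/06/16 08:52 | Estimated reading time: 7 minutes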

Section 01

Continuous Batching LLM Inference Performance Modeling: A System Study Combining Theory & Practice

This post introduces the EE384S-Project by Jav331 (source: GitHub, link: https://github.com/Jav331/EE384S-Project, updated 2026-06-16). The project combines queueing theory, SimPy simulation, analytical models, and vLLM measurements on real hardware to analyze key performance metrics of continuous batching in LLM inference: TTFT (Time to First Token), throughput (goodput), and blocking behavior.

The project bridges theoretical modeling with practical system behavior, offering insights for researchers and LLM inference deployers.

Section 02

Research Background & Problem Motivation

LLM inference performance optimization is a core challenge in AI infrastructure. Unlike training, inference faces dynamic request patterns, varying input/output lengths, and limited GPU memory.

Continuous batching raises GPU utilization by regrouping requests at every iteration, but it also introduces resource contention: KV-cache capacity limits, batch-size tradeoffs, and arrival-rate fluctuations all affect end-to-end latency and throughput.

Traditional performance models struggle to capture these dynamics, so the project uses queueing theory as a rigorous framework for analyzing continuous-batching behavior.

Section 03

Three-Part Research Methodology

The project uses three integrated approaches:

  1. SimPy Simulator: A fine-grained discrete-event simulation that models request arrival, KV-cache allocation, batch scheduling, and blocking/preemption, providing a controlled environment for validating the analytical models.
  2. Analytical Models: Multi-level models (closed-form expressions, Markov chains, and hybrid models that use measured service rates) that characterize TTFT, goodput, and blocking probability.
  3. Real vLLM Measurements: vLLM running on an A10G GPU through the Modal cloud platform, serving Qwen2.5-1.5B-Instruct, for empirical validation. Together the three form a closed loop of simulation, theory, and measurement.
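
To make the simulated mechanics concrete, here is a minimal sketch of an iteration-level continuous-batching loop. It is not the project's simulator: it is written in plain Python rather than SimPy so that it runs without dependencies, it omits preemption, and the names `Request`, `simulate`, `batch_width`, `kv_budget`, and `max_queue` are illustrative assumptions. It also assumes each request reserves KV-cache space for its full sequence at admission and that every running request emits one token per iteration.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    arrival: int                 # iteration index at which the request arrives
    prompt_tokens: int
    output_tokens: int           # assumed to be at least 1
    first_token_at: Optional[int] = None
    remaining: int = 0
    blocked: bool = False

    @property
    def kv_tokens(self) -> int:
        # Assumption: KV space for prompt plus full output is reserved up front.
        return self.prompt_tokens + self.output_tokens


def simulate(requests: List[Request], batch_width: int, kv_budget: int,
             max_queue: int) -> List[Request]:
    """Run the toy scheduler until every request is served or blocked."""
    if batch_width < 1:
        raise ValueError("batch_width must be at least 1")
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    waiting: deque = deque()
    running: List[Request] = []
    kv_used = 0
    step = 0

    def admit() -> None:
        # FIFO admission: stop at the first request that does not fit.
        nonlocal kv_used
        while (waiting and len(running) < batch_width
               and kv_used + waiting[0].kv_tokens <= kv_budget):
            req = waiting.popleft()
            req.remaining = req.output_tokens
            kv_used += req.kv_tokens
            running.append(req)

    while pending or waiting or running:
        # 1. Handle arrivals; reject when the request can never fit or the
        #    waiting queue is already at capacity.
        while pending and pending[0].arrival <= step:
            req = pending.popleft()
            if req.kv_tokens > kv_budget:
                req.blocked = True
                continue
            waiting.append(req)
            admit()
            if len(waiting) > max_queue:
                waiting.pop()        # the newest arrival is the one rejected
                req.blocked = True
        # 2. Fill slots freed by requests that finished last iteration.
        admit()
        # 3. Run one iteration: every running request emits one token.
        step += 1
        still_running = []
        for req in running:
            if req.first_token_at is None:
                req.first_token_at = step
            req.remaining -= 1
            if req.remaining > 0:
                still_running.append(req)
            else:
                kv_used -= req.kv_tokens
        running = still_running
    return requests
```

In this sketch TTFT, measured in iterations, is `first_token_at - arrival` for each request that was not blocked. Shrinking `kv_budget` or `batch_width` makes requests wait longer, and shrinking `max_queue` converts that waiting into blocking.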

Section 04

Core Research Questions & Key Metrics

Core question: How do arrival rate, batch width, request length, and KV-cache capacity jointly impact system performance?
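
As a rough illustration of how a closed-form model can tie some of these parameters together, the sketch below uses the textbook Erlang-B formula for an M/M/c/c loss system, treating batch width as the number of parallel slots. This is an assumption chosen for illustration and not necessarily one of the project's models; it ignores KV-cache limits and request-length variation, and the function names are hypothetical.

```python
def erlang_b(offered_load: float, servers: int) -> float:
    """Blocking probability of an M/M/c/c loss system (Erlang-B).

    offered_load: arrival rate divided by per-slot service rate (Erlangs).
    servers: number of parallel slots, used here as a stand-in for batch width.
    """
    if offered_load < 0 or servers < 0:
        raise ValueError("offered_load and servers must be non-negative")
    blocking = 1.0  # B(a, 0) = 1: with no slots, every arrival is blocked
    for c in range(1, servers + 1):
        # Standard numerically stable recursion on the number of slots.
        blocking = offered_load * blocking / (c + offered_load * blocking)
    return blocking


def loss_system_goodput(arrival_rate: float, service_rate: float,
                        servers: int) -> float:
    """Rate of admitted requests: arrival_rate * (1 - blocking probability)."""
    if service_rate <= 0:
        raise ValueError("service_rate must be positive")
    return arrival_rate * (1.0 - erlang_b(arrival_rate / service_rate, servers))
```

For example, with an offered load of 2 Erlangs and 2 slots the blocking probability is 0.4, so an arrival rate of 2 req/s yields a goodput of 1.2 req/s in this simplified model.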

Key metrics defined:

  1. TTFT: Time from request submission to the first output token. It is critical for user experience, and the study focuses on the p95/p99 tail.
  2. Goodput: Rate of successfully processed requests. Blocked and failed requests are excluded, so it reflects effective service capacity.
  3. Blocking Probability: Probability that a request is rejected because of a KV-cache shortage or a full batch queue.
  4. Preemption Behavior: How often long requests are preempted to release resources for shorter ones, and what impact that has.
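
The first three metrics can be computed from per-request timestamps roughly as follows. This is a minimal sketch under stated assumptions: each record is a hypothetical `(submit_time, first_token_time)` pair in seconds, `first_token_time` is `None` for a blocked request, and percentiles use the nearest-rank method, which may differ from the project's aggregation code.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence, Tuple

# (submit_time_s, first_token_time_s or None if the request was blocked)
Record = Tuple[float, Optional[float]]


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values or not 0 < q <= 100:
        raise ValueError("need at least one value and 1 <= q <= 100")
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100   # integer ceil(q * n / 100)
    return ordered[rank - 1]


def summarize(records: List[Record], duration_s: float) -> Dict[str, Optional[float]]:
    """Compute TTFT statistics, goodput, and blocking probability."""
    if not records or duration_s <= 0:
        raise ValueError("need at least one record and a positive duration")
    ttfts = [first - submit for submit, first in records if first is not None]
    blocked = len(records) - len(ttfts)
    return {
        "mean_ttft_s": mean(ttfts) if ttfts else None,
        "p95_ttft_s": percentile(ttfts, 95) if ttfts else None,
        "p99_ttft_s": percentile(ttfts, 99) if ttfts else None,
        "goodput_rps": len(ttfts) / duration_s,       # served requests per second
        "blocking_probability": blocked / len(records),
    }
```

With few samples the p95 and p99 values collapse onto the largest observations, which is one reason tail metrics need many requests per configuration to be stable.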

Section 05

Key Experimental Findings

  1. Simulation vs Analytical Models: Across 48 configurations, goodput predictions are the most accurate (average relative error: 0.177), while p95/p99 TTFT predictions are much harder (average relative error about 1.8), indicating that tail-latency modeling remains an open problem.

  2. vLLM Hardware Measurements: On an A10G GPU with Qwen2.5-1.5B-Instruct:

    • Max observed goodput: 6.74 req/s
    • Worst p99 TTFT: 0.185 seconds
    • Average TTFT: < 0.061 seconds
    • Average TPOT (time per output token): 8.3-10.2 ms

Notably, no blocking or preemption was observed. This suggests the experimental loads did not reach the system's bottlenecks and points to higher-pressure tests as future work.
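
For reference, an average relative error such as the one quoted in the first finding can be computed as below. The post does not give the exact definition, so this sketch assumes the common form |predicted - observed| / |observed| averaged over configurations, with the simulation taken as the observed value; the function name is hypothetical.

```python
from typing import Sequence


def mean_relative_error(predicted: Sequence[float],
                        observed: Sequence[float]) -> float:
    """Average of |predicted - observed| / |observed| over paired values.

    Pairs whose observed value is zero are skipped, since the ratio is
    undefined there (an assumption of this sketch).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    errors = [abs(p - o) / abs(o) for p, o in zip(predicted, observed) if o != 0]
    if not errors:
        raise ValueError("no pairs with a non-zero observed value")
    return sum(errors) / len(errors)
```

Under this definition, a relative error of 0.177 means predictions are off by about 18% on average, whereas a value of 1.8 corresponds, for example, to a prediction 2.8 times the observed value.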

Section 06

Technical Insights & Practical Implications

  1. Tail Latency Complexity: Increasing the KV-cache budget reduces blocking but may increase tail latency, so the tradeoff is non-monotonic and resource allocation needs careful balancing.
  2. Simulation vs Real System Gap: The simulation exhibits blocking and preemption, but the vLLM tests showed none. Possible reasons are that the experimental load did not reach the relevant thresholds, or that vLLM's implementation differs from the simplified simulation model.
  3. Measurement Infrastructure Value: The project's reusable pipeline (trace preprocessing to result aggregation) provides a foundation for systematic performance studies.

Section 07

Limitations & Future Directions

Limitations: The vLLM experiments did not trigger KV-cache blocking or preemption, so higher-pressure tests are needed to reproduce real-world bottlenecks.

Future directions:

  • Validate on larger models (7B, 70B) and in multi-GPU parallel scenarios.
  • Test more complex request length distributions.
  • Improve tail latency prediction accuracy by aligning analytical models with real data.