Queueing Theory Performance Modeling for Continuous Batching LLM Inference: A Systematic Study Combining Theory and Practice

This article introduces the EE384S-Project, a research project that combines a SimPy simulator, analytical models, and real vLLM measurement experiments to analyze TTFT, throughput, and blocking behavior in continuous batching LLM inference.

Tags: LLM inference, continuous batching, queueing theory, performance modeling, vLLM, TTFT optimization, systems research

Published 2026-06-16 08:40 | Recent activity 2026-06-16 08:52 | Estimated read 7 min

Section 01

Continuous Batching LLM Inference Performance Modeling: A System Study Combining Theory & Practice

This post introduces the EE384S-Project by Jav331 (source: GitHub, link: https://github.com/Jav331/EE384S-Project, updated 2026-06-16). The study combines queueing theory, SimPy simulation, analytical models, and real vLLM hardware measurements to analyze key performance metrics of continuous batching in LLM inference: TTFT (Time to First Token), goodput (throughput counted over successfully served requests), and blocking behavior.

The project bridges theoretical modeling with practical system behavior, offering insights for researchers and LLM inference deployers.

Section 02

Research Background & Problem Motivation

Optimizing LLM inference performance is a core challenge in AI infrastructure. Unlike training, inference faces dynamic request patterns, varying input and output lengths, and limited GPU memory. Continuous batching improves GPU utilization by recombining requests at the iteration level: a request can join or leave the running batch between decoding steps instead of waiting for the whole batch to finish. This introduces resource competition, because KV-cache capacity limits, batch size trade-offs, and arrival rate fluctuations all affect end-to-end latency and throughput. Traditional models struggle to capture these dynamics, so the project uses queueing theory as a rigorous framework for analyzing continuous batching behavior.
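To make iteration-level batching concrete, here is a minimal Python sketch. It is not the project's simulator: the function name `run_continuous_batching`, the FCFS admission rule, and the choice to reserve KV-cache space for a request's full length at admission (so no preemption is ever needed) are all assumptions made for illustration. Prefill is folded into the first decode step for simplicity.

```python
from collections import deque

def run_continuous_batching(requests, max_batch, kv_budget_tokens):
    """Toy iteration-level scheduler (illustrative assumptions throughout).

    requests: list of (req_id, prompt_len, output_len), all queued at step 0.
    Returns {req_id: decode step at which the request finished}.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    waiting = deque(requests)
    running = {}       # req_id -> [reserved_kv_tokens, remaining_output_tokens]
    finished_at = {}
    kv_used = 0
    step = 0
    while waiting or running:
        step += 1
        # Admission happens at every iteration boundary, not once per batch.
        while waiting and len(running) < max_batch:
            req_id, prompt_len, output_len = waiting[0]
            if output_len < 1:
                raise ValueError(f"request {req_id} must produce at least one token")
            need = prompt_len + output_len  # conservative full-length reservation
            if need > kv_budget_tokens:
                raise ValueError(f"request {req_id} can never fit in the KV budget")
            if kv_used + need > kv_budget_tokens:
                break  # head-of-line request waits until memory is freed
            waiting.popleft()
            running[req_id] = [need, output_len]
            kv_used += need
        # One decode iteration: every running request emits one token.
        for req_id in list(running):
            running[req_id][1] -= 1
            if running[req_id][1] == 0:
                kv_used -= running[req_id][0]
                del running[req_id]
                finished_at[req_id] = step
    return finished_at

if __name__ == "__main__":
    demo = [("a", 4, 2), ("b", 4, 5), ("c", 4, 1)]
    print(run_continuous_batching(demo, max_batch=2, kv_budget_tokens=100))
```

In the demo, request "c" takes over the slot freed by "a" and finishes while "b" is still decoding; under static batching it would have had to wait for the whole first batch to drain. Shrinking `kv_budget_tokens` shows the other source of contention: requests queue on memory even when a batch slot is free.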

Section 03

Three-Part Research Methodology

The project uses three integrated approaches:

SimPy Simulator: Fine-grained discrete-event simulation that models request arrival, KV-cache allocation, batch scheduling, and blocking/preemption, providing a controlled environment for validating the analytical models.
Analytical Models: Multi-level models (closed-form expressions, Markov chains, and hybrid models that use measured service rates) that characterize TTFT, goodput, and blocking probability.
Real vLLM Measurements: Empirical validation using vLLM on the Modal cloud platform with an A10G GPU and Qwen2.5-1.5B-Instruct, closing the loop between simulation, theory, and measurement.
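The simulation-versus-theory loop described above can be illustrated on a much simpler system than the project's. The sketch below assumes an M/M/c/c loss system, where each of `servers` batch slots is treated as an independent exponential server and arrivals that find all slots busy are rejected. That is a strong simplification and not the project's model; the Erlang B recursion is the standard closed form for it. A plain `heapq` event loop stands in for SimPy so the example needs only the standard library.

```python
import heapq
import random

def erlang_b(offered_load, servers):
    """Blocking probability of an M/M/c/c loss system via the Erlang B recursion.

    offered_load is arrival_rate / service_rate (in Erlangs).
    """
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b

def simulate_loss_system(arrival_rate, service_rate, servers, n_arrivals, seed=0):
    """Estimate blocking probability by simulating Poisson arrivals."""
    rng = random.Random(seed)
    now = 0.0
    departures = []  # min-heap of finish times for requests in service
    blocked = 0
    for _ in range(n_arrivals):
        now += rng.expovariate(arrival_rate)
        # Release every slot whose request finished before this arrival.
        while departures and departures[0] <= now:
            heapq.heappop(departures)
        if len(departures) >= servers:
            blocked += 1  # no free slot: the request is rejected
        else:
            heapq.heappush(departures, now + rng.expovariate(service_rate))
    return blocked / n_arrivals

if __name__ == "__main__":
    analytic = erlang_b(offered_load=4.0, servers=4)
    simulated = simulate_loss_system(4.0, 1.0, servers=4, n_arrivals=200_000)
    print(f"analytic={analytic:.4f} simulated={simulated:.4f}")
```

The same pattern, an analytical prediction checked against a seeded simulation, scales up to richer models; the agreement here is expected only because the simulated system satisfies the closed form's assumptions exactly.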

Section 04

Core Research Questions & Key Metrics

Core question: How do arrival rate, batch width, request length, and KV-cache capacity jointly impact system performance?

Key metrics defined:

TTFT: Time from request submission to first token output (critical for user experience, focusing on p95/p99 tail latencies).
Goodput: Rate of successfully processed requests (excludes blocked/failed ones, reflecting effective service capacity).
Blocking Probability: Probability of request rejection due to KV-cache shortage or full batch queue.
Preemption Behavior: Frequency and impact of long requests releasing resources for shorter ones.
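The first three metrics above can be computed from per-request timestamps. The following is a minimal sketch: the `RequestRecord` fields, the `summarize` helper, and the nearest-rank percentile definition are hypothetical choices for illustration, not the project's log schema or aggregation code.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestRecord:
    submit_s: float                 # time the request was submitted
    first_token_s: Optional[float]  # None if the request was blocked
    finish_s: Optional[float]       # None if the request never completed

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(records, window_s):
    """Goodput, blocking probability, and tail TTFT over one measurement window."""
    served = [r for r in records
              if r.first_token_s is not None and r.finish_s is not None]
    ttfts = [r.first_token_s - r.submit_s for r in served]
    return {
        "goodput_rps": len(served) / window_s,
        "blocking_prob": 1 - len(served) / len(records),
        "ttft_p95_s": percentile(ttfts, 95),
        "ttft_p99_s": percentile(ttfts, 99),
    }

if __name__ == "__main__":
    records = [
        RequestRecord(0.0, 0.1, 1.0),
        RequestRecord(0.0, 0.2, 1.2),
        RequestRecord(0.0, 0.3, 1.5),
        RequestRecord(0.0, None, None),  # a blocked request
    ]
    print(summarize(records, window_s=2.0))
```

With only a handful of samples, p95 and p99 collapse onto the maximum, which is one reason tail metrics need many more requests per configuration than averages do.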

Section 05

Key Experimental Findings

Simulation vs Analytical Models: Across 48 configurations, goodput predictions are the most accurate (average relative error: 0.177), while p95/p99 TTFT predictions are much less accurate (average relative error of about 1.8), indicating that tail latency modeling remains an open problem.
vLLM Hardware Measurements: On A10G GPU with Qwen2.5-1.5B-Instruct:
- Max observed goodput: 6.74 req/s
- Worst p99 TTFT: 0.185 seconds
- Average TTFT: below 0.061 seconds
- Average TPOT (time per output token): 8.3-10.2 ms

Notably, no blocking or preemption was observed, which suggests that the experimental loads did not reach the system's bottlenecks and that higher-pressure tests are needed.
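For readers unfamiliar with the accuracy figures quoted above, an average relative error can be computed as sketched below. The exact definition the project uses is not stated here, so the formula (mean of |predicted - measured| / |measured| across configurations) and the function name are assumptions, and the sample numbers are invented purely to exercise the code.

```python
def mean_relative_error(predicted, measured):
    """Mean of |predicted - measured| / |measured| over paired configurations."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / abs(m) for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    # Invented values, not project results.
    model_goodput = [1.1, 1.8]
    simulated_goodput = [1.0, 2.0]
    print(mean_relative_error(model_goodput, simulated_goodput))
```

Under this definition a value of 0.177 means predictions are off by roughly 18% on average, while a value near 1.8 means the typical miss is larger than the measured quantity itself.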

Section 06

Technical Insights & Practical Implications

Tail Latency Complexity: Increasing the KV-cache budget reduces blocking but may increase tail latency. The trade-off is non-monotonic, so resource allocation requires careful balancing.
Gap Between Simulation and Real System: Unlike the simulation, the vLLM tests showed no blocking or preemption. Possible reasons are that the experimental load did not hit the relevant thresholds, or that the vLLM implementation differs from the simplified simulation models.
Value of Measurement Infrastructure: The project's reusable pipeline (from trace preprocessing to result aggregation) provides a foundation for systematic performance studies.
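One way to sanity-check the "load did not hit thresholds" explanation is a back-of-the-envelope estimate of how many tokens the KV cache can hold. The sketch below uses the standard per-token KV size for a transformer with grouped-query attention (keys plus values, per layer, per KV head). Every number in the example is an assumption: the layer count, KV-head count, and head dimension are believed to match the published Qwen2.5-1.5B configuration but should be checked against the model's config file, and the memory fraction, weight size, and overhead are hypothetical placeholders rather than measured values.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_token_capacity(gpu_mem_bytes, mem_fraction, weight_bytes,
                      overhead_bytes, bytes_per_token):
    """Tokens that fit in the memory left after weights and runtime overhead."""
    budget = gpu_mem_bytes * mem_fraction - weight_bytes - overhead_bytes
    return max(0, int(budget // bytes_per_token))

if __name__ == "__main__":
    # Assumed model shape (verify against the model config before relying on it).
    per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=2, head_dim=128)
    capacity = kv_token_capacity(
        gpu_mem_bytes=24 * 2**30,   # A10G-class card, assumed 24 GiB
        mem_fraction=0.9,           # hypothetical share of memory given to serving
        weight_bytes=3.1e9,         # rough fp16 weight size, assumed
        overhead_bytes=2e9,         # activations and runtime overhead, assumed
        bytes_per_token=per_token,
    )
    print(per_token, capacity)
```

Under these assumptions the cache holds on the order of hundreds of thousands of tokens, far more than a few requests per second of short prompts would occupy at once. That would be consistent with the absence of blocking in the measurements, though it does not rule out implementation differences as a contributing factor.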

Section 07

Limitations & Future Directions

Limitations: The vLLM experiments did not trigger KV-cache blocking or preemption, so higher-pressure tests are needed to reproduce real-world bottlenecks.

Future directions:

Validate larger models (7B, 70B) and multi-GPU parallel scenarios.
Test more complex request length distributions.
Improve tail latency prediction accuracy by aligning analytical models with real data.
