From HTTP Services to Token Services: A Practical Guide to LLM Inference Performance Diagnosis

Tags: LLM inference, vLLM, performance optimization, Kubernetes, GPU, TTFT, TPOT, KV cache, large language models, inference latency

Published 2026-06-15 07:45 | Recent activity 2026-06-15 07:51 | Estimated read 10 min

Section 01

Introduction: Paradigm Shift and Core Methods for LLM Inference Performance Diagnosis

This article examines how to diagnose the performance of LLM inference services on Kubernetes platforms. It uses vLLM experimental data to show how key metrics such as TTFT, TPOT, prefill time, and decode time relate to one another, helping platform engineers understand the multi-dimensional nature of inference latency. Tuning an LLM inference service differs fundamentally from tuning a conventional web service: traditional monitoring does not capture its stateful behavior, so analysis must consider three resource dimensions (compute, memory bandwidth, memory capacity) and three latency signals (TTFT, TPOT, queue wait time).

Section 02

Background: Why Traditional Monitoring Fails for LLM Inference Services

Traditional HTTP service monitoring focuses on request success rate, response time, CPU, and memory usage, but LLM inference services have unique state characteristics involving three independent resource dimensions and three independent latency signals.

Three Resource Dimensions

Compute Resources (Prefill Phase): Processing the input prompt and computing attention over it; the cost is strongly correlated with input length.
Memory Bandwidth (Decoding Phase): The KV cache is read for every generated output token, so this phase is limited by VRAM bandwidth.
Memory Capacity (KV Cache): VRAM space for storing attention key-value pairs; it is easily exhausted by long sequences or high concurrency.

Three Latency Signals

TTFT (Time To First Token): Time from sending the request to receiving the first output token.
TPOT (Time Per Output Token): Time taken to generate each subsequent token.
Queue Wait Time: Time a request spends waiting in the server queue before it is scheduled.

These three signals change independently; the same symptom may have different root causes.
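To make the first two signals concrete, here is a minimal sketch of how TTFT and TPOT could be derived on the client side from token arrival timestamps. The function name, the dataclass, and the sample timestamps are hypothetical; queue wait time is not visible from the client and has to come from server-side metrics.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencySignals:
    ttft_s: float  # time to first token
    tpot_s: float  # mean time per output token after the first


def latency_signals(request_sent_s: float, token_times_s: Sequence[float]) -> LatencySignals:
    """Derive TTFT and TPOT from client-side timestamps (seconds)."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_sent_s
    # TPOT averages the gaps between consecutive tokens, so the first
    # token (which includes queueing and prefill) is excluded.
    gaps = len(token_times_s) - 1
    tpot = (token_times_s[-1] - token_times_s[0]) / gaps if gaps else 0.0
    return LatencySignals(ttft_s=ttft, tpot_s=tpot)


# Hypothetical timestamps: request sent at t=10.0 s, four tokens received.
signals = latency_signals(10.0, [10.25, 10.28, 10.31, 10.34])
print(f"TTFT={signals.ttft_s * 1000:.0f} ms, TPOT={signals.tpot_s * 1000:.0f} ms")
```

Keeping the first token out of the TPOT average is what lets the two signals move independently: a long prompt inflates TTFT without changing the gap between later tokens.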

Section 03

Experimental Environment and Configuration

The experiments used the following configuration:

Inference Framework: vLLM 0.6.6.post1
Model: Qwen2.5-7B-Instruct
GPU: NVIDIA A40 (48GB VRAM)
Deployment Mode: Single node
Monitoring Method: Collect Prometheus-format metrics from vLLM's /metrics endpoint

vLLM was chosen because it is a popular open-source inference engine, and Qwen2.5-7B-Instruct is suitable for resource-constrained scenarios.
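As a rough illustration of the monitoring method, the sketch below parses Prometheus text-format output into a dictionary using only the standard library. The parser is deliberately simplified, and the metric names and values in the sample are assumptions for illustration, not data from the experiment; check the actual /metrics output of your vLLM version.

```python
from typing import Dict


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: skips comment lines, and assumes no trailing timestamp
    and no spaces inside label values.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative sample, not captured from the experiment.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Qwen/Qwen2.5-7B-Instruct"} 3.0
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen2.5-7B-Instruct"} 0.42
"""

for series, value in parse_prometheus_text(SAMPLE).items():
    print(series, value)
```

In practice a Prometheus server would scrape the endpoint directly; a hand-rolled parser like this is mainly useful for quick one-off checks during an experiment.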

Section 04

Experiment 1: Impact of Context Length on TTFT

Experiment Design

Fix the number of output tokens at 64, gradually increase input prompt length, and observe metric changes:

Input Tokens	TTFT(ms)	Prefill(ms)	Decode(ms)	TPOT(ms)
121	37.3	36.5	1831.1	29.07
511	73.6	73.1	1818.1	28.86
2041	261.8	260.9	1824.4	28.96
8191	1736.0	1734.0	1934.3	30.70

Key Findings

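TTFT is almost equal to prefill time (the difference is at most 2 ms in these single-concurrency runs).
Prefill time grows superlinearly at longer inputs: going from 2041 to 8191 tokens (4x input) increases prefill time 6.6x, reflecting the O(n²) complexity of the attention mechanism. At shorter inputs the growth is still sublinear (121 to 511 tokens roughly doubles prefill time).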
TPOT remains relatively stable (29ms → 30.7ms); the decoding phase is minimally affected by input length.
Input length mainly drives prefill and TTFT, with little impact on the decoding phase.
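The findings above can be checked directly against the table. The sketch below recomputes the growth ratios between consecutive rows and the TPOT column. Dividing decode time by 63 (the 64 output tokens minus the first) reproduces the reported TPOT values, which suggests that is how TPOT was derived; treat that as an inference from the numbers rather than a documented definition.

```python
# Rows from the table above: (input tokens, TTFT ms, prefill ms, decode ms)
ROWS = [
    (121, 37.3, 36.5, 1831.1),
    (511, 73.6, 73.1, 1818.1),
    (2041, 261.8, 260.9, 1824.4),
    (8191, 1736.0, 1734.0, 1934.3),
]
OUTPUT_TOKENS = 64


def growth_ratios(rows):
    """Return (input growth, prefill growth) for each consecutive pair of rows."""
    return [(b[0] / a[0], b[2] / a[2]) for a, b in zip(rows, rows[1:])]


def tpot_ms(decode_ms, output_tokens=OUTPUT_TOKENS):
    """Decode time spread over the tokens generated after the first one."""
    return decode_ms / (output_tokens - 1)


for input_x, prefill_x in growth_ratios(ROWS):
    print(f"input x{input_x:.1f} -> prefill x{prefill_x:.1f}")
for tokens, _, _, decode in ROWS:
    print(f"{tokens:>5} input tokens: TPOT ~ {tpot_ms(decode):.2f} ms")
```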

Section 05

Experiment 2: Core Insights on Concurrency and Resource Saturation

Core Insights

Short Prompt Scenario: System concurrency is limited by GPU compute capacity; increasing concurrency leads to longer prefill time and worse TTFT.
Long Prompt Scenario: The limiting factor shifts to VRAM capacity; each request occupies a large amount of KV cache, significantly reducing concurrency.

Diagnostic Implications

When users report slow inference, distinguish between:

Compute Bottleneck: Use faster GPUs, model quantization, or distributed tensor parallelism.
VRAM Capacity Bottleneck: Reduce concurrency, enable KV cache compression, or use larger VRAM.
VRAM Bandwidth Bottleneck: Use higher-bandwidth VRAM or optimize attention implementation.
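The triage above could be sketched as a small rule-based classifier. Everything here is an assumption for illustration: the `Snapshot` fields, the function name, and especially the thresholds, which would need to be calibrated against a baseline for the specific model and GPU.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    ttft_ms: float         # time to first token, including queue wait
    tpot_ms: float         # time per output token
    queue_wait_ms: float   # time spent waiting before scheduling
    kv_cache_usage: float  # fraction of KV cache in use, 0.0 to 1.0


def likely_bottleneck(
    s: Snapshot,
    kv_full: float = 0.90,           # hypothetical threshold
    queue_slow_ms: float = 100.0,    # hypothetical threshold
    tpot_slow_ms: float = 50.0,      # hypothetical threshold
    prefill_slow_ms: float = 500.0,  # hypothetical threshold
) -> str:
    # Requests queue up while the KV cache is nearly full: capacity-bound.
    if s.kv_cache_usage >= kv_full and s.queue_wait_ms >= queue_slow_ms:
        return "vram_capacity"
    # Tokens stream slowly once generation has started: bandwidth-bound decode.
    if s.tpot_ms >= tpot_slow_ms:
        return "vram_bandwidth"
    # The first token is slow even after subtracting queue wait: compute-bound prefill.
    if s.ttft_ms - s.queue_wait_ms >= prefill_slow_ms:
        return "compute"
    return "unclear"


print(likely_bottleneck(Snapshot(ttft_ms=1736.0, tpot_ms=30.7, queue_wait_ms=2.0, kv_cache_usage=0.35)))
```

A heuristic like this is only a first pass. Its value is in forcing the three signals to be looked at together rather than reading "slow" off a single latency number.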

Section 06

Detailed Explanation of Key Concepts: Prefill, Decoding, and KV Cache

Prefill

Processes the input prompt, computing attention representations for all input tokens in parallel. This phase determines TTFT (the time users wait for the first token).

Decoding

Generates output tokens one by one, each depending on all previous tokens. Within a single sequence this step is inherently sequential and cannot be parallelized across tokens. This phase determines TPOT (how smoothly content is generated).
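One way to see why decoding is bandwidth-bound is a back-of-the-envelope floor: if every model weight must be read from VRAM once per generated token, per-token time cannot be lower than weight bytes divided by memory bandwidth. The sketch below uses assumed values that are not stated in the article (roughly 7.6 billion parameters stored in 16-bit precision, and 696 GB/s peak bandwidth for the A40).

```python
def decode_floor_ms(param_count: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time if every weight is read once per token."""
    weight_bytes = param_count * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1000.0


# Assumed values, not stated in the article: ~7.6e9 parameters in 16-bit
# precision, and 696 GB/s peak memory bandwidth for the A40.
floor = decode_floor_ms(7.6e9, 2, 696)
print(f"bandwidth floor ~ {floor:.1f} ms per token")
```

Under these assumptions the floor comes out at roughly 22 ms, the same order as the measured TPOT of about 29 ms. The estimate ignores KV cache reads, activations, and kernel launch overhead, so the real figure is expected to sit above it.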

KV Cache

Stores key-value vectors of tokens to avoid repeated computation during decoding, but grows linearly with sequence length and concurrency, easily becoming a VRAM bottleneck.
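The linear growth can be made concrete with a per-token size estimate: each token stores one key and one value vector per layer. The architecture values below (28 layers, 4 KV heads, head dimension 128, 16-bit values) are assumptions for a 7B-class model with grouped-query attention; read the real ones from the model's config.json.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed architecture values; verify against the model's config.json.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
for seq_len in (121, 2041, 8191):
    mib = per_token * seq_len / 2**20
    print(f"{seq_len:>5} tokens -> {mib:.1f} MiB of KV cache per sequence")
```

Multiplying the per-sequence figure by the number of concurrent sequences gives the total KV footprint, which is why long prompts and high concurrency compete for the same VRAM.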

Trade-off Between TTFT and TPOT

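Optimizing one may worsen the other: for example, increasing batch size improves throughput (more tokens generated per second overall) but increases queue latency (worse TTFT), and per-request TPOT may also rise as more sequences share each decoding step.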

Section 07

Production Environment Application Recommendations: Monitoring, Capacity Planning, and Future Directions

Monitoring Strategy

End-to-End Latency: Track P50/P95/P99 percentiles of TTFT and TPOT.
Queue Depth: Monitor the number of waiting requests.
VRAM Usage: Track KV cache occupancy and peak values.
Token Throughput: Track the number of tokens generated per second.
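For the percentile tracking mentioned above, a minimal nearest-rank implementation is sketched below. The sample values are hypothetical. When the data comes from Prometheus histograms, percentiles would instead be estimated from bucket counts on the server side.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Return the P50/P95/P99 values for a list of latency samples."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Hypothetical TTFT samples in milliseconds; one long-prompt request dominates the tail.
ttft_ms = [37, 39, 41, 40, 38, 74, 42, 39, 262, 1736]
print(summarize(ttft_ms))
```

With a sample like this the median stays near the short-prompt latency while P95 and P99 are set by the single long-prompt request, which is why averages alone hide the behavior that users with long contexts actually experience.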

Capacity Planning

Consider three dimensions: compute capacity (GPU compute power relative to model size), VRAM capacity (model parameters plus KV cache), and bandwidth capacity (VRAM bandwidth and the attention implementation).
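The VRAM capacity dimension lends itself to a simple budget calculation: whatever remains after model weights and runtime overhead is available for KV cache, and that bounds how many sequences fit at once. The sketch below is an estimate under stated assumptions. The weight size, the overhead figure, and the per-token KV size are hypothetical inputs, not measurements from the experiment.

```python
def max_concurrent_sequences(
    vram_gib: float,
    weights_gib: float,
    overhead_gib: float,
    kv_bytes_per_token: int,
    tokens_per_sequence: int,
) -> int:
    """Estimate how many sequences fit in the VRAM left over for KV cache."""
    kv_budget_bytes = (vram_gib - weights_gib - overhead_gib) * 2**30
    if kv_budget_bytes <= 0:
        return 0
    return int(kv_budget_bytes // (kv_bytes_per_token * tokens_per_sequence))


# Assumed inputs: 48 GiB card, ~14.2 GiB of 16-bit weights, 4 GiB reserved
# for activations and runtime overhead, 57344 bytes of KV cache per token.
short = max_concurrent_sequences(48, 14.2, 4, 57344, tokens_per_sequence=121 + 64)
long_ = max_concurrent_sequences(48, 14.2, 4, 57344, tokens_per_sequence=8191 + 64)
print(f"short prompts: ~{short} sequences, long prompts: ~{long_} sequences")
```

The absolute numbers depend heavily on the assumptions, but the ratio between the short and long cases shows how quickly long prompts shrink the achievable concurrency.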

Future Directions

Visualize multi-dimensional metrics with Prometheus/Grafana dashboards.
Implement inference-aware routing using Gateway API.
Explore monitoring and tuning strategies for distributed inference.

Section 08

Conclusion: Paradigm Shift and Core Insights for LLM Inference Services

LLM inference services mark a paradigm shift from traditional HTTP services to "Token Services", requiring rethinking of performance monitoring, capacity planning, and fault diagnosis methods.

Core Insights:

Symptom ≠ Cause: The same "slow" issue may stem from three different bottlenecks: compute, VRAM capacity, or bandwidth.
Multi-Dimensional Monitoring Is Necessary: A single metric cannot capture the complex behavior of inference services.
Input Features Determine Bottlenecks: Short and long prompts are limited by different resources, requiring different optimization strategies.

For Kubernetes platform engineers, understanding these differences is the foundation for providing reliable LLM services, and in-depth performance analysis will become an essential operations skill.
