Zing Forum

From HTTP Services to Token Services: A Practical Guide to LLM Inference Performance Diagnosis

This article analyzes how to diagnose performance problems in LLM inference services on Kubernetes. It uses vLLM experimental data to show how key metrics such as TTFT, TPOT, prefill time, and decode time relate to one another, helping platform engineers understand the multi-dimensional nature of inference latency.

Tags: LLM inference, vLLM, performance optimization, Kubernetes, GPU, TTFT, TPOT, KV cache, large language models, inference latency
Published 2026-06-15 07:45 | Recent activity 2026-06-15 07:51 | Estimated read 10 min

Section 01

Introduction: Paradigm Shift and Core Methods for LLM Inference Performance Diagnosis

Performance tuning for LLM inference services differs fundamentally from tuning conventional web services: traditional monitoring does not capture the state these services carry. Analysis therefore has to cover three resource dimensions (compute, memory bandwidth, memory capacity) and three latency signals (TTFT, TPOT, queue wait time).

Section 02

Background: Why Traditional Monitoring Fails for LLM Inference Services

Traditional HTTP service monitoring focuses on request success rate, response time, and CPU and memory usage. LLM inference services behave differently: their state spans three independent resource dimensions and three independent latency signals.

Three Resource Dimensions

  1. Compute Resources (Prefill Phase): Processing the input prompt and computing attention over it; cost is strongly correlated with input length.
  2. Memory Bandwidth (Decoding Phase): Generating each output token requires reading the KV cache, so this phase is limited by VRAM bandwidth.
  3. Memory Capacity (KV Cache): The VRAM space that stores attention key-value pairs; it is easily exhausted by long sequences or high concurrency.

Three Latency Signals

  • TTFT (Time To First Token): Time from sending the request to receiving the first output token
  • TPOT (Time Per Output Token): Average time to generate each output token after the first
  • Queue Wait Time: Time a request spends waiting in the server queue before it is scheduled

These three signals change independently; the same symptom may have different root causes.
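To make the three signals concrete, here is a minimal Python sketch that derives them from per-request timestamps. The `RequestTrace` type and its field names (`sent_at`, `scheduled_at`, `token_times`) are hypothetical and not taken from vLLM. Note that TTFT measured this way includes any queue wait.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (seconds) for one request; all field names are hypothetical."""
    sent_at: float            # client sent the request
    scheduled_at: float       # server started working on the request
    token_times: List[float]  # arrival time of each output token


def latency_signals(trace: RequestTrace) -> dict:
    """Derive TTFT, TPOT and queue wait from a single request trace."""
    first = trace.token_times[0]
    last = trace.token_times[-1]
    n_tokens = len(trace.token_times)
    # TPOT averages over the gaps between tokens, so it uses n - 1 intervals.
    tpot = (last - first) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return {
        "ttft": first - trace.sent_at,
        "tpot": tpot,
        "queue_wait": trace.scheduled_at - trace.sent_at,
    }


trace = RequestTrace(sent_at=0.0, scheduled_at=0.25,
                     token_times=[0.5, 0.75, 1.0, 1.25])
signals = latency_signals(trace)
print(signals)
```

Keeping the three values separate per request is what allows a slow response to be attributed to queueing, prefill, or decoding rather than to a single "latency" number.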

Section 03

Experimental Environment and Configuration

The experiments used the following configuration:

  • Inference Framework: vLLM 0.6.6.post1
  • Model: Qwen2.5-7B-Instruct
  • GPU: NVIDIA A40 (48GB VRAM)
  • Deployment Mode: Single node
  • Monitoring Method: Collect Prometheus-format metrics from vLLM's /metrics endpoint

vLLM was chosen because it is a popular open-source inference engine, and Qwen2.5-7B-Instruct is suitable for resource-constrained scenarios.
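As a rough illustration of the monitoring method, the sketch below parses Prometheus text-format lines into a dictionary. The sample payload is illustrative only: the metric names are assumptions based on names commonly seen in vLLM's output and should be checked against what your own `/metrics` endpoint returns, and the values are made up rather than captured from the experiment.

```python
from typing import Dict

# Illustrative sample only: metric names and values are assumptions,
# not output captured from the experiment.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Qwen2.5-7B-Instruct"} 3.0
vllm:num_requests_running{model_name="Qwen2.5-7B-Instruct"} 8.0
vllm:gpu_cache_usage_perc{model_name="Qwen2.5-7B-Instruct"} 0.42
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse simple Prometheus text lines into {metric_name: value}.

    Labels are dropped, so series that differ only by label overwrite
    each other. Lines carrying a trailing timestamp are not handled.
    """
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        values[name] = float(value)
    return values


metrics = parse_metrics(SAMPLE)
print(metrics)
```

In a live setup the text would come from an HTTP GET against the server's `/metrics` path (for example via `urllib.request.urlopen`); the host and port depend on the deployment. A production scraper would normally be Prometheus itself rather than hand-written parsing.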

Section 04

Experiment 1: Impact of Context Length on TTFT

Experiment Design

Fix the number of output tokens at 64, gradually increase input prompt length, and observe metric changes:

Input Tokens   TTFT (ms)   Prefill (ms)   Decode (ms)   TPOT (ms)
121            37.3        36.5           1831.1        29.07
511            73.6        73.1           1818.1        28.86
2041           261.8       260.9          1824.4        28.96
8191           1736.0      1734.0         1934.3        30.70

Key Findings

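  • TTFT is almost equal to prefill time in single-concurrency scenarios: the difference is under 1 ms for inputs up to 2041 tokens and 2 ms at 8191 tokens.
  • Prefill time grows superlinearly at long inputs: going from 2041 to 8191 tokens (4x) increases prefill time about 6.6x, consistent with the O(n²) cost of attention becoming dominant. At short inputs growth is sublinear (121 to 511 tokens roughly doubles prefill time), which suggests fixed overheads dominate there.
  • TPOT remains relatively stable (about 29 ms up to 2041 input tokens, 30.7 ms at 8191); the decoding phase is only mildly affected by input length.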
  • Input length mainly drives prefill and TTFT, with little impact on the decoding phase.
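The findings above can be re-derived from the table with a few lines of Python. This is only a sanity check on the published numbers; the one assumption is that TPOT is averaged over the 63 tokens generated after the first one, which the decode and TPOT columns appear to be consistent with.

```python
# Rows copied from the Experiment 1 table:
# (input_tokens, ttft_ms, prefill_ms, decode_ms, tpot_ms)
ROWS = [
    (121, 37.3, 36.5, 1831.1, 29.07),
    (511, 73.6, 73.1, 1818.1, 28.86),
    (2041, 261.8, 260.9, 1824.4, 28.96),
    (8191, 1736.0, 1734.0, 1934.3, 30.70),
]
OUTPUT_TOKENS = 64

# Growth factors between consecutive rows.
growth = []
for prev, cur in zip(ROWS, ROWS[1:]):
    input_ratio = cur[0] / prev[0]
    prefill_ratio = cur[2] / prev[2]
    growth.append((prev[0], cur[0], input_ratio, prefill_ratio))
    print(f"{prev[0]:>5} -> {cur[0]:>5}: "
          f"input x{input_ratio:.2f}, prefill x{prefill_ratio:.2f}")

# Assumption: TPOT is decode time spread over the tokens after the first one.
tpot_check = [decode / (OUTPUT_TOKENS - 1) for _, _, _, decode, _ in ROWS]

# Gap between TTFT and prefill time for each row, in ms.
ttft_gap = [round(ttft - prefill, 1) for _, ttft, prefill, _, _ in ROWS]
print(tpot_check)
print(ttft_gap)
```

Comparing the input ratio with the prefill ratio per step shows where growth turns from sublinear to superlinear, which is more informative than a single overall ratio.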

Section 05

Experiment 2: Core Insights on Concurrency and Resource Saturation

Core Insights

  • Short Prompt Scenario: System concurrency is limited by GPU compute capacity; increasing concurrency leads to longer prefill time and worse TTFT.
  • Long Prompt Scenario: The limiting factor shifts to VRAM capacity; each request occupies a large amount of KV cache, significantly reducing concurrency.

Diagnostic Implications

When users report slow inference, first identify which of the following bottlenecks applies, since each calls for a different remedy:

  1. Compute Bottleneck: Use faster GPUs, quantize the model, or spread the model across GPUs with tensor parallelism.
  2. VRAM Capacity Bottleneck: Reduce concurrency, enable KV cache compression, or use GPUs with more VRAM.
  3. VRAM Bandwidth Bottleneck: Use GPUs with higher VRAM bandwidth or optimize the attention implementation.
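One way to turn this distinction into a first-pass triage rule is sketched below. The function name, its inputs, and every threshold are hypothetical placeholders chosen for illustration; the mapping simply follows the article's pairing of prefill with compute, decoding with bandwidth, and KV cache with capacity, and is not a validated diagnostic.

```python
def suspect_bottleneck(ttft_ms: float, tpot_ms: float, kv_cache_usage: float,
                       waiting_requests: int,
                       ttft_budget_ms: float = 500.0,
                       tpot_budget_ms: float = 50.0,
                       kv_cache_limit: float = 0.9) -> str:
    """Return a first guess at the limiting resource.

    All thresholds are hypothetical placeholders; tune them per workload.
    """
    # A nearly full KV cache with requests queued points to VRAM capacity.
    if kv_cache_usage >= kv_cache_limit and waiting_requests > 0:
        return "vram_capacity"
    # A slow first token without cache pressure points to prefill (compute).
    if ttft_ms > ttft_budget_ms:
        return "compute"
    # Slow steady-state generation points to decoding (memory bandwidth).
    if tpot_ms > tpot_budget_ms:
        return "vram_bandwidth"
    return "no_obvious_bottleneck"


print(suspect_bottleneck(ttft_ms=1736.0, tpot_ms=30.7,
                         kv_cache_usage=0.3, waiting_requests=0))
print(suspect_bottleneck(ttft_ms=100.0, tpot_ms=30.0,
                         kv_cache_usage=0.95, waiting_requests=4))
```

The capacity check comes first because a full KV cache causes queueing, which inflates TTFT and could otherwise be misread as a compute problem.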

Section 06

Detailed Explanation of Key Concepts: Prefill, Decoding, and KV Cache

Prefill

Processes the input prompt, computing attention representations for all input tokens in parallel. This phase determines TTFT (the time users wait for the first token).

Decoding

Generates output tokens one at a time, each depending on all previous tokens. Within a single request this step is inherently sequential and cannot be parallelized across tokens; it determines TPOT (how smoothly content streams to the user).

KV Cache

Stores key-value vectors of tokens to avoid repeated computation during decoding, but grows linearly with sequence length and concurrency, easily becoming a VRAM bottleneck.
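The linear growth can be made concrete with a small estimate. The sketch below computes KV cache size from model shape; the shape values used for Qwen2.5-7B-Instruct (28 layers, 4 KV heads, head dimension 128) and the FP16 cache type are assumptions that should be verified against the model's `config.json` and the serving configuration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_bytes(seq_len: int, concurrency: int, per_token: int) -> int:
    """Total KV cache bytes for `concurrency` sequences of `seq_len` tokens."""
    return seq_len * concurrency * per_token


# Assumed Qwen2.5-7B-Instruct shape (check config.json): 28 layers,
# 4 KV heads (grouped-query attention), head dimension 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
one_long_request = kv_bytes(seq_len=8191, concurrency=1, per_token=per_token)
print(f"{per_token} bytes per token, "
      f"{one_long_request / 2**20:.0f} MiB for one 8191-token sequence")
```

Because the total is a plain product, doubling either the sequence length or the number of concurrent sequences doubles the cache footprint.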

Trade-off Between TTFT and TPOT

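Optimizing one may worsen the other. For example, a larger batch size raises aggregate token throughput, but requests may queue longer before being scheduled (worse TTFT), and per-request TPOT tends to rise rather than fall as each decode step serves more sequences. Conversely, prioritizing the prefill of new requests lowers TTFT but can stall in-flight decodes (worse TPOT).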

Section 07

Production Environment Application Recommendations: Monitoring, Capacity Planning, and Future Directions

Monitoring Strategy

  1. End-to-End Latency: Track P50/P95/P99 percentiles of TTFT and TPOT.
  2. Queue Depth: Monitor the number of waiting requests.
  3. VRAM Usage: Track KV cache occupancy and peak values.
  4. Token Throughput: Track the number of tokens generated per second.
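For the percentile tracking in item 1, a minimal nearest-rank implementation is sketched below. The sample values are hypothetical and chosen to have a slow tail; in practice these percentiles would usually be computed by the metrics backend from histogram buckets rather than from raw samples.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of the samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples: Sequence[float]) -> Dict[str, float]:
    """P50/P95/P99 of a latency series such as per-request TTFT in ms."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Hypothetical TTFT samples in milliseconds, with a slow tail.
ttft_ms = [40.0] * 90 + [300.0] * 8 + [1700.0] * 2
summary = latency_summary(ttft_ms)
print(summary)
```

With a distribution like this the median looks healthy while P95 and P99 expose the long-prompt or queued requests, which is why the tail percentiles are worth tracking separately.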

Capacity Planning

Consider three dimensions: compute capacity (GPU compute throughput relative to model size), VRAM capacity (model parameters plus KV cache), and bandwidth capacity (VRAM bandwidth and the attention implementation).
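For the VRAM capacity dimension, a back-of-the-envelope budget can be sketched as follows. Every number in the example is an assumption: 48 GiB of VRAM for the A40, roughly 15 GiB of FP16 weights for a 7B model, 57344 bytes of KV cache per token (a 28-layer model with 4 KV heads of dimension 128 in FP16), and a 0.9 usable fraction. The function itself is hypothetical and ignores activation memory and allocator overhead.

```python
def max_concurrent_sequences(vram_bytes: int, weight_bytes: int,
                             kv_bytes_per_token: int, tokens_per_sequence: int,
                             usable_fraction: float = 0.9) -> int:
    """Rough upper bound on sequences whose KV cache fits in VRAM at once.

    Ignores activation memory and allocator overhead, so real limits are lower.
    """
    kv_budget = int(vram_bytes * usable_fraction) - weight_bytes
    if kv_budget <= 0:
        return 0
    return kv_budget // (kv_bytes_per_token * tokens_per_sequence)


GIB = 2**30
# All inputs below are assumptions for illustration, not measured values.
short_prompt = max_concurrent_sequences(
    vram_bytes=48 * GIB, weight_bytes=15 * GIB,
    kv_bytes_per_token=57344, tokens_per_sequence=121 + 64)
long_prompt = max_concurrent_sequences(
    vram_bytes=48 * GIB, weight_bytes=15 * GIB,
    kv_bytes_per_token=57344, tokens_per_sequence=8191 + 64)
print(short_prompt, long_prompt)
```

The gap between the two results illustrates why long prompts shift the limit to VRAM capacity: the same memory budget holds far fewer long sequences than short ones.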

Future Directions

  1. Visualize multi-dimensional metrics with Prometheus/Grafana dashboards.
  2. Implement inference-aware routing using Gateway API.
  3. Explore monitoring and tuning strategies for distributed inference.

Section 08

Conclusion: Paradigm Shift and Core Insights for LLM Inference Services

LLM inference services mark a paradigm shift from traditional HTTP services to "Token Services", one that requires rethinking performance monitoring, capacity planning, and fault diagnosis.

Core Insights:

  1. Symptom ≠ Cause: The same "slow" issue may stem from three different bottlenecks: compute, VRAM capacity, or bandwidth.
  2. Multi-Dimensional Monitoring Is Necessary: A single metric cannot capture the complex behavior of inference services.
  3. Input Features Determine Bottlenecks: Short and long prompts are limited by different resources, requiring different optimization strategies.

For Kubernetes platform engineers, understanding these differences is the foundation for providing reliable LLM services, and in-depth performance analysis will become an essential operations skill.