# From HTTP Services to Token Services: A Practical Guide to LLM Inference Performance Diagnosis

> This article examines how to diagnose performance problems in LLM inference services on Kubernetes platforms. It uses vLLM experimental data to show how key metrics such as TTFT, TPOT, prefill time, and decode time relate to one another, helping platform engineers understand the multi-dimensional nature of inference latency.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-14T23:45:59.000Z
- Last activity: 2026-06-14T23:51:48.409Z
- Popularity: 163.9
- Keywords: LLM inference, vLLM, performance optimization, Kubernetes, GPU, TTFT, TPOT, KV cache, large language models, inference latency
- Page link: https://www.zingnex.cn/en/forum/thread/httptoken-llm
- Canonical: https://www.zingnex.cn/forum/thread/httptoken-llm
- Markdown source: floors_fallback

---

## Introduction: Paradigm Shift and Core Methods for LLM Inference Performance Diagnosis

Tuning LLM inference services differs fundamentally from tuning conventional web services. Traditional monitoring does not capture their stateful behavior, so analysis must consider three resource dimensions (compute, memory bandwidth, memory capacity) and three latency signals (TTFT, TPOT, queue wait time).

## Background: Why Traditional Monitoring Fails for LLM Inference Services

Traditional HTTP service monitoring focuses on request success rate, response time, CPU, and memory usage, but LLM inference services have unique state characteristics involving three independent resource dimensions and three independent latency signals.

### Three Resource Dimensions
1. **Compute Resources (Prefill Phase)**: The prefill phase processes the input prompt and computes attention over all input tokens, so its cost is strongly correlated with input length.
2. **Memory Bandwidth (Decoding Phase)**: Each generated token requires reading the model weights and the KV cache from VRAM, so decoding is limited by VRAM bandwidth.
3. **Memory Capacity (KV Cache)**: The KV cache holds the attention key-value pairs in VRAM and is easily exhausted by long sequences or high concurrency.

### Three Latency Signals
- **TTFT (Time To First Token)**: Time from request sending to receiving the first output token
- **TPOT (Time Per Output Token)**: Time taken to generate each subsequent token
- **Queue Wait Time**: Time requests spend waiting in the server queue

These three signals can vary independently, so the same symptom may have different root causes.
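To make the three definitions concrete, here is a minimal sketch that derives them from per-request timestamps. The `RequestTrace` structure and its field names are hypothetical, not part of vLLM; in particular, `scheduled_at` would have to come from server-side data, since a client cannot observe queueing directly.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float            # client sent the request
    scheduled_at: float       # server started working on it (server-side data)
    token_times: List[float]  # arrival time of each output token, in order


def latency_signals(trace: RequestTrace) -> Dict[str, float]:
    """Return TTFT, TPOT, and queue wait time for a single request."""
    if not trace.token_times:
        raise ValueError("trace has no output tokens")
    first, last = trace.token_times[0], trace.token_times[-1]
    n = len(trace.token_times)
    return {
        # Client-observed TTFT includes any time spent queued on the server.
        "ttft_s": first - trace.sent_at,
        # TPOT averages over the tokens generated after the first one.
        "tpot_s": (last - first) / (n - 1) if n > 1 else 0.0,
        "queue_wait_s": trace.scheduled_at - trace.sent_at,
    }


trace = RequestTrace(sent_at=0.0, scheduled_at=0.01,
                     token_times=[0.05, 0.08, 0.11, 0.14])
signals = latency_signals(trace)
```

Because TTFT as seen by the client contains the queue wait, a rising TTFT with a flat prefill time points at queueing rather than at compute.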

## Experimental Environment and Configuration

The experiments used the following configuration:
- **Inference Framework**: vLLM 0.6.6.post1
- **Model**: Qwen2.5-7B-Instruct
- **GPU**: NVIDIA A40 (48GB VRAM)
- **Deployment Mode**: Single node
- **Monitoring Method**: Collect Prometheus-format metrics from vLLM's /metrics endpoint

vLLM was chosen because it is a popular open-source inference engine, and Qwen2.5-7B-Instruct is suitable for resource-constrained scenarios.
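The monitoring method above relies on the Prometheus text exposition format. The sketch below parses such text into samples without any third-party library. The sample text is illustrative rather than captured output, and the metric names in it (`vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) are what vLLM is understood to expose; they should be checked against the `/metrics` output of the deployed version.

```python
import re
from typing import Dict, List, Tuple

# Matches sample lines such as: metric_name{label="x"} 12.5
# Simplification: label values containing "}" are not handled.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text: str) -> Dict[str, List[Tuple[str, float]]]:
    """Group samples by metric name; each sample is (raw_labels, value)."""
    samples: Dict[str, List[Tuple[str, float]]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        samples.setdefault(name, []).append((labels, float(value)))
    return samples


# Illustrative input only; values are made up for the example.
SAMPLE_TEXT = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Qwen/Qwen2.5-7B-Instruct"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen2.5-7B-Instruct"} 0.42
"""

parsed = parse_metrics(SAMPLE_TEXT)
waiting = parsed["vllm:num_requests_waiting"][0][1]
```

Against a live server, the same function could be fed the body returned by the service's `/metrics` URL, for example via `urllib.request.urlopen`; the host and port depend on the deployment.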

## Experiment 1: Impact of Context Length on TTFT

### Experiment Design
Fix the number of output tokens at 64, increase the input prompt length roughly fourfold at each step, and observe how the metrics change:

| Input Tokens | TTFT (ms) | Prefill (ms) | Decode (ms) | TPOT (ms) |
|--------------|----------|-------------|------------|----------|
| 121          | 37.3     | 36.5        | 1831.1     | 29.07    |
| 511          | 73.6     | 73.1        | 1818.1     | 28.86    |
| 2041         | 261.8    | 260.9       | 1824.4     | 28.96    |
| 8191         | 1736.0   | 1734.0      | 1934.3     | 30.70    |

### Key Findings
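- TTFT is almost equal to prefill time in this single-concurrency setup: the difference is under 1 ms for the first three rows and 2 ms at 8191 input tokens.
- Prefill time grows faster as the context lengthens. The first two roughly 4x steps in input length raise prefill time by about 2.0x and 3.6x, while the last step (2041 → 8191 tokens) raises it by about 6.6x. The superlinear growth at long context reflects the O(n²) cost of the attention mechanism.
- TPOT remains relatively stable (29.07 ms → 30.70 ms, about 6% higher at 8191 input tokens); the decoding phase is minimally affected by input length.
- Input length mainly drives prefill time and TTFT, with little impact on the decoding phase.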
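The ratios behind these findings can be recomputed directly from the table. The sketch below does so, and also checks one observation that the article does not state explicitly: the TPOT column is consistent with decode time divided by 63, that is, the 64 output tokens minus the first one.

```python
# Rows copied from the table: (input_tokens, ttft_ms, prefill_ms, decode_ms, tpot_ms)
ROWS = [
    (121, 37.3, 36.5, 1831.1, 29.07),
    (511, 73.6, 73.1, 1818.1, 28.86),
    (2041, 261.8, 260.9, 1824.4, 28.96),
    (8191, 1736.0, 1734.0, 1934.3, 30.70),
]
OUTPUT_TOKENS = 64


def growth_ratios(rows):
    """For each consecutive pair of rows, return (input growth, prefill growth)."""
    return [(cur[0] / prev[0], cur[2] / prev[2])
            for prev, cur in zip(rows, rows[1:])]


def implied_tpot_ms(decode_ms, output_tokens=OUTPUT_TOKENS):
    """Average time per token over the tokens generated after the first one."""
    return decode_ms / (output_tokens - 1)


ratios = growth_ratios(ROWS)
for (input_x, prefill_x), row in zip(ratios, ROWS[1:]):
    print(f"up to {row[0]} tokens: input x{input_x:.2f}, prefill x{prefill_x:.2f}")

ttft_gaps = [round(row[1] - row[2], 1) for row in ROWS]  # TTFT minus prefill, ms
```

Only the last step shows prefill growing faster than the input, which is why the superlinear behavior is easy to miss when testing with short prompts only.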

## Experiment 2: Core Insights on Concurrency and Resource Saturation

### Core Insights
- **Short Prompt Scenario**: System concurrency is limited by GPU compute capacity; increasing concurrency leads to longer prefill time and worse TTFT.
- **Long Prompt Scenario**: The limiting factor shifts to VRAM capacity; each request occupies a large amount of KV cache, significantly reducing concurrency.

### Diagnostic Implications
When users report slow inference, distinguish between:
1. Compute Bottleneck: Use faster GPUs, model quantization, or distributed tensor parallelism.
2. VRAM Capacity Bottleneck: Reduce concurrency, enable KV cache compression, or use larger VRAM.
3. VRAM Bandwidth Bottleneck: Use higher-bandwidth VRAM or optimize attention implementation.
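The distinction above can be sketched as a rule-based triage function. This is a heuristic illustration only: the `Snapshot` and `Baseline` structures, the labels, and every threshold are hypothetical and would need calibration against a real deployment.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Current observations (hypothetical structure)."""
    ttft_ms: float
    tpot_ms: float
    queue_wait_ms: float
    kv_cache_usage: float  # fraction of KV cache in use, 0.0 to 1.0


@dataclass
class Baseline:
    """Values measured on the same hardware under light load."""
    ttft_ms: float
    tpot_ms: float


def triage(s: Snapshot, base: Baseline, slow_factor: float = 2.0,
           cache_full: float = 0.9, queue_ms: float = 100.0) -> str:
    """Map latency signals to a likely bottleneck. All thresholds are illustrative."""
    # A nearly full KV cache plus queued requests suggests VRAM capacity.
    if s.kv_cache_usage >= cache_full and s.queue_wait_ms >= queue_ms:
        return "vram_capacity"
    # TTFT with queue wait removed approximates prefill time, a compute signal.
    if s.ttft_ms - s.queue_wait_ms >= slow_factor * base.ttft_ms:
        return "compute"
    # Slow token generation without the above points at memory bandwidth.
    if s.tpot_ms >= slow_factor * base.tpot_ms:
        return "vram_bandwidth"
    return "no_clear_bottleneck"


base = Baseline(ttft_ms=260.0, tpot_ms=29.0)
verdict = triage(Snapshot(ttft_ms=1800.0, tpot_ms=30.0,
                          queue_wait_ms=1200.0, kv_cache_usage=0.97), base)
```

The order of the checks matters: queue wait inflates TTFT, so the capacity check runs first and the compute check subtracts queue wait before comparing.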

## Detailed Explanation of Key Concepts: Prefill, Decoding, and KV Cache

### Prefill
Prefill processes the input prompt, computing attention representations for all input tokens in parallel. It determines TTFT (the time users wait for the first token).

### Decoding
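Decoding generates output tokens one by one, each depending on all previous tokens. Within a single sequence this step is inherently sequential and cannot be parallelized across tokens, although multiple requests can be batched together. It determines TPOT (how fluently content is generated).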
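A back-of-the-envelope bound illustrates why decoding tends to be bandwidth-limited: each decode step must read at least the model weights from VRAM once. The sketch below computes that floor under two assumptions that are not stated in the article: roughly 7.6 billion parameters stored in 16-bit precision, and a peak A40 memory bandwidth of 696 GB/s taken from the vendor datasheet.

```python
def decode_floor_ms(bytes_read_per_step: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per decode step if it is purely bandwidth-bound."""
    return bytes_read_per_step / bandwidth_bytes_per_s * 1000.0


# Assumptions (not from the article):
WEIGHT_BYTES = 7.6e9 * 2   # ~7.6B parameters at 2 bytes each
A40_BANDWIDTH = 696e9      # datasheet peak, bytes per second

floor_ms = decode_floor_ms(WEIGHT_BYTES, A40_BANDWIDTH)
measured_tpot_ms = 29.07   # from the 121-token row of Experiment 1
ratio = measured_tpot_ms / floor_ms
print(f"floor {floor_ms:.1f} ms, measured {measured_tpot_ms} ms, ratio {ratio:.2f}")
```

Under these assumptions the floor is roughly 22 ms per token, so the measured TPOT of about 29 ms sits within a small factor of it. That is consistent with, though not proof of, a bandwidth-bound decoding phase.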

### KV Cache
The KV cache stores the key and value vectors of previously processed tokens so they are not recomputed at every decoding step. Its size grows linearly with sequence length and with concurrency, so it easily becomes a VRAM bottleneck.
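The linear growth can be quantified with a short calculation. The sketch below assumes the Qwen2.5-7B architecture values as published in the model configuration (28 layers, 4 key-value heads, head dimension 128) and a 16-bit cache; these numbers are not given in the article and should be verified against the model's `config.json`.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_bytes_total(per_token: int, seq_len: int, concurrency: int) -> int:
    """KV cache size for `concurrency` sequences of `seq_len` tokens each."""
    return per_token * seq_len * concurrency


# Assumed architecture values; verify against the model's config.json.
QWEN25_7B = {"num_layers": 28, "num_kv_heads": 4, "head_dim": 128}

per_token = kv_bytes_per_token(**QWEN25_7B)
# Longest case from Experiment 1: 8191 input tokens plus 64 output tokens.
one_long_request = kv_bytes_total(per_token, seq_len=8191 + 64, concurrency=1)
print(f"{per_token / 1024:.0f} KiB per token, "
      f"{one_long_request / 1024**2:.0f} MiB for one 8255-token sequence")
```

Under these assumptions each token costs 56 KiB, so a single long request from Experiment 1 holds several hundred MiB, while a short one holds only a few MiB.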

### Trade-off Between TTFT and TPOT
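Optimizing one may worsen the other. For example, a larger batch size raises aggregate token throughput, but requests may wait longer before their prefill runs (worse TTFT), and each decode step serves more sequences, which can also lengthen per-request TPOT. Conversely, prioritizing prefill for newly arrived requests improves TTFT but can stall decoding for requests already in flight (worse TPOT).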

## Production Environment Application Recommendations: Monitoring, Capacity Planning, and Future Directions

### Monitoring Strategy
1. **End-to-End Latency**: Track P50/P95/P99 percentiles of TTFT and TPOT.
2. **Queue Depth**: Monitor the number of waiting requests.
3. **VRAM Usage**: Track KV cache occupancy and peak values.
4. **Token Throughput**: Track the number of tokens generated per second.
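When latency is exported as a Prometheus histogram, percentiles such as P50, P95, and P99 have to be estimated from cumulative bucket counts. The sketch below follows the usual linear-interpolation approach, similar in spirit to PromQL's `histogram_quantile`. The bucket boundaries and counts are made up for illustration and are not measurements.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Cannot interpolate inside the +Inf bucket or an empty bucket.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative TTFT buckets in seconds: (upper bound, cumulative count).
TTFT_BUCKETS = [(0.05, 10), (0.1, 60), (0.25, 90), (0.5, 99), (math.inf, 100)]

p50 = histogram_quantile(0.50, TTFT_BUCKETS)
p95 = histogram_quantile(0.95, TTFT_BUCKETS)
```

The estimate is only as fine as the bucket boundaries, so tail percentiles that fall into a wide top bucket should be read as rough indications.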

### Capacity Planning
Consider three dimensions: compute capacity (GPU compute throughput relative to model size), VRAM capacity (model parameters plus KV cache), and bandwidth capacity (VRAM bandwidth and the attention implementation).
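For the VRAM capacity dimension, a rough upper bound on concurrency follows from subtracting the model weights from the usable memory and dividing by the per-sequence KV cache size. The sketch below is an estimate under stated assumptions, none of which come from the article: about 15.2 GB of 16-bit weights, 56 KiB of KV cache per token, and a usable memory fraction of 0.9 (understood to be the default of vLLM's `gpu_memory_utilization` setting). It ignores activation memory and other runtime overhead, so real limits will be lower.

```python
def kv_budget_bytes(total_vram_bytes: int, weight_bytes: int,
                    memory_fraction: float = 0.9) -> int:
    """VRAM left for the KV cache after weights, within the usable fraction."""
    return max(0, int(total_vram_bytes * memory_fraction) - weight_bytes)


def max_concurrent_sequences(budget_bytes: int, kv_bytes_per_token: int,
                             tokens_per_sequence: int) -> int:
    """Upper bound on sequences whose KV cache fits in the budget."""
    return budget_bytes // (kv_bytes_per_token * tokens_per_sequence)


# Assumptions, not measurements:
TOTAL_VRAM = 48 * 1024**3        # A40 with 48 GB, treated here as 48 GiB
WEIGHT_BYTES = 15_200_000_000    # ~7.6B parameters at 2 bytes each
KV_PER_TOKEN = 57_344            # 56 KiB per token for a 16-bit KV cache

budget = kv_budget_bytes(TOTAL_VRAM, WEIGHT_BYTES)
short_prompts = max_concurrent_sequences(budget, KV_PER_TOKEN, 121 + 64)
long_prompts = max_concurrent_sequences(budget, KV_PER_TOKEN, 8191 + 64)
print(f"short prompts: up to {short_prompts}, long prompts: up to {long_prompts}")
```

The gap between the two results, thousands of short sequences versus a few dozen long ones, shows why long prompts shift the limiting factor toward VRAM capacity well before compute is saturated.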

### Future Directions
1. Visualize multi-dimensional metrics with Prometheus/Grafana dashboards.
2. Implement inference-aware routing using Gateway API.
3. Explore monitoring and tuning strategies for distributed inference.

## Conclusion: Paradigm Shift and Core Insights for LLM Inference Services

LLM inference services mark a paradigm shift from traditional HTTP services to "Token Services", requiring rethinking of performance monitoring, capacity planning, and fault diagnosis methods.

Core Insights:
1. **Symptom ≠ Cause**: The same "slow" issue may stem from three different bottlenecks: compute, VRAM capacity, or bandwidth.
2. **Multi-Dimensional Monitoring Is Necessary**: A single metric cannot capture the complex behavior of inference services.
3. **Input Features Determine Bottlenecks**: Short and long prompts are limited by different resources, requiring different optimization strategies.

For Kubernetes platform engineers, understanding these differences is the foundation for providing reliable LLM services, and in-depth performance analysis will become an essential operations skill.
