# Service-Induced Congestion: The Hidden Performance Killer of Memory-Constrained LLM Inference

> The study identifies "service-induced congestion" in LLM inference: the KV cache grows continuously during decoding and builds memory pressure, the resulting request evictions cost up to 50% of throughput, and the authors propose a stability criterion for heterogeneous workloads.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-14T02:49:03.000Z
- Last activity: 2026-06-16T01:53:49.703Z
- Popularity: 103.9
- Keywords: LLM inference, KV cache, memory management, service congestion, batching optimization, throughput optimization, scheduling algorithms, stability analysis
- Page link: https://www.zingnex.cn/en/forum/thread/llm-9e794be1
- Canonical: https://www.zingnex.cn/forum/thread/llm-9e794be1

---

## [Main Floor/Introduction] Service-Induced Congestion: The Hidden Performance Killer of Memory-Constrained LLM Inference

The study identifies "service-induced congestion" in LLM inference: continuous KV cache growth builds memory pressure, and the resulting evictions of active requests cost up to 50% of throughput. Using a discrete-time dynamic model, the authors give what is presented as the first systematic analysis of the problem, and they propose a stability criterion for heterogeneous workloads together with scheduling design principles.

**Original Authors and Source**: 
- Author Team: Paper author team (arXiv:2606.15555v1)
- Source: arXiv
- Original Title: Service-Induced Congestion in Memory-Constrained LLM Serving
- Link: <http://arxiv.org/abs/2606.15555v1>
- Publication Time: June 14, 2026

## [Problem Background] Endogenous Growth of KV Cache and Memory Pressure

Modern LLMs generate text autoregressively: each new token attends to the keys and values cached for all previous tokens, so the KV cache grows with every decoding step. The requests in a batch share GPU memory, so aggregate memory usage grows endogenously over time, even when input lengths are fixed. When capacity runs out, the system is forced to evict active requests, discard their computed KV cache, and restart them, which wastes computation and causes a sudden drop in throughput.
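
To make the growth concrete, here is a minimal sketch of the standard KV cache size arithmetic for a decoder-only transformer. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements), the batch shape, and the function names are illustrative assumptions, not values taken from the paper.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def batch_kv_bytes(batch_size: int, prompt_len: int, decode_step: int,
                   per_token: int) -> int:
    """Aggregate KV cache of a batch after each request generated `decode_step` tokens."""
    return batch_size * (prompt_len + decode_step) * per_token


# Hypothetical model shape, chosen only to give round numbers.
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

if __name__ == "__main__":
    for step in (0, 512, 1024):
        gib = batch_kv_bytes(16, 1024, step, PER_TOKEN) / 2**30
        print(f"decode step {step:4d}: {gib:.1f} GiB")
```

Under these assumed dimensions, a batch of 16 requests with 1024-token prompts starts at 8 GiB of KV cache and reaches 16 GiB after 1024 decode steps, although no new request has been admitted.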

## [Key Findings] Structural Instability of Homogeneous Workloads and Worst-Case Limit Cycles

The study establishes a discrete-time dynamic model covering request admission, memory growth, and eviction mechanisms. Under saturated input:
1. **No-eviction fixed point is unstable**: For homogeneous workloads (identical input and output lengths), a no-eviction equilibrium exists in theory but is unstable;
2. **Worst-case limit cycle**: The system almost certainly converges to a unique worst-case limit cycle, with throughput loss of up to 50%. This indicates that service-induced congestion is a structural instability of memory-constrained LLM serving.
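
The admission, growth, and eviction loop can be illustrated with a toy simulation. This is a sketch under simplifying assumptions of my own (memory counted in tokens, greedy admission on current usage, newest-first eviction, evicted work discarded); it is not the paper's model and does not reproduce its 50% bound. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    decode_len: int
    generated: int = 0

    @property
    def tokens(self) -> int:
        # KV cache footprint in tokens: prompt plus tokens generated so far.
        return self.prompt_len + self.generated


def used_tokens(active) -> int:
    return sum(r.tokens for r in active)


def simulate(capacity: int, prompt_len: int, decode_len: int, steps: int,
             reserve_peak: bool = False) -> dict:
    """Toy homogeneous workload under saturated input (assumed dynamics)."""
    active = []
    completed = evictions = wasted_tokens = 0
    for _ in range(steps):
        # 1. Admission: a new request is always waiting (saturated input).
        while True:
            if reserve_peak:
                # Conservative: budget every request at its final size.
                fits = (len(active) + 1) * (prompt_len + decode_len) <= capacity
            else:
                # Greedy: look only at memory in use right now.
                fits = used_tokens(active) + prompt_len <= capacity
            if not fits:
                break
            active.append(Request(prompt_len, decode_len))
        # 2. Growth: every active request generates one token.
        for r in active:
            r.generated += 1
        # 3. Completion: finished requests release their memory.
        completed += sum(r.generated >= r.decode_len for r in active)
        active = [r for r in active if r.generated < r.decode_len]
        # 4. Eviction: drop the newest requests until the batch fits again.
        while used_tokens(active) > capacity:
            victim = active.pop()
            evictions += 1
            wasted_tokens += victim.generated
    return {"completed": completed, "evictions": evictions,
            "wasted_tokens": wasted_tokens}


if __name__ == "__main__":
    print("greedy :", simulate(100, 10, 10, 50))
    print("reserve:", simulate(100, 10, 10, 50, reserve_peak=True))
```

In this toy run, greedy admission fills memory at admission time and then has to evict as the admitted requests grow, while the peak-reserving variant never evicts. The sketch is meant only to show how eviction arises from the service process itself, not to quantify the loss.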

## [Key Breakthrough] Stability Criterion for Heterogeneous Workloads

For heterogeneous workloads (differing input/output lengths), the study derives the following results:
- **Two-category scenario**: A stability criterion is proven to exist. The key is the "survival polynomial mechanism": requests of different lengths complete at different times, which breaks synchronization;
- **Coprime decoding lengths**: Under input-dominated scaling conditions, coprime decoding lengths can stabilize the no-eviction equilibrium, while non-coprime lengths tend to cause synchronization instability. This gives a guideline for scheduling design: use workload heterogeneity to suppress congestion.
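
The arithmetic part of the coprimality condition is easy to check for a two-class workload. The helper below is a minimal sketch with a hypothetical name. It tests only whether two decoding lengths are coprime; the input-dominated scaling condition and the rest of the paper's criterion are not checked here.

```python
from math import gcd


def decode_lengths_coprime(len_a: int, len_b: int) -> bool:
    """Return True if the two decoding lengths share no common factor above 1."""
    if len_a <= 0 or len_b <= 0:
        raise ValueError("decoding lengths must be positive")
    return gcd(len_a, len_b) == 1


if __name__ == "__main__":
    print(decode_lengths_coprime(127, 256))  # lengths with no shared factor
    print(decode_lengths_coprime(128, 256))  # lengths sharing the factor 128
```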

## [Practical Recommendations] Design Principles for LLM Inference Scheduling

Based on the theoretical analysis, the following scheduling principles for maintaining high throughput are derived:
1. **Avoid homogeneous batches**: Avoid placing requests with exactly the same input/output lengths in the same batch;
2. **Leverage length diversity**: Introduce output length diversity during scheduling; even when inputs are identical, this can improve stability;
3. **Beware of synchronization patterns**: Monitor periodic throughput fluctuations and adjust batch composition promptly;
4. **Dynamic memory budget**: Reserve a safety margin rather than targeting 100% memory utilization, in order to reduce eviction costs.
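
The memory budget principle can be sketched as an admission check that keeps part of the capacity in reserve for KV cache growth. The function name, the token-based accounting, and the idea of a fixed reserve are assumptions made for illustration; the source does not say how large the margin should be.

```python
def can_admit(used_tokens: int, prompt_len: int, capacity_tokens: int,
              reserved_tokens: int) -> bool:
    """Admit a request only if it fits below capacity minus a growth reserve."""
    if not 0 <= reserved_tokens <= capacity_tokens:
        raise ValueError("reserved_tokens must lie between 0 and capacity_tokens")
    # Headroom left for already-admitted requests to keep decoding.
    budget = capacity_tokens - reserved_tokens
    return used_tokens + prompt_len <= budget


if __name__ == "__main__":
    # The same request is admitted without a reserve but rejected with one.
    print(can_admit(850, 100, capacity_tokens=1000, reserved_tokens=0))
    print(can_admit(850, 100, capacity_tokens=1000, reserved_tokens=100))
```

A fixed reserve is the simplest choice. A scheduler could also size the reserve from the expected remaining decode lengths of the active batch, at the cost of needing length estimates.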

## [Correlation Analysis] Relationship with Existing LLM Inference Optimization Directions

- **vLLM's PagedAttention**: Reduces memory fragmentation but cannot solve the capacity pressure from endogenous growth;
- **Speculative Decoding**: Accelerates generation but increases the growth rate of the KV cache, since several tokens can be accepted per decoding step;
- **Continuous Batching**: Dynamically adding requests may introduce new synchronization patterns, requiring careful design;
- **KV Cache Compression/Quantization**: Reduces memory usage per request and delays capacity pressure, but does not change the endogenous growth dynamics.

## [Industry Insights] Operational Insights for LLM Service Providers

- **Performance degradation cause**: A throughput drop during peak hours may stem from service-induced congestion rather than from the model itself;
- **Capacity planning**: The simple calculation "memory / memory per request = concurrency" is insufficient, because the time dynamics of KV cache growth must be considered;
- **Scheduling priority**: Scheduling should account for the effect of length diversity on stability, rather than relying only on FCFS (First-Come-First-Served) or shortest-job-first;
- **Monitoring expansion**: Monitor dynamic indicators such as eviction frequency and KV cache growth rate, in addition to average latency and throughput.
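
The capacity planning and monitoring points can be made concrete with a few helper functions. This is a sketch with hypothetical names and token-based accounting. It assumes that "memory per request" in the naive formula means the footprint at admission (the prompt only), and that a peak-safe plan budgets each request at prompt plus full decode length.

```python
def naive_concurrency(capacity_tokens: int, prompt_len: int) -> int:
    """Static estimate: capacity divided by the footprint at admission."""
    return capacity_tokens // prompt_len


def peak_safe_concurrency(capacity_tokens: int, prompt_len: int,
                          decode_len: int) -> int:
    """Estimate that budgets every request at its final KV cache size."""
    return capacity_tokens // (prompt_len + decode_len)


def kv_growth_rate(samples: list) -> float:
    """Average KV cache growth per step over evenly spaced usage samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def eviction_rate(evictions: int, admissions: int) -> float:
    """Fraction of admitted requests that were later evicted."""
    return evictions / admissions if admissions else 0.0


if __name__ == "__main__":
    print(naive_concurrency(100_000, 1000))
    print(peak_safe_concurrency(100_000, 1000, 500))
    print(kv_growth_rate([1000, 1016, 1032]))
    print(eviction_rate(5, 20))
```

With these example numbers, the static estimate allows 100 concurrent requests while the peak-safe estimate allows 66. The gap is the memory that the admitted requests will claim while decoding.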