Performance Challenges of Large Model Inference
With the growth in parameter scale of Large Language Models (LLMs), performance optimization of inference services has become a core topic in AI infrastructure. Traditional monolithic inference approaches face two major bottlenecks:
- Low computational resource utilization: The Prefill (prompt processing) and Decode (token generation) stages have distinct computational characteristics, so handling them uniformly leads to a resource mismatch.
- Difficulty balancing latency and throughput: Optimizing Time To First Token (TTFT) and overall throughput often pull in opposite directions.
The LLM-D (LLM Disaggregated Serving) architecture emerged to address these issues: by separating the Prefill and Decode stages and pairing them with intelligent scheduling strategies, it uses hardware resources more efficiently.
Project Overview
This project systematically tested and validated key features of LLM-D on the NVIDIA GH200 (Grace Hopper Superchip) platform, including:
Tested Technical Features
Aggregated Inference:
- Prefix-Cache Routing
- Queue-Depth Balancing
- HPA (Horizontal Pod Autoscaler) Auto-Scaling
P/D Disaggregated Inference (Prefill/Decode):
- NIXL-based KV Cache Transfer
- Time-Slice GPU Scheduling
Hardware Platform
The NVIDIA GH200 served as the test hardware; its key features include:
- Grace CPU + Hopper GPU Unified Architecture: High-bandwidth memory sharing, extremely low CPU-GPU communication latency.
- HBM3 High-Bandwidth Memory: Supports efficient inference of large models.
- Transformer Engine: Hardware-level acceleration to improve inference throughput.
- NVLink-C2C: Ultra-high-bandwidth 900 GB/s interconnect between CPU and GPU.
Aggregated Inference Technology Details
Prefix-Cache Routing
Prefix caching is a key technique for improving efficiency in multi-turn dialogue and batched inference:
Working Principle:
- Store KV caches of processed prompts in a Trie structure.
- When a new request arrives, match the longest common prefix.
- Reuse the matched KV cache and only compute the new part.
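The three steps above can be sketched with a token-level trie. This is a minimal illustration, not llm-d's actual implementation; real systems typically key the trie on fixed-size token blocks rather than individual tokens, and the `kv_handle` here stands in for a reference to cached KV blocks.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # token id -> TrieNode
        self.kv_handle = None   # reference to cached KV blocks, if any

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, kv_handle):
        """Register the KV cache produced for a processed prompt."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens):
        """Walk the trie and return (matched_length, kv_handle)
        for the longest cached prefix of the new request."""
        node, best_len, best_handle = self.root, 0, None
        for i, tok in enumerate(tokens):
            node = node.children.get(tok)
            if node is None:
                break
            if node.kv_handle is not None:
                best_len, best_handle = i + 1, node.kv_handle
        return best_len, best_handle
```

A request that shares the first four tokens with a cached prompt would reuse that KV cache and run Prefill only on the remaining suffix.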
Performance Benefits:
- Multi-turn dialogue scenarios: Subsequent round latency reduced by 50-80%.
- Batch similar requests: Shared prefixes are computed only once.
- Overall system throughput improvement: Reduces redundant computation and increases GPU utilization.
Implementation Challenges:
- Cache management strategy: Eviction algorithm when memory is limited.
- Routing decision overhead: Trade-off between fast matching and precise matching.
- Distributed consistency: Cache synchronization between multiple instances.
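For the cache-management challenge, a common baseline is LRU eviction. The sketch below assumes each entry reports its size in KV blocks; the class and method names are illustrative, not taken from any llm-d component.

```python
from collections import OrderedDict

class LRUKVCache:
    """Evict least-recently-used KV entries when the block budget is exceeded."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.used = 0
        self.entries = OrderedDict()   # prompt key -> size in blocks

    def touch(self, key):
        """Mark an entry as recently used on a cache hit."""
        self.entries.move_to_end(key)

    def add(self, key, size_blocks):
        """Insert an entry, evicting the oldest ones until it fits."""
        while self.used + size_blocks > self.capacity and self.entries:
            _, freed = self.entries.popitem(last=False)  # oldest first
            self.used -= freed
        self.entries[key] = size_blocks
        self.used += size_blocks
```

Production systems refine this with recency-plus-frequency scoring or prefix-aware policies, but the memory-pressure trade-off is the same.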
Queue-Depth Balancing
Queue management directly affects user experience and system efficiency:
Core Strategies:
- Dynamic batching: Adjust batch size based on queue length and request characteristics.
- Priority scheduling: Distinguish between real-time interactive requests and background batch requests.
- Load balancing: Intelligently distribute requests among multiple inference instances.
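Two of these strategies can be sketched in a few lines: route each request to the instance with the shallowest queue, and grow the batch size with queue depth. The thresholds are illustrative assumptions, not llm-d defaults.

```python
def pick_instance(queue_depths):
    """Route a new request to the inference instance with the shortest queue."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])

def batch_size(queue_len, min_batch=1, max_batch=32):
    """Dynamic batching: grow the batch under load to raise throughput,
    keep it small when the queue is short so interactive requests see
    low latency."""
    return max(min_batch, min(max_batch, queue_len))
```

A real scheduler would also weigh request characteristics (prompt length, priority class) rather than queue depth alone.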
Key Metrics:
- P99 latency control: Keep response times predictable for the vast majority of requests.
- Throughput maximization: Keep the GPU saturated under high load.
- Fairness guarantee: Avoid long requests starving short ones.
HPA Auto-Scaling
Horizontal auto-scaling is a standard capability for cloud-native inference services:
Trigger Conditions:
- Based on GPU utilization thresholds.
- Based on queue depth and waiting time.
- Based on custom business metrics (e.g., QPS, latency SLO).
Scaling Strategies:
- Rapid scaling: Respond to traffic bursts to ensure service quality.
- Gradual scaling down: Avoid oscillations and maintain resource stability.
- Warm-up mechanism: New instances load models before receiving traffic.
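The scaling decision itself follows the standard Kubernetes HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies it to a queue-depth metric and adds min/max bounds; using queue depth as the metric is an assumption for illustration.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=16):
    """Kubernetes HPA scaling rule with replica bounds.

    current_metric could be per-pod queue depth, GPU utilization,
    or a custom metric such as QPS against a latency SLO.
    """
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas each seeing a queue depth of 100 against a target of 50 would scale to 8 replicas; in practice a stabilization window damps scale-down to avoid the oscillations noted above.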
P/D Disaggregated Inference Architecture
Why Disaggregation Is Needed
The Prefill and Decode stages have distinct computational characteristics:
| Feature | Prefill Stage | Decode Stage |