Zing Forum

LLM-D Lambda Deployment Practice: Performance Testing of Aggregated Inference and Disaggregated Inference on NVIDIA GH200

This project conducted comprehensive tests on the aggregated inference and Prefill/Decode disaggregated inference features of LLM-D on the NVIDIA GH200 platform, covering key technologies such as prefix cache routing, queue depth balancing, HPA auto-scaling, and NIXL-based KV transmission.

Tags: LLM Inference Optimization · Prefill/Decode Disaggregation · NIXL · NVIDIA GH200 · Prefix Cache · Auto-Scaling · GPU Inference · Large Model Deployment · vLLM · Aggregated Inference
Published 2026-04-21 06:41 · Recent activity 2026-04-21 06:54 · Estimated read: 18 min

Performance Challenges of Large Model Inference

With the growth in parameter scale of Large Language Models (LLMs), performance optimization of inference services has become a core topic in AI infrastructure. Traditional monolithic inference approaches face two major bottlenecks:

  1. Low computational resource utilization: The Prefill (prompt processing) and Decode (token generation) stages have distinct computational characteristics, leading to resource mismatch when handled uniformly.
  2. Difficulty in balancing latency and throughput: optimizing Time To First Token (TTFT) and optimizing overall throughput often conflict with each other.

The LLM-D (LLM Disaggregated Serving) architecture emerged to address these issues: by separating the Prefill and Decode stages and combining that separation with intelligent scheduling strategies, it achieves more efficient resource utilization at the hardware level.

Project Overview

This project systematically tested and validated key features of LLM-D on the NVIDIA GH200 (Grace Hopper Superchip) platform, including:

Tested Technical Features

  1. Aggregated Inference:

    • Prefix-Cache Routing
    • Queue-Depth Balancing
    • HPA (Horizontal Pod Autoscaler) Auto-Scaling
  2. P/D Disaggregated Inference (Prefill/Decode):

    • NIXL-based KV Cache Transmission
    • Time-Slice GPU Scheduling

Hardware Platform

NVIDIA GH200 is the core hardware for testing, with features including:

  • Grace CPU + Hopper GPU Unified Architecture: High-bandwidth memory sharing, extremely low CPU-GPU communication latency.
  • HBM3 High-Bandwidth Memory: Supports efficient inference of large models.
  • Transformer Engine: Hardware-level acceleration to improve inference throughput.
  • NVLink-C2C: Ultra-high bandwidth interconnection of 900GB/s between CPU and GPU.

Aggregated Inference Technology Details

Prefix-Cache Routing

Prefix cache is a key technology to improve efficiency in multi-turn dialogue and batch inference:

Working Principle:

  • Store KV caches of processed prompts in a Trie structure.
  • When a new request arrives, match the longest common prefix.
  • Reuse the matched KV cache and only compute the new part.
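The lookup described above can be sketched with a block-granular trie. The class names, block size, and insert/lookup API below are illustrative assumptions, not LLM-D's actual implementation:

```python
# Illustrative sketch of trie-based prefix matching for KV-cache routing.
# The block size and all names are assumptions, not LLM-D's real API.

BLOCK = 4  # tokens per cached KV block (illustrative)

class TrieNode:
    def __init__(self):
        self.children = {}   # token-block tuple -> TrieNode
        self.worker = None   # worker holding the KV cache for this prefix

class PrefixRouter:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, worker):
        """Record that `worker` holds KV cache for this token prefix."""
        node = self.root
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            key = tuple(tokens[i:i + BLOCK])
            node = node.children.setdefault(key, TrieNode())
            node.worker = worker

    def longest_prefix(self, tokens):
        """Return (matched_token_count, worker) for the longest cached prefix."""
        node, matched, worker = self.root, 0, None
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            key = tuple(tokens[i:i + BLOCK])
            if key not in node.children:
                break
            node = node.children[key]
            matched += BLOCK
            worker = node.worker
        return matched, worker
```

A request whose first 8 tokens were already served is routed to the worker that holds that cache, and only the tail past the match needs fresh Prefill computation.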

Performance Benefits:

  • Multi-turn dialogue scenarios: Subsequent round latency reduced by 50-80%.
  • Batch similar requests: Shared prefixes are computed only once.
  • Overall system throughput improvement: Reduces redundant computation and increases GPU utilization.

Implementation Challenges:

  • Cache management strategy: Eviction algorithm when memory is limited.
  • Routing decision overhead: Trade-off between fast matching and precise matching.
  • Distributed consistency: Cache synchronization between multiple instances.

Queue-Depth Balancing

Queue management directly affects user experience and system efficiency:

Core Strategies:

  • Dynamic batching: Adjust batch size based on queue length and request characteristics.
  • Priority scheduling: Distinguish between real-time interactive requests and background batch requests.
  • Load balancing: Intelligently distribute requests among multiple inference instances.
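A minimal sketch of the first and third strategies combined (least-queue-depth routing plus priority-aware dynamic batching); the request format, priority convention, and batch limit are assumptions for illustration:

```python
# Illustrative queue-depth balancer: route each request to the replica with
# the shallowest queue, and drain priority-sorted batches from each queue.
# The max_batch limit and request dicts are assumptions, not LLM-D's API.

from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue: list = field(default_factory=list)  # pending requests

    @property
    def depth(self):
        return len(self.queue)

class QueueBalancer:
    def __init__(self, replicas, max_batch=8):
        self.replicas = replicas
        self.max_batch = max_batch

    def route(self, request):
        """Send the request to the replica with the shallowest queue."""
        target = min(self.replicas, key=lambda r: r.depth)
        target.queue.append(request)
        return target.name

    def next_batch(self, replica):
        """Drain up to max_batch requests, interactive (priority 0) first."""
        replica.queue.sort(key=lambda req: req.get("priority", 1))
        batch = replica.queue[:self.max_batch]
        replica.queue = replica.queue[self.max_batch:]
        return batch
```

Real systems refine this with token-count-aware batch sizing and starvation guards, but the core decision, shallowest queue wins and interactive requests jump the batch, is the one described above.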

Key Metrics:

  • P99 latency control: Ensure response time of most requests is predictable.
  • Maximize throughput: Keep GPU saturated under high load.
  • Fairness guarantee: Avoid long requests starving short ones.

HPA Auto-Scaling

Horizontal auto-scaling is a standard capability for cloud-native inference services:

Trigger Conditions:

  • Based on GPU utilization thresholds.
  • Based on queue depth and waiting time.
  • Based on custom business metrics (e.g., QPS, latency SLO).

Scaling Strategies:

  • Rapid scaling: Respond to traffic bursts to ensure service quality.
  • Gradual scaling down: Avoid oscillations and maintain resource stability.
  • Warm-up mechanism: New instances load models before receiving traffic.
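On Kubernetes, the trigger conditions and scaling behavior described above map onto an `autoscaling/v2` manifest roughly like the following. The deployment name, custom metric name, and all numeric values are illustrative assumptions, not the project's actual configuration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-decode-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-decode                # hypothetical inference deployment
  minReplicas: 2
  maxReplicas: 16
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth # illustrative custom metric
      target:
        type: AverageValue
        averageValue: "8"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react immediately to traffic bursts
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300  # scale down gradually to avoid oscillation
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
```

The asymmetric `behavior` block encodes the rapid-up / gradual-down strategy; warm-up is typically handled separately via readiness probes that only pass once the model is loaded.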

P/D Disaggregated Inference Architecture

Why Separation Is Needed

The Prefill and Decode stages have distinct computational characteristics:

| Feature | Prefill Stage | Decode Stage |
|------|-------------|------------|
| Computation Mode | Compute-intensive | Memory bandwidth-intensive |
| Parallelism | High (fully parallelizable) | Low (autoregressive, serial) |
| Memory Access | Predictable | Random access to KV cache |
| Batching Efficiency | Linear with sequence length | Related to batch size |
| Optimal Hardware | High-compute GPU | High-bandwidth memory |

The disaggregated architecture allows optimized resource configuration for each stage, avoiding efficiency losses from one-size-fits-all approaches.

NIXL KV Transmission Mechanism

NIXL (NVIDIA Inference Xfer Library) is a high-performance data-transfer library developed by NVIDIA, designed specifically for moving KV caches and other inference data in disaggregated serving:

Technical Features:

  • Zero-copy transmission: Uses GPUDirect RDMA to avoid CPU intermediation.
  • Low latency: Microsecond-level KV cache transmission latency.
  • High throughput: Supports fast migration of large-scale KV caches.
  • Reliability: Built-in error detection and retransmission mechanisms.

Workflow:

  1. Prefill node completes prompt processing and generates KV cache.
  2. Transmit KV cache to Decode node via NIXL.
  3. Decode node starts autoregressive generation based on received KV cache.
  4. Overlap transmission and computation to minimize pipeline bubbles.
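The four steps can be modeled as a producer/consumer pipeline in which KV blocks are streamed layer by layer, so transmission overlaps the remaining Prefill work and Decode starts as soon as the last block lands. All names here are illustrative; this models the dataflow, not the real NIXL API:

```python
# Schematic of the P/D handoff: the prefill worker streams per-layer KV
# blocks through a channel while later layers are still being computed;
# the decode worker consumes them and starts generation once complete.
# Everything here is an illustrative stand-in for the real transport.

import queue
import threading

NUM_LAYERS = 4

def prefill(kv_channel):
    for layer in range(NUM_LAYERS):
        kv = f"kv-layer-{layer}"   # stand-in for this layer's KV cache
        kv_channel.put(kv)         # step 2: stream the layer as soon as it is ready
    kv_channel.put(None)           # signal: prefill complete

def decode(kv_channel, out):
    received = []
    while (kv := kv_channel.get()) is not None:
        received.append(kv)        # step 3: accumulate KV as it arrives
    out.append(f"decoding with {len(received)} KV layers")  # step 4: generate

kv_channel = queue.Queue()
out = []
t_p = threading.Thread(target=prefill, args=(kv_channel,))
t_d = threading.Thread(target=decode, args=(kv_channel, out))
t_p.start(); t_d.start(); t_p.join(); t_d.join()
print(out[0])  # decoding with 4 KV layers
```

In the real system the channel is a GPUDirect RDMA path rather than an in-process queue, but the overlap structure, producer streaming while the consumer accumulates, is the same.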

Time-Slice GPU Scheduling

On GH200, time-slice scheduling further improves resource utilization:

  • Multi-tenant sharing: A single GPU serves multiple models or requests in time slices.
  • Preemptive scheduling: High-priority requests can interrupt low-priority tasks.
  • Fast context switching: Leverages Hopper architecture's context switching acceleration.
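A toy sketch of the preemptive time-slice scheduling described above; the priority scheme and per-slice accounting are assumptions for illustration:

```python
# Illustrative time-slice scheduler: each tick, the highest-priority runnable
# task gets the GPU, so a newly arrived high-priority request preempts the
# remaining slices of a low-priority one. Names and units are assumptions.

import heapq

class TimeSliceScheduler:
    def __init__(self):
        self._heap = []   # (priority, seq, task); lower number = higher priority
        self._seq = 0

    def submit(self, task, priority):
        heapq.heappush(self._heap, (priority, self._seq, task))
        self._seq += 1

    def run(self, ticks):
        """Run `ticks` time slices; return which task ran in each slice."""
        timeline = []
        for _ in range(ticks):
            if not self._heap:
                break
            prio, seq, task = heapq.heappop(self._heap)
            timeline.append(task["name"])
            task["work"] -= 1                  # one slice of progress
            if task["work"] > 0:               # context-switch back in later
                heapq.heappush(self._heap, (prio, seq, task))
        return timeline

sched = TimeSliceScheduler()
sched.submit({"name": "batch-job", "work": 3}, priority=1)
timeline = sched.run(1)                  # batch job starts alone
sched.submit({"name": "interactive", "work": 1}, priority=0)
timeline += sched.run(3)                 # interactive request preempts
print(timeline)  # ['batch-job', 'interactive', 'batch-job', 'batch-job']
```

The value of Hopper's fast context switching is precisely that the "switch back in later" step above becomes cheap enough to do at fine granularity.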

Test Methods and Result Analysis

Test Workloads

The project designed multiple typical scenarios for testing:

  1. Interactive dialogue: Short prompts, multi-turn, low latency requirements.
  2. Long document processing: Long context, heavy single Prefill, light Decode.
  3. Batch generation: High throughput, acceptable higher latency.
  4. Mixed load: Simulates request distribution in real production environments.
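One way to synthesize such a workload is to sample requests from the first three scenarios with fixed weights, which also yields the mixed load of scenario 4. The weights and token lengths below are assumptions for illustration, not the project's actual test parameters:

```python
# Illustrative mixed-workload generator for the scenarios above.
# Mix ratios and token lengths are assumptions, not the real test setup.

import random

SCENARIOS = {                      # (prompt_tokens, output_tokens, mix_weight)
    "interactive_dialogue": (128, 64, 0.5),
    "long_document":        (8192, 128, 0.2),
    "batch_generation":     (256, 1024, 0.3),
}

def sample_requests(n, seed=0):
    """Draw n requests from the weighted scenario mix (scenario 4)."""
    rng = random.Random(seed)
    names = list(SCENARIOS)
    weights = [SCENARIOS[s][2] for s in names]
    reqs = []
    for _ in range(n):
        name = rng.choices(names, weights=weights)[0]
        prompt, output, _ = SCENARIOS[name]
        reqs.append({"scenario": name,
                     "prompt_tokens": prompt,
                     "output_tokens": output})
    return reqs
```

Fixing the seed keeps runs reproducible, so the same request trace can be replayed against both the aggregated and the disaggregated deployment.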

Key Performance Metrics

| Metric | Description | Optimization Goal |