Performance Challenges of Large Model Inference
With the growth in parameter scale of Large Language Models (LLMs), performance optimization of inference services has become a core topic in AI infrastructure. Traditional monolithic inference approaches face two major bottlenecks:
- Low computational resource utilization: The Prefill (prompt processing) and Decode (token generation) stages have distinct computational characteristics, so handling them uniformly leads to a resource mismatch.
- Difficulty balancing latency and throughput: Optimizing Time To First Token (TTFT) and overall throughput often pull in opposite directions.
The LLM-D (LLM Disaggregated Serving) architecture emerged to address these issues: by separating the Prefill and Decode stages and pairing them with intelligent scheduling strategies, it uses hardware resources more efficiently.
Project Overview
This project systematically tested and validated key features of LLM-D on the NVIDIA GH200 (Grace Hopper Superchip) platform, including:
Tested Technical Features
Aggregated Inference:
- Prefix-Cache Routing
- Queue-Depth Balancing
- HPA (Horizontal Pod Autoscaler) Auto-Scaling
P/D Disaggregated Inference (Prefill/Decode):
- NIXL-based KV Cache Transfer
- Time-Slice GPU Scheduling
Hardware Platform
The NVIDIA GH200 served as the test hardware; its key features include:
- Grace CPU + Hopper GPU Unified Architecture: High-bandwidth memory sharing, extremely low CPU-GPU communication latency.
- HBM3 High-Bandwidth Memory: Supports efficient inference of large models.
- Transformer Engine: Hardware-level acceleration to improve inference throughput.
- NVLink-C2C: Ultra-high-bandwidth 900 GB/s interconnect between CPU and GPU.
Aggregated Inference Technology Details
Prefix-Cache Routing
Prefix caching is a key technique for improving efficiency in multi-turn dialogue and batched inference:
Working Principle:
- Store KV caches of processed prompts in a Trie structure.
- When a new request arrives, match the longest common prefix.
- Reuse the matched KV cache and only compute the new part.
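The three steps above can be sketched with a token-level trie. This is a minimal illustration, not llm-d's actual implementation; real systems typically key the trie on fixed-size token blocks rather than individual tokens, and the `kv_handle` here stands in for a reference to cached KV blocks.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # token id -> TrieNode
        self.kv_handle = None   # reference to cached KV blocks, if any

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, kv_handle):
        """Register the KV cache produced for a processed prompt."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens):
        """Walk the trie and return (matched_length, kv_handle)
        for the longest cached prefix of the new request."""
        node, best_len, best_handle = self.root, 0, None
        for i, tok in enumerate(tokens):
            node = node.children.get(tok)
            if node is None:
                break
            if node.kv_handle is not None:
                best_len, best_handle = i + 1, node.kv_handle
        return best_len, best_handle
```

A request that shares the first four tokens with a cached prompt would reuse that KV cache and run Prefill only on the remaining suffix.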
Performance Benefits:
- Multi-turn dialogue scenarios: Subsequent round latency reduced by 50-80%.
- Batch similar requests: Shared prefixes are computed only once.
- Overall system throughput improvement: Reduces redundant computation and increases GPU utilization.
Implementation Challenges:
- Cache management strategy: Eviction algorithm when memory is limited.
- Routing decision overhead: Trade-off between fast matching and precise matching.
- Distributed consistency: Cache synchronization between multiple instances.
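For the cache-management challenge, a common baseline is LRU eviction. The sketch below assumes each entry reports its size in KV blocks; the class and method names are illustrative, not taken from any llm-d component.

```python
from collections import OrderedDict

class LRUKVCache:
    """Evict least-recently-used KV entries when the block budget is exceeded."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.used = 0
        self.entries = OrderedDict()   # prompt key -> size in blocks

    def touch(self, key):
        """Mark an entry as recently used on a cache hit."""
        self.entries.move_to_end(key)

    def add(self, key, size_blocks):
        """Insert an entry, evicting the oldest ones until it fits."""
        while self.used + size_blocks > self.capacity and self.entries:
            _, freed = self.entries.popitem(last=False)  # oldest first
            self.used -= freed
        self.entries[key] = size_blocks
        self.used += size_blocks
```

Production systems refine this with recency-plus-frequency scoring or prefix-aware policies, but the memory-pressure trade-off is the same.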
Queue-Depth Balancing
Queue management directly affects user experience and system efficiency:
Core Strategies:
- Dynamic batching: Adjust batch size based on queue length and request characteristics.
- Priority scheduling: Distinguish between real-time interactive requests and background batch requests.
- Load balancing: Intelligently distribute requests among multiple inference instances.
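Two of these strategies can be sketched in a few lines: route each request to the instance with the shallowest queue, and grow the batch size with queue depth. The thresholds are illustrative assumptions, not llm-d defaults.

```python
def pick_instance(queue_depths):
    """Route a new request to the inference instance with the shortest queue."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])

def batch_size(queue_len, min_batch=1, max_batch=32):
    """Dynamic batching: grow the batch under load to raise throughput,
    keep it small when the queue is short so interactive requests see
    low latency."""
    return max(min_batch, min(max_batch, queue_len))
```

A real scheduler would also weigh request characteristics (prompt length, priority class) rather than queue depth alone.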
Key Metrics:
- P99 latency control: Keep response times predictable for the vast majority of requests.
- Throughput maximization: Keep the GPU saturated under high load.
- Fairness guarantee: Avoid long requests starving short ones.
HPA Auto-Scaling
Horizontal auto-scaling is a standard capability for cloud-native inference services:
Trigger Conditions:
- Based on GPU utilization thresholds.
- Based on queue depth and waiting time.
- Based on custom business metrics (e.g., QPS, latency SLO).
Scaling Strategies:
- Rapid scaling: Respond to traffic bursts to ensure service quality.
- Gradual scaling down: Avoid oscillations and maintain resource stability.
- Warm-up mechanism: New instances load models before receiving traffic.
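The scaling decision itself follows the standard Kubernetes HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies it to a queue-depth metric and adds min/max bounds; using queue depth as the metric is an assumption for illustration.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=16):
    """Kubernetes HPA scaling rule with replica bounds.

    current_metric could be per-pod queue depth, GPU utilization,
    or a custom metric such as QPS against a latency SLO.
    """
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas each seeing a queue depth of 100 against a target of 50 would scale to 8 replicas; in practice a stabilization window damps scale-down to avoid the oscillations noted above.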
P/D Disaggregated Inference Architecture
Why Disaggregation Is Needed
The Prefill and Decode stages have distinct computational characteristics:
| Feature | Prefill Stage | Decode Stage |