Zing Forum

DWDP: A Distributed Weight Data Parallel Inference Scheme to Break Synchronization Bottlenecks, Achieving 8.8% Throughput Improvement on GB200 NVL72

DWDP enables GPUs to advance inference independently by fetching expert weights on demand and eliminating inter-layer synchronization, achieving an 8.8% end-to-end throughput improvement in DeepSeek-R1 deployment.

Tags: LLM inference · MoE · data parallelism · GB200 NVL72 · TensorRT-LLM · DeepSeek · GPU optimization · distributed inference
Published 2026-04-02 13:00 · Recent activity 2026-04-03 09:18 · Estimated read: 7 min

Section 01

DWDP: Distributed Weight Data Parallelism for Breaking Sync Bottlenecks in LLM Inference

DWDP (Distributed Weight Data Parallelism) is a new inference parallelization strategy targeting Mixture-of-Experts (MoE) large language models (LLMs). It eliminates inter-layer synchronization by leveraging MoE's sparse expert activation—storing expert weights across GPUs and fetching them on demand, allowing each GPU to progress independently. This approach achieves an 8.8% end-to-end throughput improvement on DeepSeek-R1 deployed on the GB200 NVL72 platform using TensorRT-LLM, without sacrificing latency. Below is a detailed breakdown of the scheme, its implementation, and implications.


Section 02

The Synchronization Dilemma in Multi-GPU LLM Inference

Multi-GPU collaboration is essential for LLM inference, but traditional parallel strategies (tensor parallelism, pipeline parallelism) suffer from inter-layer synchronization. This forces all GPUs to wait for the slowest one (a straggler or "bucket" effect), especially when request lengths are uneven or some requests finish early. In high-concurrency scenarios, this wastes resources and hurts overall throughput and user experience.
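The cost of lock-step execution can be illustrated with a small simulation (the per-layer timings below are made up for illustration, not measured data): with inter-layer synchronization every layer pays for the slowest GPU, while desynchronized GPUs only pay for their own totals.

```python
def synchronized_time(per_gpu_layer_times):
    # Lock-step: at every layer, all GPUs wait for the slowest one.
    num_layers = len(per_gpu_layer_times[0])
    return sum(max(gpu[layer] for gpu in per_gpu_layer_times)
               for layer in range(num_layers))

def desynchronized_time(per_gpu_layer_times):
    # No inter-layer sync: each GPU runs at its own pace, so the
    # makespan is just the slowest GPU's own total.
    return max(sum(gpu) for gpu in per_gpu_layer_times)

# Made-up per-layer times for 3 GPUs x 4 layers with uneven load.
times = [
    [10, 12, 10, 11],
    [13, 10, 11, 10],
    [10, 11, 13, 10],
]
print(synchronized_time(times))    # 13 + 12 + 13 + 11 = 49
print(desynchronized_time(times))  # slowest GPU's own total = 44
```

The synchronized makespan is always at least as large as the desynchronized one, and the gap grows with load imbalance across GPUs.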


Section 03

DWDP's Core: Desynchronized Data Parallelism for MoE

DWDP's key insight: MoE requests only use a subset of experts, so full weight storage per GPU isn't needed. Its mechanism includes:

  1. Weight Sharding: MoE expert weights are split across GPUs, each holding a subset.
  2. On-demand Remote Fetch: When a GPU needs an expert not locally stored, it fetches via point-to-point communication.
  3. Independent Progress: Each GPU advances inference at its own pace, with no inter-layer synchronization.

This design removes collective sync overhead, boosting robustness and resource utilization.
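The three steps above can be sketched as a per-GPU expert store. This is a hypothetical illustration, assuming static modulo sharding and a caller-supplied `fetch_remote` transport; the article does not specify TensorRT-LLM's actual interfaces.

```python
class ExpertStore:
    """Per-GPU view of sharded MoE expert weights (illustrative sketch)."""

    def __init__(self, rank, world_size, all_weights):
        self.rank = rank
        self.world_size = world_size
        # Step 1, weight sharding: expert e is owned by GPU (e % world_size).
        self.local = {e: w for e, w in all_weights.items()
                      if e % world_size == rank}
        self.cache = {}  # previously fetched remote experts

    def owner_of(self, expert_id):
        return expert_id % self.world_size

    def get(self, expert_id, fetch_remote):
        # Step 2, on-demand remote fetch: only the experts the router
        # actually selected are pulled, via point-to-point communication,
        # so no collective sync is needed -- which is what lets each GPU
        # progress independently (step 3).
        if expert_id in self.local:
            return self.local[expert_id]
        if expert_id not in self.cache:
            self.cache[expert_id] = fetch_remote(self.owner_of(expert_id),
                                                 expert_id)
        return self.cache[expert_id]

# Toy usage: rank 0 of 2 GPUs, 4 experts with dummy "weights".
store = ExpertStore(rank=0, world_size=2,
                    all_weights={e: f"weights[{e}]" for e in range(4)})
print(sorted(store.local))                             # [0, 2]
print(store.get(1, lambda owner, e: f"weights[{e}]"))  # weights[1]
```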

Section 04

Engineering Optimizations to Realize DWDP's Potential

Two key challenges in implementation:

  • Shard Management Overhead: Frequent remote fetches add latency. DWDP uses fine-grained sharding and local caching, plus access-pattern analysis to predict which experts will be needed and stage them in advance.
  • Async Prefetch: To hide communication latency, DWDP prefetches next-layer expert weights asynchronously while the GPU computes the current layer—overlapping compute and communication to reduce impact on inference latency.
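The prefetch pattern can be sketched with a one-worker I/O executor that fetches layer i+1's experts while layer i computes. This is a minimal sketch; `fetch_experts` and `compute_layer` are placeholder callables standing in for real weight transfers and GPU kernels, not TensorRT-LLM APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layers(num_layers, fetch_experts, compute_layer):
    # While layer i computes, the I/O worker fetches layer i+1's expert
    # weights, hiding communication latency behind compute.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_experts, 0)  # prefetch the first layer
        outputs = []
        for layer in range(num_layers):
            weights = pending.result()  # blocks only if the fetch lagged
            if layer + 1 < num_layers:
                pending = io.submit(fetch_experts, layer + 1)
            outputs.append(compute_layer(layer, weights))
        return outputs

result = run_layers(3, fetch_experts=lambda l: f"w{l}",
                    compute_layer=lambda l, w: (l, w))
print(result)  # [(0, 'w0'), (1, 'w1'), (2, 'w2')]
```

If a fetch consistently takes longer than a layer's compute, `pending.result()` becomes the bottleneck again, which is why prefetch works best when per-layer compute time covers the point-to-point transfer.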

Section 05

TensorRT-LLM Implementation & Validation on GB200 NVL72

DWDP was implemented in NVIDIA's TensorRT-LLM framework and tested on GB200 NVL72 with DeepSeek-R1 (a large MoE model). Test config: input length=8K tokens, output length=1K tokens, service load=20-100 TPS/user. Results: DWDP improved end-to-end GPU throughput (TPS/GPU) by 8.8% compared to baseline, due to reduced sync wait and balanced GPU utilization.
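As a back-of-the-envelope illustration of what the relative gain buys (the 100 TPS/GPU baseline and 10,000 TPS load below are made-up numbers; the article reports only the 8.8% figure), the same throughput gain translates into proportionally fewer GPUs for a fixed load:

```python
def gpus_needed(load_tps, tps_per_gpu):
    # Ceiling division: whole GPUs needed to serve the load.
    return int(-(-load_tps // tps_per_gpu))

baseline_tps = 100               # hypothetical per-GPU throughput
dwdp_tps = baseline_tps * 1.088  # +8.8% from DWDP

print(round(dwdp_tps, 1))                 # 108.8
print(gpus_needed(10_000, baseline_tps))  # 100 GPUs at baseline
print(gpus_needed(10_000, dwdp_tps))      # 92 GPUs with DWDP
```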


Section 06

Why the 8.8% Throughput Gain Matters

The 8.8% improvement is notable for three reasons:

  1. High Baseline: TensorRT-LLM is already highly optimized—further gains are challenging.
  2. No Latency Trade-off: Unlike many optimizations, DWDP doesn't increase latency or memory usage.
  3. Scalability: As model size/GPU count grows, traditional sync schemes' load imbalance worsens—DWDP's desynchronized design will have even bigger advantages.

Section 07

Implications for MoE Inference & Future Research

DWDP provides new insights for MoE inference: it turns MoE's sparse activation into a parallelization advantage (instead of treating it as an exception). Future directions:

  • Dynamic Load Balancing: Adjust weight distribution based on real-time load to reduce implicit imbalance.
  • Combination with Other Parallelisms: Integrate with sequence/context parallelism for comprehensive gains.
  • Heterogeneous Hardware Support: Apply on-demand fetch to mixed GPU/CPU or cross-generation GPU deployments.

Section 08

Practical Considerations for Production Deployment

For teams deploying DWDP:

  • Network Topology: DWDP relies on efficient point-to-point communication, so it performs best on fully NVLink-connected setups; PCIe- or network-attached setups may need extra optimization.
  • Memory Management: Sharded weights increase complexity—careful caching and memory allocation are needed to avoid fragmentation or OOM.
  • Observability: Desynchronized execution requires new monitoring tools, since traditional performance analyzers assume GPUs advance in lock-step.
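One possible shape for such tooling is a per-rank progress tracker that exposes the layer skew between the fastest and slowest GPU, a metric lock-step profilers never needed. This is a hypothetical sketch, not an existing tool.

```python
class ProgressMonitor:
    """Tracks per-GPU layer progress under desynchronized execution."""

    def __init__(self, world_size):
        self.layer = [0] * world_size  # last reported layer per rank

    def report(self, rank, layer):
        self.layer[rank] = layer

    def skew(self):
        # Layers separating the fastest and slowest GPU right now;
        # a persistently large skew points at a straggler rank.
        return max(self.layer) - min(self.layer)

mon = ProgressMonitor(world_size=4)
for rank, layer in [(0, 12), (1, 9), (2, 11), (3, 10)]:
    mon.report(rank, layer)
print(mon.skew())  # 3
```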