DWDP: Distributed Weight Data Parallelism for Breaking Sync Bottlenecks in LLM Inference
DWDP (Distributed Weight Data Parallelism) is a new inference parallelization strategy for Mixture-of-Experts (MoE) large language models (LLMs). It eliminates inter-layer synchronization by exploiting MoE's sparse expert activation: expert weights are sharded across GPUs and fetched on demand, so each GPU can progress through the layers independently. On DeepSeek-R1 deployed on the GB200 NVL72 platform with TensorRT-LLM, this approach achieves an 8.8% end-to-end throughput improvement without increasing latency. Below is a detailed breakdown of the scheme, its implementation, and its implications.
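To make the core idea concrete, here is a minimal sketch of on-demand expert fetching under sparse activation. All names (`weight_store`, `fetch_expert`, `run_rank`, the modulo placement rule) are illustrative assumptions, not the TensorRT-LLM or DWDP API; threads stand in for GPU ranks, and a dictionary lookup stands in for a remote NVLink/RDMA read.

```python
# Hypothetical sketch of DWDP-style on-demand expert fetching.
# Expert weights are sharded across ranks; each rank pulls only the
# experts its tokens are routed to and advances with no inter-rank barrier.
from concurrent.futures import ThreadPoolExecutor

NUM_RANKS = 4
NUM_EXPERTS = 8

# Expert e's weights live on rank e % NUM_RANKS (illustrative placement).
weight_store = {e: f"weights_of_expert_{e}" for e in range(NUM_EXPERTS)}

def owner_rank(expert_id: int) -> int:
    return expert_id % NUM_RANKS

def fetch_expert(expert_id: int) -> str:
    # Stand-in for reading the owning rank's memory over NVLink/RDMA.
    return weight_store[expert_id]

def run_rank(rank: int, routed_experts: list) -> list:
    # Each rank fetches exactly the experts its router selected and
    # proceeds layer by layer without waiting on other ranks.
    return [fetch_expert(e) for e in routed_experts]

# Independent per-rank routing decisions (sparse activation): each rank's
# tokens activate only two of the eight experts, so no collective
# all-to-all exchange is required.
routing = {0: [1, 5], 1: [0, 2], 2: [3, 7], 3: [4, 6]}

with ThreadPoolExecutor(max_workers=NUM_RANKS) as pool:
    futures = {r: pool.submit(run_rank, r, routing[r]) for r in routing}
outputs = {r: f.result() for r, f in futures.items()}
```

The design point this illustrates: because routing is sparse and per-rank, a rank only touches the weights it needs, so nothing forces ranks to rendezvous at every MoE layer the way a tensor-parallel all-reduce would.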