Section 01
SiDP: A New Memory-Efficient Data Parallelism Paradigm for Offline Large Model Inference (Introduction)
SiDP is a new memory-efficient data parallelism paradigm for offline large model inference. Key points are as follows:
- Problem Solved: In offline inference, data parallelism (DP) replicates the weights on every GPU and consumes VRAM, while model parallelism (MP) adds synchronization that erodes flexibility
- Core Idea: Treat model weights as a bandwidth-backed shared resource, pooled and managed in a distributed manner within each data-parallel group
- Dual Execution Modes: Supports dynamic switching between Weight-as-a-Service (WaS) and Compute-as-a-Service (CaS)
- Performance Improvements: 1.8x increase in KV cache capacity and 1.5x improvement in end-to-end throughput on NVIDIA H20, H200, and B200 GPUs
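
To make the memory argument behind the core idea concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes, purely for illustration, that pooling leaves each GPU holding a 1/N shard of the weights plus a small staging buffer for weights fetched from peers. The function name, the staging buffer, and all numbers (a 96 GB GPU, a 60 GB model, a group of 4) are hypothetical and are not taken from the paper.

```python
def kv_cache_budget_gb(gpu_mem_gb, weight_gb, group_size, pooled, staging_gb=0.0):
    """Return the memory (GB) left for KV cache on one GPU.

    Hypothetical model: with plain DP every GPU stores the full weights;
    with pooling each GPU stores weight_gb / group_size plus a staging
    buffer for weights streamed in from other group members.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    if pooled:
        resident_gb = weight_gb / group_size + staging_gb
    else:
        resident_gb = weight_gb
    budget_gb = gpu_mem_gb - resident_gb
    if budget_gb < 0:
        raise ValueError("weights do not fit in GPU memory")
    return budget_gb


# Illustrative figures only (assumptions, not measurements from the paper).
replicated_gb = kv_cache_budget_gb(96, 60, group_size=4, pooled=False)
pooled_gb = kv_cache_budget_gb(96, 60, group_size=4, pooled=True, staging_gb=5)
capacity_ratio = pooled_gb / replicated_gb

print(f"replicated DP: {replicated_gb:.1f} GB for KV cache")
print(f"pooled weights: {pooled_gb:.1f} GB for KV cache")
print(f"capacity ratio: {capacity_ratio:.2f}x")
```

Under these made-up figures the pooled layout leaves roughly twice the KV cache budget per GPU. The 1.8x figure quoted above comes from the paper's own measurements, not from this arithmetic.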
Original Source: arXiv, May 27, 2026, Link: http://arxiv.org/abs/2605.28095v1