Section 01
【Introduction】NCCL Collective Communication Benchmark: Performance Analysis of H100 NVSwitch for Tensor Parallel LLM Inference
This project targets tensor parallel (TP) LLM inference. It systematically benchmarks the NCCL collective communication primitives all-reduce, all-gather, and reduce-scatter on a single host with 8× H100 GPUs connected through NVSwitch. The tests cover message sizes from 8 B to 8 GB and compare the NVLink SHARP, Ring, and Tree algorithms, giving quantitative reference points for optimizing communication in TP LLM inference.
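To make the sweep and the reported metric concrete, here is a minimal sketch of how such a benchmark could enumerate message sizes and convert a measured time into bandwidth. It is not the project's own code. It assumes a power-of-two sweep, reads "8 GB" as 8 GiB, and uses the algorithm-bandwidth and bus-bandwidth correction factors documented by the nccl-tests project. The function names `message_sizes` and `bus_bandwidth_gbs` are hypothetical.

```python
# Assumption: sizes double at each step, and "8 GB" is read as 8 GiB.
def message_sizes(min_bytes=8, max_bytes=8 * 1024**3, factor=2):
    """Return the list of message sizes (in bytes) for the sweep."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


# Bus-bandwidth correction factors as documented by nccl-tests,
# where n is the number of ranks taking part in the collective.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def bus_bandwidth_gbs(collective, size_bytes, time_s, n_ranks):
    """Convert a measured time into bus bandwidth in GB/s (1 GB = 1e9 bytes).

    size_bytes is the total data size of the operation, following the
    nccl-tests convention; time_s is the measured time per operation.
    """
    alg_bw = size_bytes / time_s / 1e9  # algorithm bandwidth
    return alg_bw * BUS_FACTORS[collective](n_ranks)


if __name__ == "__main__":
    sizes = message_sizes()
    print(len(sizes), sizes[0], sizes[-1])
    # Illustrative input only, not a measured result.
    print(bus_bandwidth_gbs("all_reduce", 1_000_000_000, 0.01, 8))
```

Bus bandwidth is useful here because it normalizes away the rank count, so results for the three primitives can be compared against the hardware link speed on the same scale.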