Section 01
[Introduction] SpenseGPT: Hybrid Sparse-Dense Pruning Achieves New Breakthrough in LLM Inference Acceleration on B200 GPUs
SpenseGPT proposes a hybrid sparse-dense weight format that retains key weights in selected dense regions and applies one-shot post-training pruning. It achieves a 1.2x end-to-end decoding speedup on B200 GPUs while maintaining model accuracy. Because the format is compatible with existing high-performance sparse and dense GEMM libraries, it requires no complex compiler support, which makes it a practical option for LLM inference deployment.
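To make the hybrid sparse-dense idea concrete, here is a minimal NumPy sketch, not SpenseGPT's actual algorithm. It rests on several assumptions that the summary above does not state: dense regions are whole column blocks, blocks are ranked by squared magnitude, and the remaining columns are pruned to a 2:4 pattern. The names `prune_2_4`, `hybrid_split`, and `hybrid_matmul` are hypothetical. The point it illustrates is that the product decomposes into one dense GEMM plus one sparse GEMM, which is why off-the-shelf libraries could serve each part.

```python
import numpy as np

def prune_2_4(w):
    """Zero the 2 smallest-magnitude entries in every group of 4 along each row.
    (2:4 sparsity is an assumption here, not a detail taken from the source.)"""
    rows, cols = w.shape
    assert cols % 4 == 0
    groups = w.reshape(rows, cols // 4, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

def hybrid_split(w, block=4, dense_fraction=0.25):
    """Split w (out_features x in_features) into dense and sparse column sets.
    Block scoring by squared magnitude is a hypothetical stand-in criterion."""
    rows, cols = w.shape
    assert cols % block == 0 and block % 4 == 0
    n_blocks = cols // block
    scores = (w.reshape(rows, n_blocks, block) ** 2).sum(axis=(0, 2))
    n_dense = int(round(n_blocks * dense_fraction))
    is_dense = np.zeros(n_blocks, dtype=bool)
    is_dense[np.argsort(scores)[::-1][:n_dense]] = True
    col_is_dense = np.repeat(is_dense, block)
    dense_cols = np.flatnonzero(col_is_dense)
    sparse_cols = np.flatnonzero(~col_is_dense)
    w_dense = w[:, dense_cols]               # kept unpruned
    w_sparse = prune_2_4(w[:, sparse_cols])  # pruned in one shot
    return dense_cols, w_dense, sparse_cols, w_sparse

def hybrid_matmul(x, dense_cols, w_dense, sparse_cols, w_sparse):
    """Compute W @ x as one dense product plus one sparse product.
    x has shape (in_features, batch)."""
    return w_dense @ x[dense_cols] + w_sparse @ x[sparse_cols]

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 32))
x = rng.standard_normal((32, 3))
dense_cols, w_dense, sparse_cols, w_sparse = hybrid_split(w)
y = hybrid_matmul(x, dense_cols, w_dense, sparse_cols, w_sparse)
```

On a GPU, the two products in `hybrid_matmul` would be dispatched to separate dense and sparse GEMM kernels; here both are plain NumPy matmuls, so the sketch shows only the decomposition, not any speedup.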