Zing Forum

SpenseGPT: Hybrid Sparse-Dense Pruning Achieves New Breakthrough in LLM Inference Acceleration on B200 GPUs

SpenseGPT proposes a hybrid sparse-dense weight format that keeps key weights in selected dense regions, achieving a 1.2x end-to-end decoding speedup on B200 GPUs while maintaining model accuracy.

Model pruning, Sparse computation, LLM inference, B200 GPU, Model compression, Semi-structured sparsity, Post-training optimization
Published 2026-06-09 13:48 | Recent activity 2026-06-10 10:59 | Estimated read 7 min

Section 01

[Introduction] SpenseGPT: Hybrid Sparse-Dense Pruning Achieves New Breakthrough in LLM Inference Acceleration on B200 GPUs

SpenseGPT proposes a hybrid sparse-dense format in which the weights that matter most are kept in selected dense regions and the rest are pruned to 2:4 sparsity. It applies this as a one-shot post-training pruning method and achieves a 1.2x end-to-end decoding speedup on B200 GPUs while maintaining model accuracy. The method is compatible with existing high-performance sparse and dense GEMM libraries and requires no complex compiler support, which makes it a practical option for LLM inference deployment.

Section 02

Real-World Dilemmas of Model Compression: Challenges of Sparse Pruning

As large language models grow, inference cost has become a barrier to deployment. Model pruning is one important remedy, but it faces several challenges:

  1. Cost of strict sparsity constraints: 2:4 sparsity keeps at most two non-zero weights in every group of four, which enforces a fixed 50% sparsity rate and easily leads to accuracy loss;
  2. Limitations of alternative solutions: Relaxed sparse formats either require specialized compilers or introduce runtime overhead;
  3. Difficulty in end-to-end acceleration: Even if sparse computation is faster, bottlenecks like memory bandwidth may offset the gains.
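To make the 2:4 constraint from point 1 concrete, here is a minimal NumPy sketch of plain magnitude-based 2:4 pruning. It is not SpenseGPT's own code: the function name `prune_2_4` is hypothetical, and grouping four consecutive weights along each row is an assumption made for illustration.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of four along each row."""
    rows, cols = weights.shape
    if cols % 4 != 0:
        raise ValueError("column count must be a multiple of 4")
    groups = weights.reshape(rows, cols // 4, 4)
    # Indices of the two smallest |w| inside each group of four.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
w_sparse = prune_2_4(w)
# Every group of four loses exactly two weights, so the rate is fixed at 50%.
sparsity = float(np.mean(w_sparse == 0.0))
```

The rate cannot be tuned: a group whose four weights all matter still loses two of them, which is where the accuracy loss comes from.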

Section 03

Core Methods of SpenseGPT: Hybrid Sparse-Dense Format and One-Shot Pruning

Spense Hybrid Sparse-Dense Format

The format divides each weight matrix into two regions:

  • 2:4 sparse region: Use hardware-accelerated sparse computation;
  • Dense region: Retain key weights and use standard dense computation.

Advantages: Flexible sparsity rate, compatibility with existing libraries, no input activation expansion.
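A minimal sketch of how a two-region layer could be executed as two GEMMs follows. It assumes the regions are split by output row, so both GEMMs read the same input; the source does not state the actual partition axis. The names `hybrid_matmul` and `effective_sparsity` are hypothetical, and the sparse product is an ordinary NumPy matmul standing in for a real 2:4 sparse kernel.

```python
import numpy as np

def effective_sparsity(dense_fraction: float) -> float:
    """Overall sparsity when a fraction of rows stays dense and the rest is 2:4 (50% zeros)."""
    return 0.5 * (1.0 - dense_fraction)

def hybrid_matmul(w_sparse, w_dense, sparse_rows, dense_rows, x):
    """Compute W @ x as one GEMM per region, then scatter results to their output rows."""
    out = np.empty((len(sparse_rows) + len(dense_rows), x.shape[1]), dtype=x.dtype)
    out[sparse_rows] = w_sparse @ x  # stand-in for a 2:4 sparse GEMM kernel
    out[dense_rows] = w_dense @ x    # standard dense GEMM
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
dense_rows = np.array([1, 5])
sparse_rows = np.array([0, 2, 3, 4, 6, 7])
# Apply a fixed 2:4 pattern to the sparse rows (keep positions 0 and 1 of every four).
pattern = np.tile([1.0, 1.0, 0.0, 0.0], 2)
w[sparse_rows] *= pattern

x = rng.standard_normal((8, 3))
y = hybrid_matmul(w[sparse_rows], w[dense_rows], sparse_rows, dense_rows, x)
```

With 2 of 8 rows kept dense, the overall sparsity drops from 50% to `effective_sparsity(0.25)`, which is 37.5%. This is the sense in which the sparsity rate becomes flexible while each GEMM still uses a standard format.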

Dense Region Selection Strategy

  1. Importance heuristic: Select the most important weights based on indicators like weight magnitude;
  2. Pattern structuring: Select rows/columns that have a significant impact on performance.
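One way to combine the two ideas above is to score whole rows by how much weight magnitude 2:4 pruning would remove from them, then keep the highest-scoring rows dense. This is a sketch under assumptions, not the published selection rule: the score, the row granularity, and the names `pruning_loss_per_row` and `select_dense_rows` are all hypothetical.

```python
import numpy as np

def pruning_loss_per_row(weights: np.ndarray) -> np.ndarray:
    """Squared magnitude that 2:4 pruning would remove from each row."""
    rows, cols = weights.shape
    groups = np.abs(weights).reshape(rows, cols // 4, 4)
    # The two smallest magnitudes in each group are the ones 2:4 pruning drops.
    smallest_two = np.sort(groups, axis=-1)[..., :2]
    return np.sum(smallest_two ** 2, axis=(1, 2))

def select_dense_rows(weights: np.ndarray, dense_fraction: float) -> np.ndarray:
    """Return sorted indices of the rows that lose the most under 2:4 pruning."""
    n_dense = int(round(dense_fraction * weights.shape[0]))
    scores = pruning_loss_per_row(weights)
    top = np.argsort(scores)[::-1][:n_dense]
    return np.sort(top)

# Row 2 has uniformly large weights, so dropping any two per group is costly.
w = np.full((4, 8), 0.1)
w[2] = 1.0
dense_rows = select_dense_rows(w, dense_fraction=0.25)
```

Selecting whole rows keeps the dense region contiguous enough to be handed to a regular dense GEMM, which is the point of the pattern-structuring step.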

SpenseGPT Workflow

  1. Analyze weight importance;
  2. Partition into sparse/dense regions;
  3. Apply 2:4 sparsification to the sparse region and keep the dense region intact;
  4. Optionally apply lightweight fine-tuning to restore accuracy.
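Steps 1 to 3 can be sketched as a single one-shot pass over a weight matrix. The code below is illustrative only. It assumes a row-wise partition and a magnitude-based importance score, neither of which is confirmed by the source, and the function name `spense_like_prune` is hypothetical. Step 4 (fine-tuning) is omitted.

```python
import numpy as np

def spense_like_prune(weights: np.ndarray, dense_fraction: float):
    """Score rows, keep the top rows dense, and 2:4-prune the remaining rows."""
    rows, cols = weights.shape
    groups = weights.reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]

    # Step 1: importance = squared magnitude that 2:4 pruning would remove per row.
    dropped = np.take_along_axis(groups, drop, axis=-1)
    scores = np.sum(dropped ** 2, axis=(1, 2))

    # Step 2: partition rows into a dense region and a sparse region.
    n_dense = int(round(dense_fraction * rows))
    dense_mask = np.zeros(rows, dtype=bool)
    dense_mask[np.argsort(scores)[::-1][:n_dense]] = True

    # Step 3: apply 2:4 pruning to sparse rows only; dense rows stay intact.
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    hybrid = np.where(dense_mask[:, None], weights, pruned.reshape(rows, cols))
    return hybrid, dense_mask

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 32))
w_24, _ = spense_like_prune(w, dense_fraction=0.0)        # pure 2:4 baseline
w_hybrid, dense_mask = spense_like_prune(w, dense_fraction=0.25)

err_24 = float(np.linalg.norm(w - w_24))
err_hybrid = float(np.linalg.norm(w - w_hybrid))
```

Because the dense rows are restored exactly, `err_hybrid` is smaller than `err_24`. The price is that a quarter of the rows no longer benefits from sparse kernels.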

Advantages of one-shot pruning: Fast deployment, low cost, and plug-and-play use with pre-trained models.

Section 04

Experimental Validation: Real Acceleration Effect on B200 GPUs

The method was validated on the Qwen3-32B and Seed-OSS-36B models:

  • End-to-end acceleration: 1.2x decoding speedup on B200 GPUs at FP8 precision while maintaining model accuracy;
  • Significance: The first reported end-to-end LLM acceleration on B200 via semi-structured sparse tensor cores;
  • Why not 2x (the theoretical peak gain of 2:4 sparse tensor cores over dense): Overhead of the dense regions, memory bandwidth bottlenecks, and the switching overhead of hybrid computation.
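The gap between 2x and 1.2x can be reasoned about with a simple Amdahl-style estimate. The model below is an assumption made for illustration and does not come from the source: `gemm_time_share` is the share of decode time spent in weight GEMMs, `dense_fraction` is the share of that work left dense, and `sparse_kernel_speedup` is the realized speedup of the sparse kernel. The second set of inputs is a placeholder, not a measurement.

```python
def estimated_decode_speedup(gemm_time_share: float,
                             dense_fraction: float,
                             sparse_kernel_speedup: float) -> float:
    """Amdahl-style estimate of end-to-end speedup for a hybrid sparse-dense layer stack."""
    if not (0.0 <= gemm_time_share <= 1.0 and 0.0 <= dense_fraction <= 1.0):
        raise ValueError("shares and fractions must lie in [0, 1]")
    if sparse_kernel_speedup <= 0.0:
        raise ValueError("sparse_kernel_speedup must be positive")
    # Relative GEMM time: dense part unchanged, sparse part sped up.
    gemm_time = dense_fraction + (1.0 - dense_fraction) / sparse_kernel_speedup
    # Non-GEMM time (attention, normalization, launches) is not accelerated.
    total_time = (1.0 - gemm_time_share) + gemm_time_share * gemm_time
    return 1.0 / total_time

# Ideal case: all time is GEMM, nothing stays dense, kernel hits the full 2x.
ideal = estimated_decode_speedup(1.0, 0.0, 2.0)  # 2.0

# Placeholder inputs only, chosen to show how quickly the gain shrinks.
illustrative = estimated_decode_speedup(0.8, 0.25, 1.6)
```

Each of the three factors named in the bullet above lowers one of the inputs, so the end-to-end figure falls well below the kernel-level peak even when every sparse GEMM is faster than its dense counterpart.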

Section 05

Technical Contributions and Industry Significance: A Practical and Effective LLM Acceleration Solution

Practical Level

  • First demonstration that one-shot pruning yields end-to-end acceleration on the latest GPUs such as B200;
  • Compatible with existing GEMM libraries, no need for complex compilers or special runtimes;
  • Can be directly applied to open-source models, providing a practical solution for the community.

Methodological Level

  • Demonstrates the potential of the hybrid sparse-dense format;
  • Provides a reusable framework for partitioning strategies;
  • Shows the practical value of one-shot post-training pruning, lowering the barrier to model compression.

Section 06

Limitations and Future Directions: Room for Further Optimization

Limitations

  • The 1.2x speedup is still far from the theoretical upper limit;
  • Validation focuses on 30B+ models, and the effect on smaller and larger models remains to be verified;
  • Task coverage is limited (mainly general language capabilities);
  • The method depends on the 2:4 sparse format.

Future Directions

  • Develop more intelligent dense region selection algorithms;
  • Explore dynamic sparsity rate adjustment;
  • Combine with other compression techniques like quantization;
  • Expand to more hardware platforms.

Section 07

Conclusion: Practical Value and Milestone Significance of SpenseGPT

SpenseGPT seeks a balance between efficiency and accuracy under real-world constraints. Although its 1.2x speedup is modest, it is an important milestone: it shows that post-training compression can deliver real performance gains on the latest hardware. For enterprises and developers, it is a simple, effective, and ready-to-use option for LLM inference acceleration, and every bit of efficiency improvement translates into a cost advantage.