Section 01
[Introduction] SpecSA: An Efficient LLM Inference Framework Integrating Speculative Decoding and Sparse Attention
SpecSA is an efficient LLM inference framework that integrates speculative decoding with dynamic sparse attention. Combining the two creates a structural mismatch, which SpecSA addresses with three techniques: overlap-aware grouped query execution, refresh/reuse NSA kernel fusion, and configuration-guided adaptive orchestration. Together they achieve up to a 3.49x improvement in end-to-end throughput.
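For readers new to speculative decoding, the draft-then-verify loop that the framework builds on can be sketched as below. This is a minimal, generic illustration with greedy acceptance, not SpecSA's implementation: the names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, the "models" are toy functions over integer tokens, and sparse attention is not modeled at all.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly appended tokens.

    With greedy acceptance the result always matches what the target model
    alone would have produced, and contains between 1 and k + 1 tokens.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position. A real system would
    #    score all k positions in a single batched forward pass; the loop here
    #    is only for readability.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token was accepted, so the target adds one bonus token.
    accepted.append(target(ctx))
    return accepted


def toy_target(ctx: List[int]) -> int:
    # Assumed toy behaviour: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 3, where it guesses wrong.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, k=4))
    print(speculative_step([5], toy_draft, toy_target, k=2))
```

In this sketch, throughput gains come from accepting several draft tokens per target verification. The cost of that verification pass is where an attention mechanism, sparse or dense, would enter.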