Section 01
STS Method Overview: Achieving 2.67x Inference Speedup at 90% Sparsity
STS (Sparse Attention Mechanism Combined with Speculative Decoding) dynamically constructs sparse masks using attention scores from a draft model. Without retraining large language models, it achieves 90% sparsity and a 2.67x inference speedup, providing a new solution to break through the quadratic complexity bottleneck of attention in long-context processing.