Section 01
[Introduction] Stream-CQSA: A New Method to Resolve Memory Bottlenecks in Attention Computation
This article introduces Stream-CQSA, a new attention computation method based on the theory of Cyclic Quorum Sets (CQS). It can process sequences of billions of tokens on a single GPU without running out of memory, and it keeps the attention computation exact, introducing no approximation error. The goal is to remove the memory bottleneck that limits large language models working with long contexts.
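To make the term "cyclic quorum set" concrete, here is a minimal Python sketch of the general idea, not of Stream-CQSA itself. It assumes the standard construction in which a base set is shifted cyclically modulo n, and it checks the all-pairs property: every pair of block indices lands together in at least one quorum. The function names (`cyclic_quorums`, `covers_all_pairs`) and the choice of n = 7 are illustrative assumptions. Whether the indices stand for blocks of the token sequence, and how the method schedules them, is also an assumption here.

```python
from itertools import combinations


def cyclic_quorums(base, n):
    """Return the n cyclic shifts of `base` modulo n (hypothetical helper)."""
    return [frozenset((b + i) % n for b in base) for i in range(n)]


def covers_all_pairs(quorums, n):
    """True if every unordered pair of indices in 0..n-1 shares a quorum."""
    covered = set()
    for q in quorums:
        covered.update(combinations(sorted(q), 2))
    return len(covered) == n * (n - 1) // 2


# Assumed toy setting: 7 blocks, base set {0, 1, 3}.
# Every nonzero difference modulo 7 occurs between two base elements,
# which is what makes the shifted sets cover all pairs.
n = 7
quorums = cyclic_quorums({0, 1, 3}, n)

for i, q in enumerate(quorums):
    print(i, sorted(q))
print("all pairs covered:", covers_all_pairs(quorums, n))
```

In this toy case each quorum holds only 3 of the 7 indices, yet any two indices still meet in some quorum. This is the property that would, in principle, let pairwise interactions be computed piece by piece with only a small subset resident in memory at a time.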