Section 01
K-Token Merging: An Efficient Inference Scheme for Large Models via Latent Space Sequence Compression (Introduction)
K-Token Merging is an efficient inference scheme for long-text processing in Large Language Models (LLMs). Its core idea is to merge the embedding vectors of consecutive tokens in the latent embedding space, compressing the input length by up to 75% with almost no loss in model performance. By moving beyond traditional token-space compression, the scheme mitigates the quadratic computational bottleneck of the LLM self-attention mechanism and offers a new direction for efficient inference.
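To make the core idea concrete, the sketch below merges groups of k consecutive token embeddings by averaging them; with k = 4, the sequence length shrinks by 75%. This is a minimal illustration only: the function name `merge_k_tokens` and the choice of mean pooling as the merging operator are assumptions, not the scheme's actual (unspecified here) merging function.

```python
import numpy as np

def merge_k_tokens(embeddings: np.ndarray, k: int = 4) -> np.ndarray:
    """Merge every k consecutive token embeddings into one by averaging.

    Hypothetical sketch: mean pooling stands in for the (unspecified)
    merging operator. With k=4 the sequence length drops by 75%.

    embeddings: array of shape (seq_len, dim)
    returns:    array of shape (ceil(seq_len / k), dim)
    """
    seq_len, dim = embeddings.shape
    pad = (-seq_len) % k  # zero-pad so seq_len divides evenly by k
    if pad:
        embeddings = np.concatenate(
            [embeddings, np.zeros((pad, dim))], axis=0
        )
    # Reshape to (groups, k, dim) and average each group of k tokens.
    return embeddings.reshape(-1, k, dim).mean(axis=1)

# Example: a 16-token sequence with 8-dim embeddings compresses to 4 tokens.
emb = np.random.randn(16, 8)
merged = merge_k_tokens(emb, k=4)
print(merged.shape)  # → (4, 8)
```

Because attention cost scales quadratically in sequence length, a 4x reduction of the input cuts the attention FLOPs of the compressed prefix by roughly 16x, which is where the efficiency gain comes from.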