Section 01
[Introduction] X-Comp: Extreme Video Token Compression Breaks Through the Long Video Understanding Bottleneck
Long video understanding is a core challenge for Vision-Language Models (VLMs). Because videos contain many frames and each frame produces many tokens, the LLM context window is quickly exhausted, forcing sparse frame sampling that discards temporal information. X-Comp compresses each frame down to a single token through learnable progressive token-level compression (LP-Comp) and question-conditioned frame-level compression (QC-Comp), allowing VLMs to process 2-4 times more frames within the same context budget. On the LVBench benchmark, accuracy rose from 42.9% to 46.2%, opening a new path for long video understanding.
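The two compression stages can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not X-Comp's actual implementation: it models token-level compression as attention pooling of a frame's tokens against a learned query vector (one output token per frame), and question-conditioned frame-level compression as keeping the frames whose pooled tokens are most similar to the question embedding. The function names, the single-query pooling, and the cosine-similarity selection are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_frame_tokens(frame_tokens, query_vec):
    """Attention-pool N per-frame tokens into one token (toy stand-in
    for learnable token-level compression)."""
    scores = frame_tokens @ query_vec           # (N,) relevance of each token
    weights = softmax(scores)                   # (N,) attention weights
    return weights @ frame_tokens               # (D,) single compressed token

def question_conditioned_select(frame_embs, question_emb, keep_k):
    """Keep the k frames most similar to the question embedding (toy
    stand-in for question-conditioned frame-level compression)."""
    sims = frame_embs @ question_emb / (
        np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(question_emb) + 1e-8)
    keep = np.argsort(sims)[::-1][:keep_k]      # indices of top-k frames
    return np.sort(keep)                        # restore temporal order

# Toy demo: 8 frames, 16 tokens per frame, 32-dim embeddings.
rng = np.random.default_rng(0)
T, N, D = 8, 16, 32
video = rng.normal(size=(T, N, D))
learned_query = rng.normal(size=D)   # stands in for a learned compression query
question = rng.normal(size=D)        # stands in for the encoded user question

compressed = np.stack([compress_frame_tokens(f, learned_query) for f in video])
kept = question_conditioned_select(compressed, question, keep_k=4)
print(compressed.shape, kept.shape)  # (8, 32) (4,)
```

The first stage reduces the per-frame cost from N tokens to 1, and the second drops frames irrelevant to the question, which together is what frees context budget for several times more input frames.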