Zing Forum

X-Comp: Extreme Video Token Compression Technology Achieves New Breakthroughs in Long Video Understanding

X-Comp achieves extreme compression of one token per frame through learnable progressive token-level compression and question-conditioned frame-level compression, enabling VLMs to process 2-4 times more frames and increasing accuracy from 42.9% to 46.2% on LVBench.

Video Understanding · Token Compression · VLM · X-Comp · Long Video · Vision-Language Model · Attention Mechanism
Published 2026-04-16 01:59 · Recent activity 2026-04-16 11:49 · Estimated read 6 min

Section 01

Introduction: X-Comp's Extreme Video Token Compression Breaks Through Long Video Understanding Bottlenecks

Long video understanding is a core challenge for Vision-Language Models (VLMs). Because videos contain many frames and each frame yields many tokens, the LLM context window fills quickly, forcing sparse sampling that discards temporal information. X-Comp achieves extreme compression of one token per frame through learnable progressive token-level compression (LP-Comp) and question-conditioned frame-level compression (QC-Comp), enabling VLMs to process 2-4 times more frames. Its accuracy increased from 42.9% to 46.2% on the LVBench benchmark, opening a new path for long video understanding.

Section 02

Core Dilemmas of Long Video Understanding and Limitations of Traditional Compression

Core Contradiction of Long Video Understanding

Current VLMs face a contradiction: they must capture dynamics across many frames, yet the LLM context window is limited. A few minutes of video contains thousands of frames; at 100 tokens per frame, the visual input alone requires hundreds of thousands of tokens, exhausting the context and forcing sparse sampling that loses temporal information.
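The context arithmetic above can be sketched with a few lines of back-of-envelope code. The frame rate and context window size here are illustrative assumptions, not figures from the paper; only the 100-tokens-per-frame number comes from the text.

```python
# Back-of-envelope context budget for long video input.
# Assumed: 30 fps and a 32k-token context window (illustrative values);
# 100 tokens per frame is the figure used in the text.
FPS = 30
TOKENS_PER_FRAME = 100
CONTEXT_WINDOW = 32_000

def visual_tokens(minutes: float, fps: int = FPS,
                  tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Total visual tokens if every frame is encoded without compression."""
    return int(minutes * 60 * fps) * tokens_per_frame

def max_frames(context: int = CONTEXT_WINDOW,
               tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """How many frames fit in the context at a given per-frame token cost."""
    return context // tokens_per_frame

print(visual_tokens(5))                  # 5 min of video -> 900,000 tokens
print(max_frames())                      # 320 frames at 100 tokens/frame
print(max_frames(tokens_per_frame=1))    # 32,000 frames at 1 token/frame
```

At one token per frame, the same context window holds two orders of magnitude more frames, which is the headroom X-Comp's extreme compression targets.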

Limitations of Heuristic Compression

Traditional heuristic compression methods (such as frame selection based on visual similarity, or fixed-interval sampling) lack downstream task awareness; a single fixed strategy struggles to adapt to different query needs. Moreover, they are non-learnable and cannot be optimized through training, limiting how much compression quality can improve.
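For concreteness, minimal sketches of the two heuristics mentioned above are shown below. Frames are represented by feature vectors; both functions are query-agnostic, which is exactly the limitation the text points out. All names and thresholds here are illustrative.

```python
# Toy versions of two query-agnostic heuristics: fixed-interval sampling
# and similarity-based frame filtering. Frames are feature vectors
# (lists of floats); the 0.95 threshold is an assumed hyperparameter.
import math

def fixed_interval_sample(frames, k):
    """Keep at most k frames at a fixed stride, ignoring content and query."""
    step = max(1, len(frames) // k)
    return frames[::step][:k]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_filter(frames, threshold=0.95):
    """Drop each frame that is too similar to the last kept frame."""
    kept = [frames[0]]
    for f in frames[1:]:
        if cosine(f, kept[-1]) < threshold:
            kept.append(f)
    return kept
```

Neither function can know that, say, a question about a brief on-screen event makes visually "redundant" frames important, which is why a fixed strategy fails across queries.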

Section 03

X-Comp's Two-Layer Compression Architecture: Innovative Combination of Token-Level and Frame-Level Compression

X-Comp adopts a two-layer compression architecture, combining token-level and frame-level compression:

Learnable Progressive Token-Level Compression (LP-Comp)

LP-Comp converts selected layers of the LLM into learnable progressive compression modules, optimized via supervised learning, that hierarchically extract features from low-level textures to high-level semantics, enabling VLMs to process 2-4 times more frames while maintaining performance.
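A hedged sketch of the progressive compression idea follows. The real LP-Comp modules live inside the LLM and are trained end-to-end; this toy version only shows the shape bookkeeping of merging token groups stage by stage, with a random vector standing in for learned parameters.

```python
# Illustrative sketch (not X-Comp's implementation): each stage pools
# groups of `ratio` tokens into one token via a softmax-weighted sum,
# where the scoring vector `w` stands in for learnable parameters.
import numpy as np

rng = np.random.default_rng(0)

def compress_tokens(tokens: np.ndarray, ratio: int, w: np.ndarray) -> np.ndarray:
    """Merge each group of `ratio` tokens into one weighted summary token.

    tokens: (n_tokens, dim); w: (dim,) assumed learnable scoring vector.
    """
    n, d = tokens.shape
    n_groups = n // ratio
    groups = tokens[: n_groups * ratio].reshape(n_groups, ratio, d)
    scores = groups @ w                                     # (n_groups, ratio)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights[..., None] * groups).sum(axis=1)        # (n_groups, dim)

# Progressive schedule: 100 tokens/frame -> 10 -> 1 across two stages.
x = rng.standard_normal((100, 64))       # one frame's visual tokens
w = rng.standard_normal(64)              # stands in for learned parameters
stage1 = compress_tokens(x, ratio=10, w=w)       # shape (10, 64)
stage2 = compress_tokens(stage1, ratio=10, w=w)  # shape (1, 64)
```

Applying the compression in stages rather than in one step mirrors the "progressive" aspect: early stages can preserve low-level detail while later stages keep only high-level semantics.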

Question-Conditioned Frame-Level Compression (QC-Comp)

QC-Comp uses the LLM's internal attention scores to identify the frames most relevant to the query and prioritizes retaining them, so the same video is processed adaptively for different questions.
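The selection rule can be illustrated as follows. How X-Comp actually reads attention scores out of the LLM is an internal detail of the method; this sketch simply treats question-to-frame attention as a relevance signal and keeps the top-k frames.

```python
# Illustrative sketch: score each frame by the attention mass it receives
# from the question tokens, then keep the k most relevant frames in
# temporal order. Function and variable names are assumptions.
import numpy as np

def select_frames(frame_feats: np.ndarray, question_feats: np.ndarray,
                  k: int) -> np.ndarray:
    """Return indices of the k frames most attended to by the question.

    frame_feats: (n_frames, dim), e.g. one token per frame after compression;
    question_feats: (n_q, dim).
    """
    logits = question_feats @ frame_feats.T                 # (n_q, n_frames)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                 # softmax per query token
    relevance = attn.sum(axis=0)                            # per-frame score
    top = np.argsort(relevance)[-k:]
    return np.sort(top)                                     # restore temporal order
```

Because the scores depend on the question embedding, two different questions about the same video keep different frame subsets, which is the adaptivity the text describes.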

Mitigating Position Bias

X-Comp splits long videos into short segments and applies local attention within each, reducing interference from long-distance dependencies while balancing global understanding and local perception.
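The segment-and-local-attention idea reduces to two small pieces of bookkeeping, sketched below under the assumption of a fixed segment length (a hyperparameter not specified in the text).

```python
# Illustrative sketch: chunk the frame sequence into consecutive segments
# and build a mask where a frame may only attend within its own segment.
# `seg_len` is an assumed hyperparameter.
def split_segments(frames: list, seg_len: int) -> list:
    """Chunk a frame list into consecutive segments of length seg_len."""
    return [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]

def local_attention_mask(n_frames: int, seg_len: int) -> list:
    """mask[i][j] is True iff frames i and j fall in the same segment."""
    return [[i // seg_len == j // seg_len for j in range(n_frames)]
            for i in range(n_frames)]
```

Restricting attention this way keeps each frame's position offsets small, which mitigates the position bias that very long sequences otherwise induce.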

Section 04

Performance Verification: Data-Efficient Tuning and Accuracy Improvement

X-Comp is fine-tuned based on the VideoChat-Flash model, using a data-efficient supervised compression tuning strategy: it only requires 2.5% of the data used in standard supervised fine-tuning, yet brings significant performance improvements. On the LVBench benchmark, accuracy increased from 42.9% to 46.2%, verifying that compression tuning can focus on key information and enhance understanding capabilities.

Section 05

Technical Significance and Future Application Prospects

Technical Significance

  1. Learnable compression outperforms heuristic methods: it is integrated into an end-to-end training framework and optimized for the downstream task;
  2. Hierarchical compression is effective: token-level and frame-level compression reduce redundancy at different granularities;
  3. Adaptive processing is key: dynamically allocating attention based on the question is more flexible and efficient than a fixed strategy.

Application Prospects

This technology is expected to be applied to long video scenarios such as video surveillance analysis, educational content understanding, and sports event commentary. In the future, VLMs will be able to process longer videos while maintaining understanding accuracy.