Zing Forum

X-Comp: Extreme Video Token Compression Technology Achieves New Breakthroughs in Long Video Understanding

X-Comp achieves extreme compression of one token per frame through learnable progressive token-level compression and question-conditioned frame-level compression, enabling VLMs to process 2-4 times more frames and increasing accuracy from 42.9% to 46.2% on LVBench.

Video Understanding · Token Compression · VLM · X-Comp · Long Video · Vision-Language Model · Attention Mechanism
Published 2026-04-16 01:59 · Recent activity 2026-04-16 11:49 · Estimated read 6 min

Section 01

Introduction: X-Comp's Extreme Video Token Compression Breaks Through Long Video Understanding Bottlenecks

Long video understanding is a core challenge for Vision-Language Models (VLMs). Because videos contain many frames and each frame yields many tokens, the LLM context window fills quickly, forcing sparse sampling that discards temporal information. X-Comp achieves extreme compression of one token per frame through learnable progressive token-level compression (LP-Comp) and question-conditioned frame-level compression (QC-Comp), enabling VLMs to process 2-4 times more frames. Its accuracy increased from 42.9% to 46.2% on the LVBench benchmark, opening a new path for long video understanding.

Section 02

Core Dilemmas of Long Video Understanding and Limitations of Traditional Compression

Core Contradiction of Long Video Understanding

Current VLMs face a contradiction: they must capture dynamics across many frames, yet the LLM context window is limited. A few minutes of video contains thousands of frames; at 100 tokens per frame, the visual input alone requires hundreds of thousands of tokens, exhausting the context and forcing sparse sampling that loses temporal information.
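The context arithmetic above can be sketched with a few lines of back-of-envelope code. The frame rate and context window size here are illustrative assumptions, not figures from the paper; only the 100-tokens-per-frame number comes from the text.

```python
# Back-of-envelope context budget for long video input.
# Assumed: 30 fps and a 32k-token context window (illustrative values);
# 100 tokens per frame is the figure used in the text.
FPS = 30
TOKENS_PER_FRAME = 100
CONTEXT_WINDOW = 32_000

def visual_tokens(minutes: float, fps: int = FPS,
                  tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Total visual tokens if every frame is encoded without compression."""
    return int(minutes * 60 * fps) * tokens_per_frame

def max_frames(context: int = CONTEXT_WINDOW,
               tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """How many frames fit in the context at a given per-frame token cost."""
    return context // tokens_per_frame

print(visual_tokens(5))                  # 5 min of video -> 900,000 tokens
print(max_frames())                      # 320 frames at 100 tokens/frame
print(max_frames(tokens_per_frame=1))    # 32,000 frames at 1 token/frame
```

At one token per frame, the same context window holds two orders of magnitude more frames, which is the headroom X-Comp's extreme compression targets.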

Limitations of Heuristic Compression

Traditional heuristic compression methods (such as frame selection based on visual similarity, or fixed-interval sampling) lack downstream task awareness; a single fixed strategy struggles to adapt to different query needs. Moreover, they are non-learnable and cannot be optimized through training, limiting how much compression quality can improve.
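For concreteness, minimal sketches of the two heuristics mentioned above are shown below. Frames are represented by feature vectors; both functions are query-agnostic, which is exactly the limitation the text points out. All names and thresholds here are illustrative.

```python
# Toy versions of two query-agnostic heuristics: fixed-interval sampling
# and similarity-based frame filtering. Frames are feature vectors
# (lists of floats); the 0.95 threshold is an assumed hyperparameter.
import math

def fixed_interval_sample(frames, k):
    """Keep at most k frames at a fixed stride, ignoring content and query."""
    step = max(1, len(frames) // k)
    return frames[::step][:k]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_filter(frames, threshold=0.95):
    """Drop each frame that is too similar to the last kept frame."""
    kept = [frames[0]]
    for f in frames[1:]:
        if cosine(f, kept[-1]) < threshold:
            kept.append(f)
    return kept
```

Neither function can know that, say, a question about a brief on-screen event makes visually "redundant" frames important, which is why a fixed strategy fails across queries.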

Section 03

X-Comp's Two-Layer Compression Architecture: Innovative Combination of Token-Level and Frame-Level Compression

X-Comp adopts a two-layer compression architecture, combining token-level and frame-level compression:

Learnable Progressive Token-Level Compression (LP-Comp)

LP-Comp converts selected layers of the LLM into learnable progressive compression modules, optimized via supervised learning, that hierarchically extract features from low-level textures to high-level semantics, enabling VLMs to process 2-4 times more frames while maintaining performance.
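A hedged sketch of the progressive compression idea follows. The real LP-Comp modules live inside the LLM and are trained end-to-end; this toy version only shows the shape bookkeeping of merging token groups stage by stage, with a random vector standing in for learned parameters.

```python
# Illustrative sketch (not X-Comp's implementation): each stage pools
# groups of `ratio` tokens into one token via a softmax-weighted sum,
# where the scoring vector `w` stands in for learnable parameters.
import numpy as np

rng = np.random.default_rng(0)

def compress_tokens(tokens: np.ndarray, ratio: int, w: np.ndarray) -> np.ndarray:
    """Merge each group of `ratio` tokens into one weighted summary token.

    tokens: (n_tokens, dim); w: (dim,) assumed learnable scoring vector.
    """
    n, d = tokens.shape
    n_groups = n // ratio
    groups = tokens[: n_groups * ratio].reshape(n_groups, ratio, d)
    scores = groups @ w                                     # (n_groups, ratio)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights[..., None] * groups).sum(axis=1)        # (n_groups, dim)

# Progressive schedule: 100 tokens/frame -> 10 -> 1 across two stages.
x = rng.standard_normal((100, 64))       # one frame's visual tokens
w = rng.standard_normal(64)              # stands in for learned parameters
stage1 = compress_tokens(x, ratio=10, w=w)       # shape (10, 64)
stage2 = compress_tokens(stage1, ratio=10, w=w)  # shape (1, 64)
```

Applying the compression in stages rather than in one step mirrors the "progressive" aspect: early stages can preserve low-level detail while later stages keep only high-level semantics.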

Question-Conditioned Frame-Level Compression (QC-Comp)

QC-Comp uses the LLM's internal attention scores to identify the frames most relevant to the query and prioritizes retaining them, so the same video is processed adaptively for different questions.
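The selection rule can be illustrated as follows. How X-Comp actually reads attention scores out of the LLM is an internal detail of the method; this sketch simply treats question-to-frame attention as a relevance signal and keeps the top-k frames.

```python
# Illustrative sketch: score each frame by the attention mass it receives
# from the question tokens, then keep the k most relevant frames in
# temporal order. Function and variable names are assumptions.
import numpy as np

def select_frames(frame_feats: np.ndarray, question_feats: np.ndarray,
                  k: int) -> np.ndarray:
    """Return indices of the k frames most attended to by the question.

    frame_feats: (n_frames, dim), e.g. one token per frame after compression;
    question_feats: (n_q, dim).
    """
    logits = question_feats @ frame_feats.T                 # (n_q, n_frames)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                 # softmax per query token
    relevance = attn.sum(axis=0)                            # per-frame score
    top = np.argsort(relevance)[-k:]
    return np.sort(top)                                     # restore temporal order
```

Because the scores depend on the question embedding, two different questions about the same video keep different frame subsets, which is the adaptivity the text describes.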

Mitigating Position Bias

X-Comp splits long videos into short segments and applies local attention within each, reducing interference from long-distance dependencies while balancing global understanding and local perception.
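The segment-and-local-attention idea reduces to two small pieces of bookkeeping, sketched below under the assumption of a fixed segment length (a hyperparameter not specified in the text).

```python
# Illustrative sketch: chunk the frame sequence into consecutive segments
# and build a mask where a frame may only attend within its own segment.
# `seg_len` is an assumed hyperparameter.
def split_segments(frames: list, seg_len: int) -> list:
    """Chunk a frame list into consecutive segments of length seg_len."""
    return [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]

def local_attention_mask(n_frames: int, seg_len: int) -> list:
    """mask[i][j] is True iff frames i and j fall in the same segment."""
    return [[i // seg_len == j // seg_len for j in range(n_frames)]
            for i in range(n_frames)]
```

Restricting attention this way keeps each frame's position offsets small, which mitigates the position bias that very long sequences otherwise induce.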

Section 04

Performance Verification: Data-Efficient Tuning and Accuracy Improvement

X-Comp is fine-tuned based on the VideoChat-Flash model, using a data-efficient supervised compression tuning strategy: it only requires 2.5% of the data used in standard supervised fine-tuning, yet brings significant performance improvements. On the LVBench benchmark, accuracy increased from 42.9% to 46.2%, verifying that compression tuning can focus on key information and enhance understanding capabilities.

Section 05

Technical Significance and Future Application Prospects

Technical Significance

  1. Learnable compression outperforms heuristic methods: it is integrated into an end-to-end training framework and optimized for the downstream task;
  2. Hierarchical compression is effective: token-level and frame-level compression reduce redundancy at different granularities;
  3. Adaptive processing is key: dynamically allocating attention based on the question is more flexible and efficient than a fixed strategy.

Application Prospects

This technology is expected to be applied to long video scenarios such as video surveillance analysis, educational content understanding, and sports event commentary. In the future, VLMs will be able to process longer videos while maintaining understanding accuracy.